SSD Deployment Strategies for MySQL Yoshinori Matsunobu Lead of MySQL Professional Services APAC Sun Microsystems Yoshinori.Matsunobu@sun.com Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 1
What do you need to consider? (H/W layer) • SSD or HDD? • Interface – SATA/SAS or PCI-Express? • RAID – H/W RAID, S/W RAID or JBOD? • Network – Is 1GbE enough? • Memory – Is 2GB RAM + PCI-E SSD faster than 64GB RAM + 8HDDs? • CPU – Nehalem or older Xeon?
What do you need to consider? • Redundancy – RAID – DRBD (network mirroring) – Semi-Sync MySQL Replication – Async MySQL Replication • Filesystem – ext3, xfs, raw device ? • File location – Data file, Redo log file, etc • SSD specific issues – Write performance deterioration – Write endurance
Why SSD? IOPS! • IOPS: Number of (random) disk i/o operations per second • Almost all database operations require random access – Selecting records by index scan – Updating records – Deleting records – Modifying indexes • Regular SAS HDD : 200 iops per drive (disk seek & rotation is slow) • SSD : 2,000+ (writes) / 5,000+ (reads) per drive – depends heavily on the SSD and its device driver • Let’s start with basic benchmarks
Tested HDD/SSD for this session • SSD – Intel X25-E (SATA, 30GB, SLC) – Fusion I/O (PCI-Express, 160GB, SLC) • HDD – Seagate 160GB SAS 15000RPM
Table of contents • Basic Performance on SSD/HDD – Random Reads – Random Writes – Sequential Reads – Sequential Writes – fsync() speed – Filesystem difference – IOPS and I/O unit size • MySQL Deployments
Random Read benchmark [Chart: Direct Random Read IOPS (Single Drive, 16KB, xfs); x-axis: # of I/O threads (1 to 200); y-axis: IOPS (0 to 45000); series: HDD, Intel SSD, Fusion I/O] • HDD: 196 reads/s at 1 i/o thread, 443 reads/s at 100 i/o threads • Intel : 3508 reads/s at 1 i/o thread, 14538 reads/s at 100 i/o threads • Fusion I/O : 10526 reads/s at 1 i/o thread, 41379 reads/s at 100 i/o threads • Single-thread throughput on Intel is about 18x that of HDD (3508 vs 196), and on Fusion I/O about 54x (10526 vs 196) • SSD’s concurrency scaling from 1 to 100 threads (about 4x) is much better than HDD’s (about 2.3x) • Very strong reason to use SSD
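The ratios on this slide can be rechecked from the quoted reads/s figures. The following is a minimal sketch; the dictionary layout and function names are illustrative and not part of the original benchmark.

```python
# Reads/s quoted on the slide, at 1 and at 100 I/O threads.
READS_PER_SEC = {
    "HDD": (196, 443),
    "Intel SSD": (3508, 14538),
    "Fusion I/O": (10526, 41379),
}


def scaling(one_thread, hundred_threads):
    """Throughput growth when going from 1 to 100 I/O threads."""
    return hundred_threads / one_thread


def speedup_vs_hdd(device):
    """Single-thread throughput of a device relative to the HDD."""
    return READS_PER_SEC[device][0] / READS_PER_SEC["HDD"][0]


if __name__ == "__main__":
    for name, (t1, t100) in READS_PER_SEC.items():
        print(f"{name}: {speedup_vs_hdd(name):.1f}x HDD at 1 thread, "
              f"{scaling(t1, t100):.1f}x from 1 to 100 threads")
```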
High Concurrency • A single SSD drive has multiple NAND Flash Memory chips (e.g. 40 x 4GB Flash Memory = 160GB) • Concurrency depends heavily on the I/O controller and the application – A single-threaded application cannot gain the concurrency advantage
PCI-Express SSD [Diagram: two paths from the CPU to flash. PCI-Express Controller to SSD I/O Controller to Flash: 2GB/s (PCI-Express x 8). SAS/SATA Controller to SSD I/O Controller to Flash: 300MB/s. Also shown: North Bridge, South Bridge] • Advantage – PCI-Express is much faster interface than SAS/SATA • (current) Disadvantages – Most motherboards have limited # of PCI-E slots – No hot swap mechanism
Write performance on SSD [Chart: Random Write IOPS (16KB Blocks); y-axis: IOPS (0 to 20000); series: 1 i/o thread, 100 i/o threads; devices: HDD (4 RAID10, xfs), Intel (xfs), Fusion (xfs)] • Very strong reason to use SSD • But wait.. Can we get a high write throughput *anytime*? – Not always.. Let’s check how data is written to Flash Memory
Understanding how data is written to SSD (1) [Diagram: flash memory chips containing blocks, some of them empty; each block is made of pages] • A single SSD drive consists of many flash memory chips (e.g. 2GB each) • A flash memory chip internally consists of many blocks (e.g. 512KB each) • A block internally consists of many pages (e.g. 4KB each) • It is *not* possible to overwrite a non-empty block – Reading from pages is possible – Writing to pages in an empty block is possible – Appending is possible – Overwriting pages in a non-empty block is *not* possible
Understanding how data is written to SSD (2) [Diagram: new data cannot overwrite pages in a used block (×) and goes to an empty block instead] • Overwriting a non-empty block is not possible • New data is written to an empty block instead • Writing to an empty block is fast (~200 microseconds) • Even though applications write to the same positions in the same files (e.g. InnoDB Log File), written pages/blocks are distributed (Wear-Leveling)
Understanding how data is written to SSD (3) [Diagram: 1. Reading all pages, 2. Erasing the block, 3. Writing all data together with the new page] • In the long run, almost all blocks will be fully used – e.g. Allocating 158GB files on 160GB SSD • New empty block must be allocated on writes • Basic steps to write new data: – 1. Reading all pages from a block – 2. ERASE the block – 3. Writing all data w/ new data into the block • ERASE is very expensive operation (takes a few milliseconds) • At this stage, write performance becomes very slow because of massive ERASE operations
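The three write steps can be illustrated with a toy model of a single flash block. This is only a sketch: the class and function names are hypothetical, the 128 pages per block follows from the example sizes quoted earlier (512KB block, 4KB page), and a real flash controller is far more involved.

```python
PAGES_PER_BLOCK = 128  # 512KB block / 4KB page, using the example sizes above


class Block:
    """Toy flash block: a page can be written once, then only erased."""

    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count = 0

    def program(self, index, data):
        # Flash cannot overwrite a page that already holds data.
        if self.pages[index] is not None:
            raise ValueError("page already written; erase the block first")
        self.pages[index] = data

    def erase(self):
        # ERASE works on the whole block and is the expensive step.
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1


def rewrite_page(block, index, data):
    """Change one page in a used block via read, erase, write-back."""
    snapshot = list(block.pages)        # 1. read all pages from the block
    block.erase()                       # 2. ERASE the block
    snapshot[index] = data
    for i, page in enumerate(snapshot): # 3. write all data, including the new page
        if page is not None:
            block.program(i, page)
```

Changing a single 4KB page therefore costs one full block erase plus a rewrite of every live page in that block, which is why throughput drops once no empty blocks remain.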
[Diagram: Data Space and Reserved Space, with empty blocks; 1. Reading pages, 2. Writing data (with new data); background jobs ERASE unused blocks] • To keep write performance high enough, SSDs have a “reserved space” feature • Data size visible to applications is limited to the size of data space – e.g. 160GB SSD, 120GB data space, 40GB reserved space • Fusion I/O can change the reserved space size – # fio-format -s 96G /dev/fct0
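The split between data space and reserved space is simple arithmetic. The sketch below uses hypothetical helper names and counts only the user-configurable split; any reserve the drive keeps internally is ignored.

```python
def reserved_space_gb(raw_gb, data_gb):
    """Reserved space left when data_gb of a raw_gb drive is exposed to applications."""
    if not 0 < data_gb <= raw_gb:
        raise ValueError("data space must be positive and no larger than the drive")
    return raw_gb - data_gb


def reserved_ratio(raw_gb, data_gb):
    """Reserved space as a fraction of the raw capacity."""
    return reserved_space_gb(raw_gb, data_gb) / raw_gb


if __name__ == "__main__":
    # Data space sizes that appear in this deck for the 160GB Fusion I/O drive.
    for data_gb in (150, 120, 96, 80):
        print(f"{data_gb}GB data space: {reserved_space_gb(160, data_gb)}GB reserved "
              f"({reserved_ratio(160, data_gb):.0%})")
```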
Write performance deterioration [Chart: Write IOPS deterioration (16KB random write); y-axis: IOPS (0 to 30000); series: Fastest, Slowest; devices: Intel, Fusion(150G), Fusion(120G), Fusion(96G), Fusion(80G); annotations: “Continuous write-intensive workloads”, “Stopping writing for a while”] • At the beginning, write IOPS was close to the “Fastest” line • When massive writes happened, write IOPS gradually deteriorated toward the “Slowest” line (because massive ERASE happened) • Increasing reserved space improves steady-state write throughput • Write IOPS recovered to “Fastest” after writes were stopped for a long time (many blocks were ERASEd by a background job) • Depends heavily on the flash memory and I/O controller (TRIM support, ERASE scheduling, etc)
Sequential I/O [Chart: Sequential Read/Write throughput (1MB consecutive reads/writes); y-axis: MB/s (0 to 600); series: Seq read, Seq write; devices: 4 HDD (raid10, xfs), Intel (xfs), Fusion (xfs)] • Typical scenario: Full table scan (read), logging/journaling (write) • SSD outperforms HDD for sequential reads, but less significantly than for random reads • HDD (4 RAID10) is fast enough for sequential i/o • Data transfer size by sequential writes tends to be huge, so you need to care about write deterioration on SSD • No strong reason to use SSD for sequential writes
fsync() speed [Chart: fsync speed; y-axis: fsync/sec (0 to 20000); series: 1KB, 8KB, 16KB; devices: HDD (xfs), Intel (xfs), Fusion I/O (xfs)] • 10,000+ fsync/sec is fine in most cases • Fusion I/O was CPU bound (%system), not I/O bound (%iowait).
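A rough way to measure fsync/sec on a given filesystem is to append a small buffer and call fsync() in a loop. This is a minimal sketch, not the tool used for the chart above; the function name and defaults are assumptions, and results depend heavily on write caches and the filesystem.

```python
import os
import tempfile
import time


def fsync_rate(write_size=1024, iterations=200, directory=None):
    """Append write_size bytes and fsync() each time; return fsyncs per second."""
    payload = b"\0" * write_size
    # The temporary file is created in `directory` (or the default temp dir).
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return iterations / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Same write sizes as the chart: 1KB, 8KB, 16KB.
    for size in (1024, 8192, 16384):
        print(f"{size // 1024}KB: {fsync_rate(size):.0f} fsync/sec")
```

Pass `directory` to point the test at the mount being evaluated, for example a directory on the SSD or on the HDD behind a battery-backed write cache.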
HDD is fast for sequential writes / fsync • Best Practice: Writes can be boosted by using BBWC (Battery Backed up Write Cache), especially for REDO Logs (because they are written sequentially) • No strong reason to use SSDs here [Diagram: writing to disk with and without a write cache; seek & rotation time]
Filesystem matters [Chart: Random write iops (16KB Blocks); x-axis: Filesystem (Fusion(ext3), Fusion(xfs), Fusion(raw)); y-axis: iops (0 to 20000); series: 1 thread, 16 thread] • On xfs, multiple threads can write to the same file concurrently if it is opened with O_DIRECT, but they cannot on ext* • Good concurrency on xfs, close to raw device • ext3 is less optimized for Fusion I/O
Changing I/O unit size [Chart: Read IOPS and I/O unit size (4 HDD RAID10); x-axis: concurrency (1 to 200); y-axis: IOPS (0 to 2500); series: 1KB, 4KB, 16KB] • On HDD, the maximum performance difference between 1KB and 16KB was 22% • No big difference when concurrency < 10
Changing I/O unit size on SSD [Chart: Read IOPS and I/O unit size (Fusion I/O); x-axis: concurrency (1 to 200); y-axis: IOPS (0 to 200000); series: 1KB, 4KB, 16KB] • Huge difference • On SSDs, not only IOPS, but also I/O transfer size matters • It’s worth considering support for “configurable block size” in Storage Engines
Let’s start MySQL benchmarking • Base: Disk-bound application (DBT-2) running on: – Sun Fire X4270 – Nehalem 8 Core – 4 HDD – RAID1+0, Write Cache with Battery • What will happen if … – Replacing HDD with Intel SSD (SATA) – Replacing HDD with Fusion I/O (PCI-E) – Moving log files and ibdata to HDD – Not using Nehalem – Using two Fusion I/O drives with Software RAID1 – Deploying DRBD protocol B or C • Replacing 1GbE with 10GbE – Using MySQL 5.5.4
DBT-2 condition • SuSE Enterprise Linux 11, xfs • MySQL 5.5.2M2 (InnoDB Plugin 1.0.6) • 200 Warehouses (20GB – 25GB hot data) • Buffer pool size – 1GB – 2GB – 5GB – 30GB (large enough to cache all data) • 1000 seconds warm up time • Running 3600 seconds (1 hour) • Fusion I/O: 96GB data space, 64GB reserved space
HDD vs Intel SSD
(NOTPM: Transactions per minute)
Buffer pool | HDD | Intel
1G | 1125.44 | 5709.06
• Storing all data on HDD or Intel SSD • Massive disk i/o happens – Random reads for all accesses – Random writes for updating rows and indexes – Sequential writes for REDO log files, etc • SSD is very good at these kinds of workloads • About 5x performance improvement (5709.06 vs 1125.44), without any application change!
HDD vs Intel SSD vs Fusion I/O
Buffer pool | HDD | Intel | Fusion I/O
1G | 1125.44 | 5709.06 | 15122.75
• Fusion I/O is a PCI-E based SSD • PCI-E is much faster than SAS/SATA • About 13x improvement compared to 4 HDDs (15122.75 vs 1125.44)
Which should we spend money on, RAM or SSD?
Buffer pool | HDD | Intel | Fusion I/O
1G | 1125.44 | 5709.06 | 15122.75
2G | 1863.19 | |
5G | 4385.18 | |
30G (caching all hot data) | 36784.76 | |
• Increasing RAM (buffer pool size) reduces random disk reads – Because more data are cached in the buffer pool • If all data are cached, only disk writes (both random and sequential) happen • Disk writes happen asynchronously, so application queries can be much faster • Large enough RAM + HDD outperforms too small RAM + SSD
Which should we spend money on, RAM or SSD?
Buffer pool | HDD | Intel | Fusion I/O
1G | 1125.44 | 5709.06 | 15122.75
2G | 1863.19 | 7536.55 | 20096.33
5G | 4385.18 | 12892.56 | 30846.34
30G (caching all hot data) | 36784.76 | - | 57441.64
• It is not always possible to cache all hot data • Fusion I/O + good amount of memory (5GB) was pretty good • Basic rule can be: – If you can cache all active data, large enough RAM + HDD – If you can’t, or if you need extremely high throughput, spending on both RAM and SSD
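The basic rule above can be written down as a small decision function. This is a sketch only: the function and parameter names are hypothetical, and comparing hot data size directly against RAM ignores memory needed for anything other than the buffer pool.

```python
def storage_recommendation(hot_data_gb, affordable_ram_gb, need_extreme_throughput=False):
    """Apply the slide's rule of thumb for spending on RAM versus SSD."""
    can_cache_all = hot_data_gb <= affordable_ram_gb
    if can_cache_all and not need_extreme_throughput:
        # All active data fits in memory, so only writes reach the disks.
        return "large RAM + HDD"
    # Hot data does not fit, or throughput needs are extreme.
    return "RAM + SSD"


if __name__ == "__main__":
    # The DBT-2 setup in this deck has roughly 20GB to 25GB of hot data.
    print(storage_recommendation(25, 30))        # hot data fits in RAM
    print(storage_recommendation(25, 5))         # hot data does not fit
    print(storage_recommendation(25, 30, True))  # fits, but throughput is critical
```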
Let’s think about MySQL file location • SSD is extremely good at random reads • SSD is very good at random writes • HDD is good enough at sequential reads/writes • No strong reason to use SSD for sequential writes • Random I/O oriented: – Data Files (*.ibd) • Sequential reads if doing full table scan – Undo Log, Insert Buffer (ibdata) • UNDO tablespace (small in most cases, except when running long batch jobs) • On-disk insert buffer space (small in most cases, except when InnoDB cannot keep up with updating indexes) • Sequential Write oriented: – Doublewrite Buffer (ibdata) • Write volume is equal to that of *.ibd files. Huge – Binary log (mysql-bin.XXXXXX) – Redo log (ib_logfile) – Backup files
Moving sequentially written files into HDD
Buffer pool | Fusion I/O | Fusion I/O + HDD | Up
1G | 15122.75 (us=25%, wa=15%) | 19295.94 (us=32%, wa=10%) | +28%
2G | 20096.33 (us=30%, wa=12.5%) | 25627.49 (us=36%, wa=8%) | +28%
5G | 30846.34 (us=39%, wa=10%) | 39435.25 (us=49%, wa=6%) | +28%
30G | 57441.64 (us=70%, wa=3.5%) | 66053.68 (us=77%, wa=1%) | +15%
• Moving ibdata, ib_logfile, (+binary logs) into HDD • High impact on performance – Write volume to SSD is halved because the doublewrite area is allocated on HDD – %iowait was significantly reduced – You can delay write performance deterioration
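One way to express this file layout is through MySQL's directory options. The sketch below renders a my.cnf fragment; the /ssd/mysql and /hdd/mysql paths and the function name are hypothetical, and the slides do not say which options the benchmark actually used.

```python
def render_my_cnf(ssd_dir="/ssd/mysql", hdd_dir="/hdd/mysql"):
    """Return a [mysqld] fragment that splits random and sequential I/O files."""
    settings = [
        ("datadir", ssd_dir),                    # per-table *.ibd files: random I/O
        ("innodb_file_per_table", "1"),          # keep table data out of ibdata
        ("innodb_data_home_dir", hdd_dir),       # ibdata (doublewrite, undo, insert buffer)
        ("innodb_log_group_home_dir", hdd_dir),  # ib_logfile* redo logs
        ("log-bin", f"{hdd_dir}/mysql-bin"),     # binary logs
    ]
    lines = ["[mysqld]"] + [f"{key} = {value}" for key, value in settings]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_my_cnf(), end="")
```

Keeping the two locations as parameters makes it easy to generate the same layout for hosts whose mount points differ.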
Does CPU matter? [Diagram: Nehalem: CPUs with directly attached Memory, QPI 25.6GB/s to North Bridge (IOH), then PCI-Express. Older Xeon: CPUs, FSB 10.6GB/s to North Bridge (MCH) with attached Memory, then PCI-Express] • Nehalem has two big advantages 1. Memory is directly attached to the CPU : Faster for in-memory workloads 2. Interface speed between CPU and North Bridge is 2.5x higher, and interface traffic does not conflict with CPU<->Memory workloads : Faster for disk i/o workloads when using PCI-Express SSDs
Harpertown X5470 (older Xeon) vs Nehalem X5570 (HDD)
Buffer pool | HDD, Harpertown (X5470, 3.33GHz) | HDD, Nehalem (X5570, 2.93GHz) | Up
1G | 1135.37 (us=1%) | 1125.44 (us=1%) | -1%
2G | 1922.23 (us=2%) | 1863.19 (us=2%) | -3%
5G | 4176.51 (us=7%) | 4385.18 (us=7%) | +5%
30G | 30903.4 (us=40%) | 36784.76 (us=40%) | +19%
us: userland CPU utilization
• CPU difference matters on CPU bound workloads
Harpertown X5470 vs Nehalem X5570 (Fusion)
Buffer pool | Fusion I/O+HDD, Harpertown (X5470, 3.33GHz) | Fusion I/O+HDD, Nehalem (X5570, 2.93GHz) | Up
1G | 13534.06 (user=35%) | 19295.94 (user=32%) | +43%
2G | 19026.64 (user=40%) | 25627.49 (user=37%) | +35%
5G | 30058.48 (user=50%) | 39435.25 (user=50%) | +31%
30G | 52582.71 (user=76%) | 66053.68 (user=76%) | +26%
• TPM difference was much higher than on HDD • For disk i/o bound workloads (buffer pool 1G/2G), CPU utilization on Nehalem was lower, but TPM was much higher – Verified that Nehalem is much more efficient for PCI-E workloads • Benefit from high interface speed between CPU and PCI-Express • Fusion I/O fits with Nehalem much better than with traditional CPUs
We need to think about redundancy overhead • Single server + No RAID is meaningless in the real database world • Redundancy – RAID 1 / 5 / 10 – Network mirroring (DRBD) – Replication (Sync / Async) • Relative overhead for redundancy will be (much) higher than on traditional HDD environment
Fusion I/O + Software RAID1 • Fusion I/O itself has RAID5 feature – Writing parity bits into Flash Memory – Flash Chips are not Single Point of Failure – Controller / PCI-E Board is Single Point of Failure • Right now no H/W RAID controller is provided for PCI-E SSDs • Using Software RAID1 (or RAID10) – Two Fusion I/O drives in the same machine
Understanding how software RAID1 works [Diagram: H/W RAID1: App/DB writes to files on /dev/sdX; the RAID controller’s write cache with battery returns the response, and background writes go to Disk1 and Disk2 in parallel. S/W RAID1: App/DB writes to files on /dev/md0; the software RAID daemon (“md0_raid1” process) writes to Disk1 and Disk2 in parallel, then the response is returned] • Response time on Software RAID1 is max(time-to-write-to-disk1, time-to-write-to-disk2) • If either of the two takes time for ERASE, response time will be longer • On faster storage / faster writes (e.g. sequential write + fsync), the relative overhead of the software raid process is higher
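The max() rule explains why a mirror is slower on average than either drive alone. The simulation below is a sketch under stated assumptions: each write is either fast (about 200 microseconds, as quoted earlier) or stalled by an ERASE (2 ms is an assumed value for "a few milliseconds"), and the two drives stall independently. The function names are illustrative.

```python
import random

FAST_WRITE_MS = 0.2   # ~200 microseconds, quoted earlier in the deck
ERASE_WRITE_MS = 2.0  # assumed value for "a few milliseconds"


def write_latency_ms(rng, erase_probability):
    """Latency of one write on one drive: slow if it has to wait for an ERASE."""
    return ERASE_WRITE_MS if rng.random() < erase_probability else FAST_WRITE_MS


def mean_latency_ms(mirrors, erase_probability, writes=100_000, seed=1):
    """Average response time when every write waits for the slowest mirror."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(writes):
        total += max(write_latency_ms(rng, erase_probability) for _ in range(mirrors))
    return total / writes


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3):
        single = mean_latency_ms(1, p)
        mirrored = mean_latency_ms(2, p)
        print(f"erase probability {p}: single {single:.3f} ms, RAID1 {mirrored:.3f} ms")
```

With two mirrors, a write is slow whenever either drive stalls, so the chance of a slow write rises from p to 1 - (1 - p)^2.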
Random Write IOPS, S/W RAID1 vs No-RAID [Chart: Random Write IOPS (Fusion I/O 160GB SLC, 16KB I/O unit, XFS); x-axis: Running time (minutes), 1 to 481; y-axis: IOPS (0 to 50000); series: No-RAID (120G), S/W RAID1 (120G), No-RAID (96G), S/W RAID1 (96G)] • 120GB data space = 40GB additional reserved space • 96GB data space = 64GB additional reserved space • On S/W RAID1, IOPS deteriorated more quickly than on No-RAID • On S/W RAID1 with 96GB data space, the slowest line was lower than on No-RAID • A 20-25% performance drop can be expected on disk write bound workloads
What about Reads? [Chart: Read IOPS (16KB Blocks); x-axis: concurrency (1 to 200); y-axis: IOPS (0 to 80000); series: No-RAID, S/W RAID1] • Theoretically, RAID1 can double read IOPS • Peak IOPS was 43636 on No-RAID, 75627 on RAID, 73% up • Good scalability
DBT-2, No-RAID vs S/W RAID on Fusion I/O
Buffer pool | Fusion I/O+HDD | RAID 1 Fusion I/O+HDD | %iowait | Down
1G | 19295.94 | 15468.81 | 10% | -19.8%
2G | 25627.49 | 21405.23 | 8% | -16.5%
5G | 39435.25 | 35086.21 | 6-7% | -11.0%
30G | 66053.68 | 66426.52 | 0-1% | +0.56%
Intel SSDs with a traditional H/W raid controller
Buffer pool | Single raw Intel | Four RAID5 Intel | Down
1G | 5709.06 | 2975.04 | -48%
2G | 7536.55 | 4763.60 | -37%
5G | 12892.56 | 11739.27 | -9%
• Raw SSD drives performed much better than drives behind a traditional H/W raid controller – Even on RAID10, performance was worse than on a single raw drive – The H/W raid controller seemed to be a serious bottleneck – Make sure the SSD drives themselves have a write cache and capacitor (Intel X25-V/M/E doesn’t have a capacitor) • Use JBOD + write cache + capacitor • Research appliances such as Schooner, Gear6, etc • Wait until H/W vendors release great H/W raid controllers that work well with SSDs
What about DRBD? • Single server is not Highly Available – Mother Board/RAID Controller/etc are Single Point of Failure • Heartbeat + DRBD + MySQL is one of the most common HA (Active/Passive) solutions • Network might be a bottleneck – 1GbE -> 10GbE, InfiniBand, Dolphin Interconnect, etc • Replication level – Protocol A (async) – Protocol B (sync to remote drbd receiver process) – Protocol C (sync to remote disk) • Network channel is single threaded – Storing all data under /data (single DRBD partition) => single thread – Storing log/ibdata under /hdd, *ibd under /ssd => two threads
DRBD Overheads on HDD
Buffer pool | HDD, No DRBD | DRBD Protocol B, 1GbE | DRBD Protocol B, 10GbE
1G | 1125.44 | 1080.8 | 1101.63
2G | 1863.19 | 1824.75 | 1811.95
5G | 4385.18 | 4285.22 | 4326.22
30G | 36784.76 | 32862.81 | 35689.67
• DRBD 8.3.7 • DRBD overhead (protocol B) was not big on disk i/o bound workloads • Network bandwidth difference was not big on disk i/o bound workloads
DRBD Overheads on Fusion I/O
Buffer pool | Fusion I/O+HDD, No DRBD | DRBD Protocol B, 1GbE | Down | DRBD Protocol B, 10GbE | Down
1G | 19295.94 | 5976.18 | -69.0% | 12107.88 | -37.3%
2G | 25627.49 | 8100.5 | -68.4% | 16776.19 | -34.5%
5G | 39435.25 | 16073.9 | -59.2% | 30288.63 | -23.2%
30G | 66053.68 | 37974 | -42.5% | 62024.68 | -6.1%
• DRBD overhead was not negligible • 10GbE performed much better than 1GbE • Still 6-10 times faster than HDD • Note: DRBD supports faster interfaces such as InfiniBand SDP and Dolphin Interconnect
Misc topic: Insert performance on InnoDB vs MyISAM (HDD) [Chart: Time to insert 1 million records (HDD); x-axis: Existing records (millions), 1 to 145; y-axis: Seconds (0 to 5000); series: innodb, myisam; annotation: 250 rows/s] • MyISAM doesn’t do any special i/o optimization like “Insert Buffering”, so a lot of random reads/writes happen, and performance depends heavily on the OS • Disk seek & rotation overhead is really serious on HDD
Note: Insert Buffering (InnoDB feature) [Diagram: Insert buffer, Optimized i/o] • If non-unique secondary index blocks are not in memory, InnoDB inserts entries into a special buffer (“insert buffer”) to avoid random disk i/o operations – Insert buffer is allocated on both memory and innodb SYSTEM tablespace • Periodically, the insert buffer is merged into the secondary index trees in the database (“merge”) • Pros: Reducing I/O overhead – Reducing the number of disk i/o operations by merging i/o requests to the same block – Some random i/o operations can be sequential • Cons: Additional operations are added – Merging might take a very long time when many secondary indexes must be updated and many rows have been inserted – It may continue to happen after a server shutdown and restart
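The I/O saving from merging requests to the same block can be shown with a toy simulation. This is not InnoDB's implementation: it assumes no index page is cached, counts one page I/O per touched page, and uses hypothetical names and parameters throughout.

```python
import random
from collections import defaultdict


def simulate(inserts, index_pages, merge_every, seed=1):
    """Return (page I/Os without buffering, page I/Os with buffering)."""
    rng = random.Random(seed)
    unbuffered_io = 0
    buffered_io = 0
    pending = defaultdict(list)  # index page number -> buffered entries
    for n in range(1, inserts + 1):
        page = rng.randrange(index_pages)
        unbuffered_io += 1       # direct update: one random page I/O per insert
        pending[page].append(n)  # buffered: remember the entry, no page I/O yet
        if n % merge_every == 0:
            buffered_io += len(pending)  # merge: one I/O per distinct page
            pending.clear()
    buffered_io += len(pending)  # final merge of whatever is left
    return unbuffered_io, buffered_io


if __name__ == "__main__":
    direct, buffered = simulate(inserts=10_000, index_pages=100, merge_every=1_000)
    print(f"direct page I/Os: {direct}, buffered page I/Os: {buffered}")
```

The fewer distinct pages touched between merges, the larger the saving; when every buffered entry lands on a different page, the buffer only delays the work.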
Insert performance: InnoDB vs MyISAM (SSD) [Chart: Time to insert 1 million records (SSD); x-axis: Existing records (millions), 1 to 103; y-axis: Seconds (0 to 600); series: InnoDB, MyISAM; annotations: 2,000 rows/s, 5,000 rows/s, “Index size exceeded buffer pool size”, “Filesystem cache was fully used, disk reads began”] • MyISAM got much faster by just replacing HDD with SSD !
Try MySQL 5.5.4 !
Buffer pool | Fusion I/O + HDD, MySQL5.5.2 | Fusion I/O + HDD, MySQL5.5.4 | Up
1G | 19295.94 | 24019.32 | +24%
2G | 25627.49 | 32325.76 | +26%
5G | 39435.25 | 47296.12 | +20%
30G | 66053.68 | 67253.45 | +1.8%
• Got 20-26% improvements for disk i/o bound workloads on Fusion I/O – Both CPU %user and %iowait were improved • %user: 36% (5.5.2) to 44% (5.5.4) when buf pool = 2g • %iowait: 8% (5.5.2) to 5.5% (5.5.4) when buf pool = 2g, but iops was 20% higher – Could handle a lot more concurrent i/o requests in 5.5.4 ! – No big difference was found on 4 HDDs • Works very well on faster storage such as Fusion I/O or lots of disks
Conclusion for choosing H/W • Disks – PCI-E SSDs (e.g. Fusion I/O) perform very well – SAS/SATA SSDs (e.g. Intel X25) – Carefully research RAID controller. Many controllers do not scale with SSD drives – Keep enough reserved space if you need to handle massive write traffic – HDD is good at sequential writes • Use fast network adapter – 1GbE will be saturated on DRBD – 10GbE or InfiniBand • Use Nehalem CPU – Especially when using PCI-Express SSDs
Conclusion for database deployments • Put sequentially written files on HDD – ibdata, ib_logfile, binary log files – HDD is fast enough for sequential writes – Write performance deterioration can be mitigated – Life expectancy of SSD will be longer • Put randomly accessed files on SSD – *ibd files, index files(MYI), data files(MYD) – SSD is 10x-100x faster for random reads than HDD • Archive less active tables/records to HDD – SSD is still much more expensive than HDD • Use InnoDB Plugin – Higher scalability & concurrency matters on faster storage
What will happen in the real database world? • These are just my thoughts.. • Less demand for NoSQL – Isn’t it enough for many applications just to replace HDD with Fusion I/O? – Importance on functionality will be relatively stronger • Stronger demand for Virtualization – Single server will have enough capacity to run two or more mysqld instances • I/O volume matters – Not just IOPS – Block size, disabling doublewrite, etc • Concurrency matters – Single SSD scales as well as 8-16 HDDs – Concurrent ALTER TABLE, parallel query
Special Thanks To • Koji Watanabe – Fusion I/O Japan • Hideki Endo – Sumisho Computer Systems, Japan – Rent me two Fusion I/O 160GB SLC drives • Daisuke Homma, Masashi Hasegawa - Sun Japan – Did benchmarks together
Thanks for attending! • Contact: – E-mail: Yoshinori.Matsunobu@sun.com – Blog http://yoshinorimatsunobu.blogspot.com – @matsunobu on Twitter
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization Systems
Databricks
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
Chris Baynes
 
The InnoDB Storage Engine for MySQL
The InnoDB Storage Engine for MySQLThe InnoDB Storage Engine for MySQL
The InnoDB Storage Engine for MySQL
Morgan Tocker
 

SSD Deployment Strategies for MySQL

  • 1. SSD Deployment Strategies for MySQL Yoshinori Matsunobu Lead of MySQL Professional Services APAC Sun Microsystems Yoshinori.Matsunobu@sun.com Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 1
  • 2. What do you need to consider? (H/W layer)
      • SSD or HDD?
      • Interface
          – SATA/SAS or PCI-Express?
      • RAID
          – H/W RAID, S/W RAID or JBOD?
      • Network
          – Is 1GbE enough?
      • Memory
          – Is 2GB RAM + PCI-E SSD faster than 64GB RAM + 8 HDDs?
      • CPU
          – Nehalem or older Xeon?
  • 3. What do you need to consider?
      • Redundancy
          – RAID
          – DRBD (network mirroring)
          – Semi-Sync MySQL Replication
          – Async MySQL Replication
      • Filesystem
          – ext3, xfs, raw device?
      • File location
          – Data file, Redo log file, etc.
      • SSD-specific issues
          – Write performance deterioration
          – Write endurance
  • 4. Why SSD? IOPS!
      • IOPS: number of (random) disk I/O operations per second
      • Almost all database operations require random access
          – Selecting records by index scan
          – Updating records
          – Deleting records
          – Modifying indexes
      • Regular SAS HDD: 200 IOPS per drive (disk seek and rotation are slow)
      • SSD: 2,000+ (writes) / 5,000+ (reads) IOPS per drive
          – Depends heavily on the SSD and its device driver
      • Let’s start with basic benchmarks
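The roughly 200 IOPS figure for a 15,000 RPM drive can be sanity-checked from mechanical latency alone. The following is a minimal sketch, not a measurement: the 3.5 ms average seek time is an assumed, illustrative value that does not come from the slides, and `hdd_random_iops` is a hypothetical helper name.

```python
def hdd_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Estimate random IOPS for one spinning disk with one outstanding request."""
    # One full rotation in milliseconds (60,000 ms per minute / rotations per minute).
    rotation_ms = 60_000.0 / rpm
    # On average the target sector is half a rotation away.
    avg_rotational_latency_ms = rotation_ms / 2.0
    # Transfer time for a small request is ignored in this rough model.
    service_time_ms = avg_seek_ms + avg_rotational_latency_ms
    return 1000.0 / service_time_ms


# Assumption: 3.5 ms average seek time (illustrative, not from the slides).
estimate = hdd_random_iops(rpm=15_000, avg_seek_ms=3.5)
print(f"Estimated random IOPS: {estimate:.0f}")
```

With these assumptions the estimate lands in the same range as the 200 IOPS quoted above. An SSD has no seek or rotation term at all, which is where its IOPS advantage comes from.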
  • 5. Tested HDD/SSD for this session
      • SSD
          – Intel X25-E (SATA, 30GB, SLC)
          – Fusion I/O (PCI-Express, 160GB, SLC)
      • HDD
          – Seagate 160GB SAS 15000RPM
  • 6. Table of contents
      • Basic Performance on SSD/HDD
          – Random Reads
          – Random Writes
          – Sequential Reads
          – Sequential Writes
          – fsync() speed
          – Filesystem difference
          – IOPS and I/O unit size
      • MySQL Deployments
  • 7. Random Read benchmark
      [Chart: Direct Random Read IOPS (Single Drive, 16KB, xfs); y-axis: IOPS (0 to 45,000); x-axis: # of I/O threads (1 to 200); series: HDD, Intel SSD, Fusion I/O]
      • HDD: 196 reads/s at 1 I/O thread, 443 reads/s at 100 I/O threads
      • Intel: 3508 reads/s at 1 I/O thread, 14538 reads/s at 100 I/O threads
      • Fusion I/O: 10526 reads/s at 1 I/O thread, 41379 reads/s at 100 I/O threads
      • Single-thread throughput on Intel is about 18x better than on HDD (3508 vs. 196 reads/s); Fusion I/O is about 54x better (10526 vs. 196 reads/s)
      • SSD’s concurrency scaling from 1 to 100 threads (4x) is much better than HDD’s (2.2x)
      • Very strong reason to use SSD
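The ratios on this slide can be recomputed directly from the measured reads/s values it quotes. A small sketch follows; the dictionary layout and the helper names `speedup_vs_hdd` and `concurrency_gain` are illustrative choices, not part of the original benchmark tooling.

```python
# Measured reads/s from the slide: (1 I/O thread, 100 I/O threads).
READS_PER_SEC = {
    "HDD": (196, 443),
    "Intel": (3508, 14538),
    "Fusion I/O": (10526, 41379),
}


def speedup_vs_hdd(device: str) -> float:
    """Single-thread reads/s of `device` divided by single-thread reads/s of the HDD."""
    return READS_PER_SEC[device][0] / READS_PER_SEC["HDD"][0]


def concurrency_gain(device: str) -> float:
    """How much reads/s grows when going from 1 to 100 I/O threads."""
    one_thread, hundred_threads = READS_PER_SEC[device]
    return hundred_threads / one_thread


for name in READS_PER_SEC:
    print(f"{name}: {speedup_vs_hdd(name):.1f}x vs HDD at 1 thread, "
          f"{concurrency_gain(name):.1f}x gain from 1 to 100 threads")
```

Separating the two ratios makes the slide's two claims explicit: the SSDs are faster per request, and they also scale better when more requests are in flight.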
  • 8. High Concurrency
      • A single SSD drive has multiple NAND flash memory chips (e.g. 40 x 4GB flash memory = 160GB)
      • Concurrency depends heavily on the I/O controller and the application
          – A single-threaded application cannot gain the concurrency advantage
  • 9. PCI-Express SSD
      [Diagram: CPU, North Bridge, South Bridge; a PCI-Express controller (2GB/s, PCI-Express x8) and a SAS/SATA controller (300MB/s), each leading to an SSD I/O controller and flash]
      • Advantage
          – PCI-Express is a much faster interface than SAS/SATA
      • (Current) disadvantages
          – Most motherboards have a limited # of PCI-E slots
          – No hot swap mechanism
  • 10. Write performance on SSD
      [Chart: Random Write IOPS (16KB blocks); y-axis: IOPS (0 to 20,000); series: 1 I/O thread, 100 I/O threads; categories: HDD (4 RAID10, xfs), Intel (xfs), Fusion (xfs)]
      • Very strong reason to use SSD
      • But wait.. Can we get high write throughput *anytime*?
          – Not always.. Let’s check how data is written to flash memory
  • 11. Understanding how data is written to SSD (1)
      [Diagram: flash memory chips made of blocks (some empty); each block is made of pages]
      • A single SSD drive consists of many flash memory chips (e.g. 2GB each)
      • A flash memory chip internally consists of many blocks (e.g. 512KB each)
      • A block internally consists of many pages (e.g. 4KB each)
      • It is *not* possible to overwrite a non-empty block
          – Reading from pages is possible
          – Writing to pages in an empty block is possible
          – Appending is possible
          – Overwriting pages in a non-empty block is *not* possible
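The rules on this slide (read anywhere, append to free pages, never overwrite in place) can be sketched as a toy model. This is only an illustration using the slide's example sizes (512KB blocks, 4KB pages); `FlashBlock` and `OverwriteNotAllowed` are hypothetical names, and real flash controllers are far more involved.

```python
BLOCK_BYTES = 512 * 1024   # example block size from the slide
PAGE_BYTES = 4 * 1024      # example page size from the slide
PAGES_PER_BLOCK = BLOCK_BYTES // PAGE_BYTES  # 128 pages per block


class OverwriteNotAllowed(Exception):
    """Raised when code tries to overwrite a page that already holds data."""


class FlashBlock:
    """Toy model of one flash block: pages can be read and appended, not overwritten."""

    def __init__(self, pages_per_block: int = PAGES_PER_BLOCK) -> None:
        self.pages = [None] * pages_per_block
        self.next_free = 0  # index of the first unwritten page

    def is_empty(self) -> bool:
        return self.next_free == 0

    def read(self, index: int):
        return self.pages[index]

    def append(self, data) -> int:
        """Write data into the next free page and return its index."""
        if self.next_free >= len(self.pages):
            raise OverwriteNotAllowed("block is full; it must be erased first")
        index = self.next_free
        self.pages[index] = data
        self.next_free += 1
        return index

    def overwrite(self, index: int, data) -> None:
        """In-place overwrite is not supported by flash; always refuse."""
        raise OverwriteNotAllowed("pages in a non-empty block cannot be overwritten")

    def erase(self) -> None:
        """Erase works on the whole block, never on a single page."""
        self.pages = [None] * len(self.pages)
        self.next_free = 0
```

The only way to reuse a written page in this model is `erase()`, which wipes the whole block. That asymmetry is what the following slides build on.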
  • 12. Understanding how data is written to SSD (2)
      [Diagram: new data is written to an empty block instead of over the existing pages of a non-empty block]
      • Overwriting a non-empty block is not possible
      • New data is written to an empty block instead
      • Writing to an empty block is fast (~200 microseconds)
      • Even when applications write to the same positions in the same files (e.g. the InnoDB log file), the written pages/blocks are distributed across the flash (wear leveling)
  • 13. Understanding how data is written to SSD (3) [Diagram: 1. Reading all pages of a block, 2. Erasing the block, 3. Writing all data, including the new page, back into the block] • In the long run, almost all blocks will be fully used – e.g. Allocating 158GB of files on a 160GB SSD • A new empty block must be allocated on writes • Basic steps to write new data: – 1. Reading all pages from a block – 2. ERASE the block – 3. Writing all data w/ new data into the block • ERASE is a very expensive operation (takes a few milliseconds) • At this stage, write performance becomes very slow because of massive ERASE operations
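The three steps above can be sketched as a rough cost model. The block size (512KB), page size (4KB) and the ~200 microsecond write come from the preceding slides. Everything else is an assumption for illustration only: the write time is applied per page, 2 ms stands in for "a few milliseconds" of ERASE, and the 25 microsecond page read is a made-up figure.

```python
# Rough cost model for a write once no empty blocks are left.
# All timings are illustrative assumptions, not measurements.
BLOCK_KB = 512        # example block size from the slides
PAGE_KB = 4           # example page size from the slides
PAGE_READ_US = 25     # hypothetical page read latency
PAGE_WRITE_US = 200   # "~200 microseconds", assumed to apply per page
ERASE_US = 2000       # assumed stand-in for "a few milliseconds"


def pages_per_block(block_kb=BLOCK_KB, page_kb=PAGE_KB):
    """Number of pages that make up one block."""
    return block_kb // page_kb


def write_to_empty_block_us():
    """Cost of writing one page when an empty block is available."""
    return PAGE_WRITE_US


def read_erase_write_us():
    """Cost of read-all-pages, ERASE, write-all-pages for one block."""
    n = pages_per_block()
    return n * PAGE_READ_US + ERASE_US + n * PAGE_WRITE_US


if __name__ == "__main__":
    print("pages per block:", pages_per_block())
    print("write to empty block (us):", write_to_empty_block_us())
    print("read + erase + write (us):", read_erase_write_us())
```

Under these assumed numbers the full cycle is two orders of magnitude slower than a write to an empty block, which is the effect the slide describes.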
  • 14. [Diagram: data space and reserved space. 1. Reading pages from a block in the data space, 2. Writing the data plus the new data into an empty block in the reserved space; background jobs ERASE unused blocks] • To keep write performance high enough, SSDs have a “reserved space” feature • Data size visible to applications is limited to the size of the data space – e.g. 160GB SSD, 120GB data space, 40GB reserved space • Fusion I/O has a function to change the reserved space size – # fio-format -s 96G /dev/fct0
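The data space / reserved space split is simple arithmetic. A minimal sketch using the 160GB drive from the slides; the function names are hypothetical and not part of any Fusion I/O tool.

```python
def reserved_space_gb(raw_capacity_gb, data_space_gb):
    """Reserved space left over after formatting to a given data space."""
    if not 0 < data_space_gb <= raw_capacity_gb:
        raise ValueError("data space must be within the raw capacity")
    return raw_capacity_gb - data_space_gb


def reserved_ratio(raw_capacity_gb, data_space_gb):
    """Fraction of the raw capacity that is kept as reserved space."""
    return reserved_space_gb(raw_capacity_gb, data_space_gb) / raw_capacity_gb


if __name__ == "__main__":
    for data_gb in (150, 120, 96, 80):
        print(data_gb, reserved_space_gb(160, data_gb),
              round(reserved_ratio(160, data_gb), 2))
```

Formatting the 160GB drive to 96GB, as in the `fio-format` example, leaves 64GB (40%) as reserved space.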
  • 15. Write performance deterioration [Figure: Write IOPS deterioration (16KB random write) under continuous write-intensive workloads, “Fastest” and “Slowest” IOPS for Intel, Fusion(150G), Fusion(120G), Fusion(96G) and Fusion(80G); annotation: stopping writing for a while] • At the beginning, write IOPS was close to the “Fastest” line • When massive writes happened, write IOPS gradually deteriorated toward the “Slowest” line (because massive ERASE happened) • Increasing reserved space improves steady-state write throughput • Write IOPS recovered to “Fastest” after writes were stopped for a long time (many blocks were ERASEd by the background job) • Results depend heavily on the flash memory and I/O controller (TRIM support, ERASE scheduling, etc)
  • 16. Sequential I/O [Figure: Sequential read/write throughput in MB/s (1MB consecutive reads/writes) for 4 HDD (RAID10, xfs), Intel (xfs) and Fusion (xfs)] • Typical scenario: Full table scan (read), logging/journaling (write) • SSD outperforms HDD for sequential reads, but less significantly • HDD (4 RAID10) is fast enough for sequential i/o • Data transfer size by sequential writes tends to be huge, so you need to care about write deterioration on SSD • No strong reason to use SSD for sequential writes
  • 17. fsync() speed [Figure: fsync/sec with 1KB, 8KB and 16KB writes for HDD (xfs), Intel (xfs) and Fusion I/O (xfs)] • 10,000+ fsync/sec is fine in most cases • Fusion I/O was CPU bound (%system), not I/O bound (%iowait).
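A quick way to get a rough fsync/sec figure on your own storage is a small write-then-fsync loop. This is a minimal sketch, not the tool used for the chart above; the function name, default write size and duration are assumptions.

```python
import os
import tempfile
import time


def fsync_per_sec(write_size=1024, duration=0.2, directory=None):
    """Append write_size bytes and fsync() repeatedly; return fsyncs per second.

    directory selects the filesystem to test (default: the system temp dir).
    """
    payload = b"x" * write_size
    count = 0
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        fd = f.fileno()
        start = time.perf_counter()
        deadline = start + duration
        while time.perf_counter() < deadline:
            os.write(fd, payload)   # append one unit of data
            os.fsync(fd)            # force it to stable storage
            count += 1
        elapsed = time.perf_counter() - start
    return count / elapsed


if __name__ == "__main__":
    for size in (1024, 8192, 16384):
        print(size, "bytes:", int(fsync_per_sec(size)), "fsync/sec")
```

Point `directory` at a mount on the device you want to measure; results on a temp directory backed by memory say nothing about disks.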
  • 18. HDD is fast for sequential writes / fsync • Best Practice: Writes can be boosted by using BBWC (Battery Backed up Write Cache), especially for REDO Logs (because they are written sequentially) • No strong reason to use SSDs here [Diagram: without a write cache every write waits for disk seek & rotation time; with a write cache the seek & rotation time is hidden behind the cache]
  • 19. Filesystem matters [Figure: Random write IOPS (16KB blocks) at 1 and 16 threads for Fusion (ext3), Fusion (xfs) and Fusion (raw)] • On xfs, multiple threads can write to the same file concurrently if it is opened with O_DIRECT, but they cannot on ext* • Good concurrency on xfs, close to raw device • ext3 is less optimized for Fusion I/O
  • 20. Changing I/O unit size [Figure: Read IOPS vs concurrency for 1KB, 4KB and 16KB I/O units (4 HDD RAID10)] • On HDD, a maximum performance difference of 22% was found between 1KB and 16KB • No big difference when concurrency < 10
  • 21. Changing I/O unit size on SSD [Figure: Read IOPS vs concurrency for 1KB, 4KB and 16KB I/O units (Fusion I/O)] • Huge difference • On SSDs, not only IOPS but also I/O transfer size matters • It’s worth considering Storage Engines that support “configurable block size” functionality
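One way to see why transfer size matters is to convert IOPS into bandwidth. The sketch below uses the Fusion I/O random read figure from earlier in the deck (41379 reads/s at 16KB, 100 i/o threads). The second function is only an upper bound under the assumption that bandwidth is the sole limit; it is not a measured result.

```python
def throughput_mb_per_s(iops, io_unit_kb):
    """Data moved per second for a given IOPS and I/O unit size."""
    return iops * io_unit_kb / 1024.0


def iops_at_bandwidth(mb_per_s, io_unit_kb):
    """IOPS that the same bandwidth would allow at another unit size."""
    return mb_per_s * 1024.0 / io_unit_kb


if __name__ == "__main__":
    mb = throughput_mb_per_s(41379, 16)
    print("16KB reads:", round(mb, 1), "MB/s")
    for unit_kb in (4, 1):
        print(unit_kb, "KB bound:", int(iops_at_bandwidth(mb, unit_kb)), "IOPS")
```

On HDD the seek dominates, so shrinking the unit size changes little; on SSD the bytes transferred are a much larger share of each operation.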
  • 22. Let’s start MySQL benchmarking • Base: Disk-bound application (DBT-2) running on: – Sun Fire X4270 – Nehalem 8 Core – 4 HDD – RAID1+0, Write Cache with Battery • What will happen if … – Replacing HDD with Intel SSD (SATA) – Replacing HDD with Fusion I/O (PCI-E) – Moving log files and ibdata to HDD – Not using Nehalem – Using two Fusion I/O drives with Software RAID1 – Deploying DRBD protocol B or C (and replacing 1GbE with 10GbE) – Using MySQL 5.5.4
  • 23. DBT-2 condition • SuSE Enterprise Linux 11, xfs • MySQL 5.5.2M2 (InnoDB Plugin 1.0.6) • 200 Warehouses (20GB – 25GB hot data) • Buffer pool size – 1GB – 2GB – 5GB – 30GB (large enough to cache all data) • 1000 seconds warm up time • Running 3600 seconds (1 hour) • Fusion I/O: 96GB data space, 64GB reserved space
  • 24. HDD vs Intel SSD
      NOTPM (transactions per minute)
                        HDD        Intel
      Buffer pool 1G    1125.44    5709.06
    • Storing all data on HDD or Intel SSD • Massive disk i/o happens – Random reads for all accesses – Random writes for updating rows and indexes – Sequential writes for REDO log files, etc • SSD is very good at these kinds of workloads • About 5 times performance improvement (5709.06 / 1125.44), without any application change!
  • 25. HDD vs Intel SSD vs Fusion I/O
                        HDD        Intel      Fusion I/O
      Buffer pool 1G    1125.44    5709.06    15122.75
    • Fusion I/O is a PCI-E based SSD • PCI-E is much faster than SAS/SATA • About 13x improvement compared to 4 HDDs (15122.75 / 1125.44)
  • 26. Which should we spend money on, RAM or SSD?
                        HDD         Intel      Fusion I/O
      Buffer pool 1G    1125.44     5709.06    15122.75
      Buffer pool 2G    1863.19
      Buffer pool 5G    4385.18
      Buffer pool 30G   36784.76                           (Caching all hot data)
    • Increasing RAM (buffer pool size) reduces random disk reads – Because more data is cached in the buffer pool • If all data is cached, only disk writes (both random and sequential) happen • Disk writes happen asynchronously, so application queries can be much faster • Large enough RAM + HDD outperforms too small RAM + SSD
  • 27. Which should we spend money on, RAM or SSD?
                        HDD         Intel       Fusion I/O
      Buffer pool 1G    1125.44     5709.06     15122.75
      Buffer pool 2G    1863.19     7536.55     20096.33
      Buffer pool 5G    4385.18     12892.56    30846.34
      Buffer pool 30G   36784.76    -           57441.64    (Caching all hot data)
    • It is not always possible to cache all hot data • Fusion I/O + a good amount of memory (5GB) was pretty good • Basic rule can be: – If you can cache all active data, large enough RAM + HDD – If you can’t, or if you need extremely high throughput, spend on both RAM and SSD
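The table is easier to compare as ratios. A minimal sketch that stores the NOTPM values from the slide and computes relative throughput; the dictionary layout and function name are illustrative choices, and `None` marks the cell the slide leaves empty.

```python
# NOTPM values copied from the slide; None marks the missing Intel 30G cell.
NOTPM = {
    "1G":  {"HDD": 1125.44,  "Intel": 5709.06,  "Fusion": 15122.75},
    "2G":  {"HDD": 1863.19,  "Intel": 7536.55,  "Fusion": 20096.33},
    "5G":  {"HDD": 4385.18,  "Intel": 12892.56, "Fusion": 30846.34},
    "30G": {"HDD": 36784.76, "Intel": None,     "Fusion": 57441.64},
}


def ratio(pool_a, dev_a, pool_b, dev_b):
    """Throughput of configuration A relative to configuration B."""
    a = NOTPM[pool_a][dev_a]
    b = NOTPM[pool_b][dev_b]
    if a is None or b is None:
        raise ValueError("no measurement for that configuration")
    return a / b


if __name__ == "__main__":
    print("Intel 1G vs HDD 1G:  %.1fx" % ratio("1G", "Intel", "1G", "HDD"))
    print("Fusion 1G vs HDD 1G: %.1fx" % ratio("1G", "Fusion", "1G", "HDD"))
    print("HDD 30G vs Fusion 1G: %.1fx" % ratio("30G", "HDD", "1G", "Fusion"))
```

The last ratio is the point of the slide: HDD with all hot data cached beats Fusion I/O with a 1GB buffer pool.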
  • 28. Let’s think about MySQL file location • SSD is extremely good at random reads • SSD is very good at random writes • HDD is good enough at sequential reads/writes • No strong reason to use SSD for sequential writes • Random I/O oriented: – Data Files (*.ibd): sequential reads if doing a full table scan – Undo Log, Insert Buffer (ibdata): UNDO tablespace (small in most cases, except when running long-running batch jobs); on-disk insert buffer space (small in most cases, except when InnoDB cannot keep up with updating indexes) • Sequential Write oriented: – Doublewrite Buffer (ibdata): write volume is equal to that of the *.ibd files, so it is huge – Binary log (mysql-bin.XXXXXX) – Redo log (ib_logfile) – Backup files
  • 29. Moving sequentially written files into HDD
                        Fusion I/O                      Fusion I/O + HDD               Up
      Buffer pool 1G    15122.75 (us=25%, wa=15%)       19295.94 (us=32%, wa=10%)      +28%
      Buffer pool 2G    20096.33 (us=30%, wa=12.5%)     25627.49 (us=36%, wa=8%)       +28%
      Buffer pool 5G    30846.34 (us=39%, wa=10%)       39435.25 (us=49%, wa=6%)       +28%
      Buffer pool 30G   57441.64 (us=70%, wa=3.5%)      66053.68 (us=77%, wa=1%)       +15%
    • Moving ibdata, ib_logfile, (+binary logs) into HDD • High impact on performance – Write volume to SSD is halved because the doublewrite area is allocated on HDD – %iowait was significantly reduced – You can delay write performance deterioration
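A minimal sketch of how such a split could be expressed in `my.cnf`, rendered from Python. The option names (`datadir`, `innodb_file_per_table`, `innodb_data_home_dir`, `innodb_log_group_home_dir`, `log-bin`) are standard MySQL options; the `/ssd` and `/hdd` mount points and the function name are assumptions, and this is not the exact configuration used in the benchmark.

```python
def render_my_cnf(ssd_dir="/ssd", hdd_dir="/hdd"):
    """Render a my.cnf fragment: *.ibd files on SSD, sequential files on HDD."""
    lines = [
        "[mysqld]",
        # Per-table *.ibd data files (random i/o) live under datadir on SSD.
        "datadir = %s" % ssd_dir,
        "innodb_file_per_table = 1",
        # ibdata (doublewrite buffer, undo, insert buffer) goes to HDD.
        "innodb_data_home_dir = %s" % hdd_dir,
        # Redo logs (ib_logfile*) go to HDD.
        "innodb_log_group_home_dir = %s" % hdd_dir,
        # Binary logs go to HDD.
        "log-bin = %s/mysql-bin" % hdd_dir,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_my_cnf())
```

Moving these files on an existing server requires a clean shutdown and relocating the files to match; the fragment only shows where each file type is pointed.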
  • 30. Does CPU matter? [Diagram: Nehalem: memory attached directly to the CPUs, QPI (25.6GB/s) between CPUs and North Bridge (IOH), PCI-Express below. Older Xeon: FSB (10.6GB/s) between CPUs and North Bridge (MCH), with memory and PCI-Express both attached to the North Bridge] • Nehalem has two big advantages 1. Memory is directly attached to the CPU: faster for in-memory workloads 2. Interface speed between CPU and North Bridge is about 2.5x higher, and interface traffic does not conflict with CPU<->Memory workloads: faster for disk i/o workloads when using PCI-Express SSDs
  • 31. Harpertown X5470 (older Xeon) vs Nehalem X5570 (HDD)
      HDD               Harpertown X5470, 3.33GHz    Nehalem X5570, 2.93GHz    Up
      Buffer pool 1G    1135.37 (us=1%)              1125.44 (us=1%)           -1%
      Buffer pool 2G    1922.23 (us=2%)              1863.19 (us=2%)           -3%
      Buffer pool 5G    4176.51 (us=7%)              4385.18 (us=7%)           +5%
      Buffer pool 30G   30903.4 (us=40%)             36784.76 (us=40%)         +19%
      us: userland CPU utilization
    • CPU difference matters on CPU bound workloads
  • 32. Harpertown X5470 vs Nehalem X5570 (Fusion)
      Fusion I/O+HDD    Harpertown X5470, 3.33GHz    Nehalem X5570, 2.93GHz    Up
      Buffer pool 1G    13534.06 (user=35%)          19295.94 (user=32%)       +43%
      Buffer pool 2G    19026.64 (user=40%)          25627.49 (user=37%)       +35%
      Buffer pool 5G    30058.48 (user=50%)          39435.25 (user=50%)       +31%
      Buffer pool 30G   52582.71 (user=76%)          66053.68 (user=76%)       +26%
    • TPM difference was much higher than on HDD • For disk i/o bound workloads (buffer pool 1G/2G), CPU utilization on Nehalem was lower, but TPM was much higher – Verified that Nehalem is much more efficient for PCI-E workloads • Benefit from high interface speed between CPU and PCI-Express • Fusion I/O fits with Nehalem much better than with traditional CPUs
  • 33. We need to think about redundancy overhead • Single server + No RAID is meaningless in the real database world • Redundancy – RAID 1 / 5 / 10 – Network mirroring (DRBD) – Replication (Sync / Async) • Relative overhead for redundancy will be (much) higher than in a traditional HDD environment
  • 34. Fusion I/O + Software RAID1 • Fusion I/O itself has a RAID5 feature – Writing parity bits into Flash Memory – Flash Chips are not a Single Point of Failure – Controller / PCI-E Board is a Single Point of Failure • Right now no H/W RAID controller is provided for PCI-E SSDs • Using Software RAID1 (or RAID10) – Two Fusion I/O drives in the same machine
  • 35. Understanding how software RAID1 works [Diagram: H/W RAID1: App/DB writes to files on /dev/sdX, the RAID controller responds from its write cache with battery and writes to Disk1 and Disk2 in the background (in parallel). S/W RAID1: App/DB writes to files on /dev/md0, the software RAID daemon (“md0_raid1” process) writes to Disk1 and Disk2 (in parallel) before responding] • Response time on Software RAID1 is max(time-to-write-to-disk1, time-to-write-to-disk2) • If either of the two takes time for ERASE, response time will be longer • On faster storage / faster writes (e.g. sequential write + fsync), relative overheads of the software RAID process are higher
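The max() rule above has a simple consequence: a mirrored write is slow whenever either drive is slow. A minimal sketch; the stall probability and the assumption that the two drives stall independently are illustrative, not measured.

```python
def raid1_write_latency(t_disk1, t_disk2):
    """A mirrored write completes only when both drives have completed."""
    return max(t_disk1, t_disk2)


def prob_slow_write(p_stall, mirrors=2):
    """Chance that at least one mirror stalls (e.g. on ERASE).

    Assumes each drive stalls independently with probability p_stall.
    """
    return 1.0 - (1.0 - p_stall) ** mirrors


if __name__ == "__main__":
    # One drive answers in 0.2 ms, the other waits 2 ms for an ERASE.
    print(raid1_write_latency(0.2, 2.0))
    # If each drive stalls on 10% of writes, the mirror stalls on ~19%.
    print(round(prob_slow_write(0.10), 2))
```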
  • 36. Random Write IOPS, S/W RAID1 vs No-RAID [Figure: Random Write IOPS vs running time in minutes (Fusion I/O 160GB SLC, 16KB I/O unit, XFS) for No-RAID (120G), S/W RAID1 (120G), No-RAID (96G) and S/W RAID1 (96G)] • 120GB data space = 40GB additional reserved space • 96GB data space = 64GB additional reserved space • On S/W RAID1, IOPS deteriorated more quickly than on No-RAID • On S/W RAID1 with 96GB data space, the slowest line was lower than on No-RAID • A 20-25% performance drop can be expected on disk write bound workloads
  • 37. What about Reads? [Figure: Read IOPS (16KB blocks) vs concurrency for No-RAID and S/W RAID1] • Theoretically, read IOPS can double with RAID1 • Peak IOPS was 43636 on No-RAID, 75627 on RAID1, 73% up • Good scalability
  • 38. DBT-2, No-RAID vs S/W RAID on Fusion I/O
                        Fusion I/O+HDD    RAID1 Fusion I/O+HDD    %iowait    Down
      Buffer pool 1G    19295.94          15468.81                10%        -19.8%
      Buffer pool 2G    25627.49          21405.23                8%         -16.5%
      Buffer pool 5G    39435.25          35086.21                6-7%       -11.0%
      Buffer pool 30G   66053.68          66426.52                0-1%       +0.56%
  • 39. Intel SSDs with a traditional H/W RAID controller
                        Single raw Intel    Four RAID5 Intel    Down
      Buffer pool 1G    5709.06             2975.04             -48%
      Buffer pool 2G    7536.55             4763.60             -37%
      Buffer pool 5G    12892.56            11739.27            -9%
    • Raw SSD drives performed much better than drives behind a traditional H/W RAID controller – Even on RAID10, performance was worse than on a single raw drive – The H/W RAID controller seemed to be a serious bottleneck – Make sure the SSD drives themselves have a write cache and a capacitor (Intel X25-V/M/E doesn’t have a capacitor) • Use JBOD + write cache + capacitor • Research appliances such as Schooner, Gear6, etc • Wait until H/W vendors release great H/W RAID controllers that work well with SSDs
  • 40. What about DRBD? • Single server is not Highly Available – Mother Board/RAID Controller/etc are Single Points of Failure • Heartbeat + DRBD + MySQL is one of the most common HA (Active/Passive) solutions • Network might be a bottleneck – 1GbE -> 10GbE, InfiniBand, Dolphin Interconnect, etc • Replication level – Protocol A (async) – Protocol B (sync to remote drbd receiver process) – Protocol C (sync to remote disk) • Network channel is single threaded – Storing all data under /data (single DRBD partition) => single thread – Storing log/ibdata under /hdd, *ibd under /ssd => two threads
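To see why the network can become the bottleneck, compare the link bandwidth with the write rate an SSD can sustain. A back-of-the-envelope sketch, assuming every replicated write is one 16KB block and ignoring protocol overhead and latency; real DRBD traffic will differ.

```python
def max_replicated_writes_per_sec(link_gbit, write_kb=16):
    """Upper bound on replicated writes/s that fit through the link.

    Assumes each write ships write_kb kilobytes and nothing else.
    """
    bytes_per_sec = link_gbit * 1e9 / 8      # link speed in bytes/s
    return bytes_per_sec / (write_kb * 1024)


if __name__ == "__main__":
    for gbit in (1, 10):
        print("%dGbE: ~%d writes/s" % (gbit, max_replicated_writes_per_sec(gbit)))
```

Under these assumptions 1GbE tops out at a few thousand 16KB writes per second, which is far above what 4 HDDs produce but within reach of a single PCI-E SSD.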
  • 41. DRBD Overheads on HDD
      HDD               No DRBD     DRBD Protocol B, 1GbE    DRBD Protocol B, 10GbE
      Buffer pool 1G    1125.44     1080.8                   1101.63
      Buffer pool 2G    1863.19     1824.75                  1811.95
      Buffer pool 5G    4385.18     4285.22                  4326.22
      Buffer pool 30G   36784.76    32862.81                 35689.67
    • DRBD 8.3.7 • DRBD overhead (protocol B) was not big on disk i/o bound workloads • Network bandwidth difference was not big on disk i/o bound workloads
  • 42. DRBD Overheads on Fusion I/O
      Fusion I/O+HDD    No DRBD     DRBD Protocol B, 1GbE    Down      DRBD Protocol B, 10GbE    Down
      Buffer pool 1G    19295.94    5976.18                  -69.0%    12107.88                  -37.3%
      Buffer pool 2G    25627.49    8100.5                   -68.4%    16776.19                  -34.5%
      Buffer pool 5G    39435.25    16073.9                  -59.2%    30288.63                  -23.2%
      Buffer pool 30G   66053.68    37974                    -42.5%    62024.68                  -6.1%
    • DRBD overhead was not negligible • 10GbE performed much better than 1GbE • Still 6-10 times faster than HDD • Note: DRBD supports faster interfaces such as InfiniBand SDP and Dolphin Interconnect
  • 43. Misc topic: Insert performance on InnoDB vs MyISAM (HDD) [Figure: Time in seconds to insert 1 million records (HDD) vs existing records (millions), InnoDB and MyISAM; annotation: 250 rows/s] • MyISAM doesn’t do any special i/o optimization like “Insert Buffering”, so a lot of random reads/writes happen, and performance depends heavily on the OS • Disk seek & rotation overhead is really serious on HDD
  • 44. Note: Insert Buffering (InnoDB feature) [Diagram: insert buffer, optimized i/o] • If non-unique, secondary index blocks are not in memory, InnoDB inserts entries into a special buffer (“insert buffer”) to avoid random disk i/o operations – The insert buffer is allocated both in memory and in the InnoDB SYSTEM tablespace • Periodically, the insert buffer is merged into the secondary index trees in the database (“merge”) • Pros: Reducing I/O overhead – Reducing the number of disk i/o operations by merging i/o requests to the same block – Some random i/o operations can become sequential • Cons: Additional operations are added – Merging might take a very long time when many secondary indexes must be updated and many rows have been inserted – It may continue to happen after a server shutdown and restart
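The idea of buffering and merging can be illustrated with a toy model. This is not InnoDB's actual data structure or algorithm; the class name, page numbers and the one-i/o-per-page cost are assumptions chosen to show how entries for the same block are merged into a single i/o.

```python
from collections import defaultdict


class ToyInsertBuffer:
    """Toy model: defer index entries whose target page is not in memory."""

    def __init__(self, cached_pages):
        self.cached_pages = set(cached_pages)   # index pages already in memory
        self.pending = defaultdict(list)        # page number -> buffered entries
        self.page_ios = 0                       # disk page i/o performed by merges

    def insert(self, page_no, entry):
        if page_no in self.cached_pages:
            return "applied"                    # page in memory: no disk i/o
        self.pending[page_no].append(entry)     # defer instead of a random read
        return "buffered"

    def merge(self):
        """Apply buffered entries page by page, in page order."""
        merged = 0
        for page_no in sorted(self.pending):
            self.page_ios += 1                  # one i/o covers all entries of a page
            merged += len(self.pending[page_no])
        self.pending.clear()
        return merged


if __name__ == "__main__":
    buf = ToyInsertBuffer(cached_pages={1})
    for page, key in [(1, "a"), (7, "b"), (7, "c"), (9, "d")]:
        print(page, key, buf.insert(page, key))
    print("merged:", buf.merge(), "page i/o:", buf.page_ios)
```

Three deferred entries cost two page i/os here instead of three, and visiting pages in sorted order is what lets some random i/o become closer to sequential.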
  • 45. Insert performance: InnoDB vs MyISAM (SSD) [Figure: Time in seconds to insert 1 million records (SSD) vs existing records (millions), InnoDB and MyISAM; annotations: 2,000 rows/s, 5,000 rows/s, “Index size exceeded buffer pool size”, “Filesystem cache was fully used, disk reads began”] • MyISAM got much faster by just replacing HDD with SSD!
  • 46. Try MySQL 5.5.4!
      Fusion I/O + HDD  MySQL 5.5.2    MySQL 5.5.4    Up
      Buffer pool 1G    19295.94       24019.32       +24%
      Buffer pool 2G    25627.49       32325.76       +26%
      Buffer pool 5G    39435.25       47296.12       +20%
      Buffer pool 30G   66053.68       67253.45       +1.8%
    • Got 20-26% improvements for disk i/o bound workloads on Fusion I/O – Both CPU %user and %iowait were improved: %user went from 36% (5.5.2) to 44% (5.5.4) when buf pool = 2g; %iowait went from 8% (5.5.2) to 5.5% (5.5.4) when buf pool = 2g, while iops was 20% higher – Could handle a lot more concurrent i/o requests in 5.5.4! – No big difference was found on 4 HDDs • Works very well on faster storage such as Fusion I/O or lots of disks
  • 47. Conclusion for choosing H/W • Disks – PCI-E SSDs (e.g. Fusion I/O) perform very well – SAS/SATA SSDs (e.g. Intel X25) – Carefully research the RAID controller. Many controllers do not scale with SSD drives – Keep enough reserved space if you need to handle massive write traffic – HDD is good at sequential writes • Use a fast network adapter – 1GbE will be saturated on DRBD – 10GbE or InfiniBand • Use a Nehalem CPU – Especially when using PCI-Express SSDs
  • 48. Conclusion for database deployments • Put sequentially written files on HDD – ibdata, ib_logfile, binary log files – HDD is fast enough for sequential writes – Write performance deterioration can be mitigated – Life expectancy of SSD will be longer • Put randomly accessed files on SSD – *ibd files, index files (MYI), data files (MYD) – SSD is 10x-100x faster for random reads than HDD • Archive less active tables/records to HDD – SSD is still much more expensive than HDD • Use InnoDB Plugin – Higher scalability & concurrency matters on faster storage
  • 49. What will happen in the real database world? • These are just my thoughts.. • Less demand for NoSQL – Isn’t it enough for many applications just to replace HDD with Fusion I/O? – Importance on functionality will be relatively stronger • Stronger demand for Virtualization – Single server will have enough capacity to run two or more mysqld instances • I/O volume matters – Not just IOPS – Block size, disabling doublewrite, etc • Concurrency matters – Single SSD scales as well as 8-16 HDDs – Concurrent ALTER TABLE, parallel query
  • 50. Special Thanks To • Koji Watanabe – Fusion I/O Japan • Hideki Endo – Sumisho Computer Systems, Japan – Lent me two Fusion I/O 160GB SLC drives • Daisuke Homma, Masashi Hasegawa – Sun Japan – Did benchmarks together
  • 51. Thanks for attending! • Contact: – E-mail: Yoshinori.Matsunobu@sun.com – Blog http://yoshinorimatsunobu.blogspot.com – @matsunobu on Twitter