SLC versus MLC in Enterprise SSD arrays
Editor:- June 24, 2008 - The original purpose of my SSD Myths article was to show that you needn't worry about wear-out if you use "best of breed" flash SSDs with write endurance on the order of 1 million cycles and above.
When it was first published (in March 2007) all flash SSDs in traditional hard disk form factors used SLC.
But in the year since publication many leading SSD oems (including Samsung, Mtron and STEC) have introduced MLC products too.
To confuse things even more - in June 2008 - Silicon Motion announced a new family of flash SSD controllers which enable oems to mix and match MLC and SLC chips in the same drive - creating in effect SLC-MLC hybrid SSDs.
MLC doubles the capacity of flash memory by discriminating 4 digital states in the signal stored in a single cell (2 bits per cell) - instead of the traditional (binary) 2 states (1 bit per cell).
This technique has been commercialized and proven over many years in hundreds of millions of cell phones and MP3 / iPod music players - where the theoretical consequence of data corruption (if anything went wrong with this risky "new" storage technology) was no more serious than an inaudible sub millisecond sound blip or invisible pixel splat.
In the SSD market MLC yields much lower cost storage than SLC with read / write speeds which are nearly as fast as the best SLC devices.
The manufacturers of first generation "hard disk replacement" MLC flash SSDs have responsibly classified them as aimed at the "notebook market" and by subtle wording differentiated them from their more pricey "enterprise" products. In the low duty cycle world of a notebook these MLC SSDs should give a good operating life - typically similar to the hard disks they replace. (Most SSD marketers would claim their MTBFs are even better than HDDs).
But there's no way to tell the difference between SLC and MLC SSDs externally (apart from the model numbers). Put them in a rackmount system in a datacenter with fast processors which can pump them continuously close to their maximum speed and what happens?
It's a simple matter to plug new data for MLC into the calculation I did for the worst case wear-out process for flash SSDs - which I called the Rogue Data Recorder.
Instead of the 64GB example I used then, I'll assume the MLC SSD has 128GB capacity. MLC SSDs have more capacity than SLC. And more capacity means longer operating life - before cells wear out.
I'll still use the 80M bytes / sec sustained write speed - because the fastest MLC products (in Feb 2008) can already do that. (Meanwhile the fastest SLC products have moved up in the world and are about 50% faster.)
The next factor is where we hit the big problem... Instead of a write endurance rating of 2 million cycles (for the best SLC) - I can only use a figure of 10,000 for MLC. MLC's rating is much lower because reliably discriminating multiple logic levels interacts badly with the intrinsic wear-out failure mechanism.
Plugging these numbers in the same calculation gives an estimated MLC flash SSD operating life (at max write throughput) which is 6 months! (instead of 51 years for a 64GB SLC SSD).
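For readers who want to check the arithmetic, here's the same worst case calculation as a small Python sketch. The capacity, endurance and write speed figures are the ones quoted above; the formula (capacity x endurance / sustained write rate) is the simple Rogue Data Recorder model and deliberately ignores write amplification and wear leveling overheads.

```python
# Worst case (Rogue Data Recorder) wear-out estimate:
# life = (capacity x write endurance) / sustained write rate
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def wear_out_life_years(capacity_bytes, endurance_cycles, write_bytes_per_sec):
    total_writable = capacity_bytes * endurance_cycles
    return total_writable / write_bytes_per_sec / SECONDS_PER_YEAR

# 128GB MLC SSD, 10,000 cycle endurance, 80M bytes/sec sustained writes
mlc_years = wear_out_life_years(128e9, 10_000, 80e6)
# 64GB SLC SSD, 2 million cycle endurance, same write speed
slc_years = wear_out_life_years(64e9, 2_000_000, 80e6)

print(f"MLC life: {mlc_years * 12:.1f} months")   # roughly 6 months
print(f"SLC life: {slc_years:.0f} years")         # roughly 51 years
```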
That's not good enough for a data driven enterprise. There isn't a wide enough safety margin.
Proponents of MLC might say - can't you batch select MLC chips for better write endurance in the same way that some oems do for SLC wear out? - Couldn't that give a figure that is 10x better?
There's not enough data to give a definitive answer - but I suspect the answer would be no!
The reason is that you would be selecting for a single chip which falls inside 2 different probability distributions for what are already secondary characteristics. (Like looking for the ideal man in Sex and the City.) Even in the unlikely event that you could find some devices with the magic properties to do this - the yield would be small - pushing the cost up and eliminating the main reason for using MLC.
That's where I thought this "SLC versus MLC in enterprise SSDs" discussion would end. But then another factor appeared out of the blue.
Sam Anderson at EasyCo pointed out to me that one side effect of their patent pending Managed Flash Technology is that their software "effectively erases erase blocks 10 to 100 times less frequently than drives doing traditional random writes" because it writes address blocks monotonically.
EasyCo's MFT was originally designed to deliver much faster system IOPS in flash SSD arrays. Its patent pending write algorithms manage arrays of standard SSDs in a way which reduces the probability of successive writes hitting an address block that is already busy in a time consuming erase/write cycle.
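To illustrate why writing address blocks monotonically reduces erase frequency, here's a toy simulation - my own sketch, not EasyCo's algorithm. It compares a naive in-place mapping (where overwriting a page forces an erase of its block) against a log-structured remapper which fills erase blocks sequentially and only erases a block when it recycles one. The page size, block size and write counts are illustrative assumptions.

```python
import random

PAGE = 4 * 1024            # host write size (bytes) - assumed
ERASE_BLOCK = 256 * 1024   # flash erase block size (bytes) - assumed
PAGES_PER_BLOCK = ERASE_BLOCK // PAGE
N_BLOCKS = 1024            # toy drive size
N_WRITES = 200_000

LOGICAL_PAGES = N_BLOCKS * PAGES_PER_BLOCK

def erases_in_place():
    # Crude model of in-place flash: rewriting a page that already holds
    # data forces an erase cycle on its block.
    written = set()
    erases = 0
    for _ in range(N_WRITES):
        lpn = random.randrange(LOGICAL_PAGES)
        if lpn in written:
            erases += 1
        else:
            written.add(lpn)
    return erases

def erases_log_structured():
    # Monotonic (log-structured) writes: pages are appended to the current
    # block, so a block is only erased once per PAGES_PER_BLOCK page writes
    # (garbage-collection overhead ignored in this toy model).
    return N_WRITES // PAGES_PER_BLOCK

if __name__ == "__main__":
    ip, ls = erases_in_place(), erases_log_structured()
    print(f"in-place erases:       {ip}")
    print(f"log-structured erases: {ls}")
    print(f"reduction factor:     ~{ip / ls:.0f}x")
```

With these illustrative parameters the reduction factor comes out in the tens - consistent with the 10x to 100x range EasyCo quotes.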
This new (to me) attribute of MFT opens up the possibility of yet another generation of high speed rackmount SSDs with new price points which could be 50x lower than RAM SSDs while being only 3x slower overall in typical applications.
I'll return to that subject soon in a new article called Demystifying SSD IOPS.
Conclusion?
I can't give a definitive answer to the question - Are MLC SSDs Ever Safe in Enterprise Apps?
With the current state of technology in 2008 - it depends on the application and the consequences of data corruption.
I wouldn't risk it if I were a bank - but I might not mind if my own bank risked it and changed some pluses to minuses...
Seriously though I hope this article has shown that there are serious risks inherent in using MLC flash SSDs if they are not applied correctly.
Some of these risks can be managed by choosing an SSD array supplier who has qualified and tested their racks with products from a single known source (because every make of MLC flash SSD has its own unique failure profile).
I know that despite my warnings - MLC flash SSDs will get used in some enterprise apps - because the cost difference (compared to other options) is very attractive.
In my view using an MLC flash SSD array for an enterprise application without at least using the (claimed) wear-out mitigating effects of a technology like EasyCo's MFT is like jumping out of a plane without a parachute.
And even with a parachute - strange things may still happen to wannabe MLC SSD enterprise pioneers on the way down.
More Articles About Flash SSD Endurance
Can you trust your flash SSD specs?
Increasing Flash Solid State Disk Reliability
SSD Myths and Legends - "write endurance"
Is All CompactFlash Really Created Equal? (pdf)
Flash Disk Reliability Begins at the IC Level (pdf)
Flash Solid State Disk Write Endurance in Database Environments
Are MLC SSDs More Susceptible to Power Rail Disturbance?
As someone who in a past career designed analog data acquisition products and systems which got right down below the thermal noise - and who cared about the shape and material of PCB tracks - I want to air another concern about the (in)advisability of using MLC NAND flash in datacenter applications where there's a lot of power rail disturbance.
Although MLC devices have been used in commercial products since 2003 - those products (phones and portable music players) are battery operated environments where (inside the casing) the overall power rail and electromagnetic compatibility has been controlled and managed by system designers who know enough about these things. And as I say elsewhere in this article - the consequences of misread data in those applications are trivial.
You could say almost the same about the environment for an MLC flash SSD inside a notebook PC. It's a known, testable environment. Although the user can plug modules in - these are rarely a source of high energy disturbance. The designers will have tested it with a range of plug-ins, and they've sold millions of similar notebooks before. There will be few surprises.
An array of SSDs in a datacenter cabinet is not such a quiet place.
There are plenty of fast processors all around. Above you - below you. The SSD designer does not control that space. Every installation is unique.
Something which you may not be aware of is that inside an MLC flash chip there are effectively a 2 bit analog to digital converter (ADC) and a 2 bit DAC. Between each of the 4 logic levels there is also an indeterminate band where the signal should never be. Power line disturbances are 3x more likely to result in a false read for MLC than for SLC - and the overall error comparison gets even worse than that. There's also a bigger intrinsic risk (for MLC than for SLC) of an error creeping in with the initial write charge. SSD designers deal with this by surrounding blocks of MLC flash data with heavier error detection and correction codes than they would normally use for SLC.
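A rough feel for that susceptibility can be had from a toy simulation (my own sketch, with made-up voltage and noise figures): place 2 levels (SLC) or 4 levels (MLC) across the same threshold-voltage window, add Gaussian disturbance at read time, and decode to the nearest level. Because MLC's guard bands are roughly 3x narrower, the same disturbance amplitude produces far more raw misreads - which is exactly why the heavier ECC is needed.

```python
import random

V_WINDOW = 3.0        # usable threshold-voltage window (volts) - assumed
NOISE_SIGMA = 0.3     # rms disturbance seen at read time (volts) - assumed
TRIALS = 200_000

def raw_misread_rate(n_levels):
    # Nominal charge levels spaced evenly across the same window.
    levels = [i * V_WINDOW / (n_levels - 1) for i in range(n_levels)]
    errors = 0
    for _ in range(TRIALS):
        programmed = random.choice(levels)
        read = programmed + random.gauss(0.0, NOISE_SIGMA)
        decoded = min(levels, key=lambda lv: abs(lv - read))  # nearest-level read
        if decoded != programmed:
            errors += 1
    return errors / TRIALS

print(f"SLC (2 levels) raw misread rate: {raw_misread_rate(2):.4%}")
print(f"MLC (4 levels) raw misread rate: {raw_misread_rate(4):.4%}")
```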
I found a good detailed discussion of ECC potential problems in this Denali article:- Memory ECC: A curiosity for decades, now essential for MLC NAND flash from which the quote below comes.
"With the voltage levels closer together for MLC flash the devices are again more susceptible to disturbs and transient occurrences, causing the generation of errors which then have to be detected and corrected. If that is not enough for the chip maker, it poses an even larger problem for the system designer, in that there is more of a variety of technologies employed among competing flash chip designs than DRAM makers, for example, would ever dream of."
For a related discussion about what EMC (not the storage company) can mean for signal integrity going into a flash SSD see the white paper - Noise Damping Techniques for PATA SSDs in Military-Embedded Systems (pdf) by SiliconSystems.
SSD Myths and Legends - "write endurance"
Does the fatal gene of "write endurance" built into flash solid state disks prevent their deployment in intensive server acceleration applications - such as RAID systems?
It was certainly true as little as a few years ago.
What's the risk with today's devices?
This article looks at the current generation of products and calculates how much (or how little) you should be worried.
RAM based SSDs have been used alongside RAID for years - but flash SSDs are physically smaller, have bigger capacity (up to 412G in 2.5", 832G in 3.5"), are lower cost than RAM SSDs and could actually be configured in standard RAID boxes.
F-SSDs aren't as fast as RAM based products but a single flash SSD can deliver 20,000 IOPS - which when scaled up in an array - starts to look interesting. ...read the article
More Conclusions
Flash SSDs are complex systems with a lot of stuff going on inside.
Like cars (which all use the internal combustion engine), flash SSDs from different manufacturers are not all the same.
Even if they have the same capacity and interfaces.
There are many different process and media management technologies inside a flash SSD which oems deal with (or not) in their own proprietary ways. These are just some of the consequences:-
- best to worst wear leveling algorithms can vary product life by a factor of 3 to 1 (see the sketch after this list). That's not too bad. Some so called "SSDs" - which are actually dumb flash storage bolted to a disk interface - don't have wear leveling and should not be used in servers at all.
- best to worst SLC endurance can vary by 30 to 1.
- SLC to MLC endurance can vary from 10 to 1, upto 300 to 1
- intrinsic electrical noise susceptibility between SLC and MLC is hard to quantify - but probably on the order of 10 to 1. Although hidden by wrap around redundancy and error detection and correction - the possibility of uncorrectable errors is still greater in MLC - which is unproven in enterprise environments.
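As a quick sketch of the first factor above - wear leveling quality - the back-of-envelope life calculation used earlier can be scaled by how evenly erase cycles are spread. The efficiency factors here are illustrative assumptions chosen to show the 3 to 1 spread, not measured values for any product.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def drive_life_years(capacity_bytes, endurance_cycles, write_bps, leveling_efficiency):
    # leveling_efficiency = 1.0 means erase cycles are spread perfectly evenly;
    # lower values model algorithms which concentrate wear on fewer blocks.
    return (capacity_bytes * endurance_cycles * leveling_efficiency
            / write_bps / SECONDS_PER_YEAR)

CAP, END, RATE = 128e9, 10_000, 80e6   # the MLC example used earlier

for label, eff in [("best wear leveling", 1.0),
                   ("worst wear leveling (3 to 1 spread)", 1 / 3)]:
    print(f"{label}: {drive_life_years(CAP, END, RATE, eff) * 12:.1f} months")
```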
Buying flash SSDs for enterprise applications should be regarded as an important qualifying process. Just as you wouldn't buy a traditional RAID system without knowing what type of hard disks were inside it, or without knowing something about the experience of the vendor in enterprise apps - so too you shouldn't buy flash SSDs without asking about the factors discussed in this article.
The risk for users is that many oems who designed SSD architectures for the notebook market - will try to capture business in the enterprise market - with the same (or similar) products without dealing with the datacenter's need for better resilience and data reliability.
And, sadly, I know from my own inbox that some SSD marketers don't know how much they don't know about their own market and how much more advanced their competitors are in the field of reliability.
SLC MLC Hybrid SSDs
In June 2008 - Silicon Motion announced a new family of flash SSD controllers which enable oems to mix and match MLC and SLC chips in the same drive.
The controller can analyze the incoming files from the host and intelligently move frequently accessed data to SLC NAND and non-frequently accessed data to MLC NAND. With this innovative hybrid architecture, the SSD system cost is significantly reduced to a level comparable to a pure MLC-based SSD, while endurance is significantly enhanced and comparable to a pure SLC-based SSD.
However, the intrinsically higher susceptibility of MLC flash to electrical disturbances remains a risk factor in such hybrid devices.
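A sketch of what such hot/cold placement might look like is below. This is my own toy model - the class name, promotion threshold and policy are assumptions, not Silicon Motion's actual controller logic - but it shows the basic idea: logical blocks that are rewritten frequently get promoted into the small SLC region, while cold data stays in cheap MLC.

```python
from collections import Counter

class HybridPlacementSketch:
    """Toy hot/cold placement policy for a hybrid SLC/MLC drive (assumed
    names and thresholds). Blocks written more often than hot_threshold
    are promoted into the limited SLC region; everything else stays MLC."""

    def __init__(self, slc_capacity_blocks, hot_threshold=4):
        self.slc_capacity = slc_capacity_blocks
        self.hot_threshold = hot_threshold
        self.write_counts = Counter()
        self.slc_resident = set()

    def write(self, lba):
        self.write_counts[lba] += 1
        if (self.write_counts[lba] >= self.hot_threshold
                and len(self.slc_resident) < self.slc_capacity):
            self.slc_resident.add(lba)      # promote hot block to SLC
        return "SLC" if lba in self.slc_resident else "MLC"

if __name__ == "__main__":
    ftl = HybridPlacementSketch(slc_capacity_blocks=2)
    # Block 7 is rewritten repeatedly (hot); 100..104 are written once (cold).
    for lba in [7, 100, 7, 101, 7, 102, 7, 103, 7, 104]:
        print(lba, "->", ftl.write(lba))
```

A real controller would also demote stale blocks and account for SLC wear, but even this crude policy shows how a small SLC region can absorb most of the rewrite traffic.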
Looking at MLC Writes in More Detail
The clearest description I've seen explaining the mechanics of MLC flash writes and the problems they present for an SSD controller is in a white paper by Hyperstone called Flash Controller & Firmware (pdf), from which this quote is taken.
"...charge levels might change due to external conditions such as extreme heat or magnetism. While the cell itself is not damaged permanently, the bit value might have changed and a read error might occur. Some more recent flashes have the capability of recognizing the systematic change in behavior or in the voltage level so that not the difference to a starting reference voltage, but the inability to differentiate the relative difference to other voltage levels produces the read errors."
Editor again... Hyperstone's description of Incremental Step Pulse Programming reminded me of the Intel 2816 (the first commercially available flash chip).
I had an early sample in about 1981. Programming was done by shaped pulses. The Intel suggested circuit for doing this didn't work, but it was easy to modify. Locations were programmed iteratively using bursts until you read back the value you'd written - and then wrote some more bursts for safety. If you wrote too many pulses that could zap the device. Or it might mean the location was unusable. Later generations of flash memory hid these details from view. But once you've seen what happens - past the cloaking effect of a flash memory controller - you appreciate the delicate balances involved in making a working flash storage drive.
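For what it's worth, here's a sketch of the kind of program-and-verify loop being described - the callbacks are hypothetical stand-ins for the real pulse-shaping hardware, and the bounds are my own assumptions, not figures from the Intel 2816 datasheet.

```python
def program_location(apply_pulse, read_back, target, max_pulses=16, margin_pulses=2):
    """Iterative program-and-verify, in the spirit of the 1981 procedure
    (and of the Incremental Step Pulse Programming later flash formalised).
    apply_pulse() fires one programming burst with the target value on the
    data lines; read_back() returns what the cell currently holds. Both are
    hypothetical callbacks supplied by the caller."""
    for _ in range(max_pulses):
        apply_pulse()
        if read_back() == target:
            # Verified - add a couple of margin pulses "for safety", as
            # described above, then stop before over-programming the cell.
            for _ in range(margin_pulses):
                apply_pulse()
            return True
    return False   # couldn't program within the pulse budget - mark location bad
```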