Bad flash storage: the 5 most common problems

What you will learn:

  • The two main factors of SSD failure.
  • The top five causes of NAND flash storage problems.

Over the past decade, NAND flash storage has emerged as the preferred medium for storing and accessing all kinds of data, from video recording and streaming, personal storage, and operating system provisioning to data logging, application acceleration, and more. The pace of innovation has increased both speed and storage capacity many times over.

The only aspect that has declined, at least generally speaking, is reliability. With new product introduction cycles of just a few months, vendors no longer take the time to fully test and verify complex functionality. As a result, immature products enter the market and then depend on multiple firmware updates in the field to eliminate issues uncovered by customer testing.

In most cases, such issues are not made public, and NAND storage problems are not shared outside the affected company unless the damage reaches the general public. Tesla, for example, recently had to recall 134,000 cars due to the premature failure of an undersized embedded MultiMediaCard (eMMC).

When it comes to solid-state drive (SSD) failure, there are two main aspects to consider: hardware and firmware.

The hardware determines the raw bit error rate (the percentage of block reads with bit errors before they pass through the error correction unit), cell data retention, and the supported temperature range. The firmware must wear the flash evenly, perform bit error correction, and mitigate the effects of temperature, data retention, and power loss issues.

The following are the top five causes of NAND flash storage issues.

1. Poor NAND quality.

NAND flash memory is a commodity, so the pressure is to keep the cost per gigabyte low. Many developments (3D NAND, QLC) are motivated mainly by this objective. For use in cell phones, PCs, and laptops, consumer-grade NAND is sufficient. This is not true for more demanding applications such as enterprise storage or industrial, networking, and communications applications.

The JEDEC standardization consortium has defined two main use cases and their respective quality requirements:

  • Client use case: PC-user workload, 8 hours/day, 40 °C, uncorrectable bit error rate (UBER) of 10^-15
  • Enterprise use case: database workload, 24 hours/day, 55 °C, uncorrectable bit error rate (UBER) of 10^-16

Both 10^-15 and 10^-16 appear to be extremely low numbers, but the difference means that a client drive will suffer an uncorrectable error 10 times more often than an enterprise drive for the same amount of data read. With the high throughput of modern SSDs, the likelihood of such a failure is no longer negligible.
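To put the two UBER targets in perspective, the expected number of uncorrectable bit errors scales linearly with the amount of data read. The following Python sketch works through that arithmetic. The 1 PB read volume is an illustrative assumption, not a figure from JEDEC or from any vendor.

```python
def expected_uncorrectable_errors(bytes_read: float, uber: float) -> float:
    """Expected number of uncorrectable bit errors after reading bytes_read bytes."""
    return bytes_read * 8 * uber


# Illustrative assumption: 1 PB (10**15 bytes) read over the drive's life.
BYTES_READ = 1e15

client = expected_uncorrectable_errors(BYTES_READ, 1e-15)
enterprise = expected_uncorrectable_errors(BYTES_READ, 1e-16)

print(f"client:     {client:.1f} expected uncorrectable errors")
print(f"enterprise: {enterprise:.1f} expected uncorrectable errors")
```

Under this assumption the client-class target allows roughly eight uncorrectable errors per petabyte read, while the enterprise-class target allows fewer than one.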

Today’s NAND flash raw bit error rate is in the range of 10^-2 for lower grades and 10^-3 for cutting-edge technology. Successive levels of error correction bring this raw rate down to the required UBER level. Both the flash quality and the strength of the error handling have a direct impact on the sale price. As a rule of thumb: don’t put a cheap, commercial-grade SSD in an application that requires a low error rate.
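How error correction bridges the gap between a raw bit error rate near 10^-3 and a UBER target near 10^-15 can be estimated with a simple binomial model. The sketch below assumes independent bit errors and a code that corrects up to t bit errors per codeword. The codeword size and the values of t are hypothetical, chosen only for illustration; real controllers (BCH, LDPC) differ in detail.

```python
import math


def log_binom_pmf(n: int, k: int, p: float) -> float:
    """Natural log of P(exactly k bit errors in n bits), bits independent."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)


def codeword_failure_prob(n: int, t: int, rber: float, terms: int = 200) -> float:
    """P(more than t errors in an n-bit codeword), i.e. ECC cannot correct it.

    Sums at most `terms` tail terms; for the small error rates considered
    here, the terms beyond that are negligible.
    """
    upper = min(n, t + terms)
    return sum(math.exp(log_binom_pmf(n, k, rber)) for k in range(t + 1, upper + 1))


# Hypothetical codeword: 1 KiB of data plus parity, about 9,000 bits.
N_BITS = 9000
RBER = 1e-3

for t in (20, 40, 60):
    p_fail = codeword_failure_prob(N_BITS, t, RBER)
    # Rough UBER estimate: uncorrectable codewords per bit read.
    print(f"t={t:2d}: codeword failure ~ {p_fail:.2e}, UBER ~ {p_fail / N_BITS:.2e}")
```

The failure probability drops steeply as the correction strength t grows, which is why stronger error handling, at extra cost, is what separates the quality grades.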

2. Bad NAND design.

3D NAND cells form a very complex stack of many layers. Currently, some devices have more than 140 layers. Manufacturing requires etching very thin, but very deep, holes through a sandwich of hundreds of polysilicon and silicon oxide deposits. Due to the nature of the etching, the lower part of the hole is much narrower than the upper part, which gives the transistors different electrical properties along the stack. This makes it very difficult to read all cells reliably. Temperature changes between writing and reading add a further dimension of variance.

Not all NAND designs deliver sufficiently reliable data when the temperature changes between write and read. As long as the SSD resides in a thermally well-controlled system, for example in PCs, laptops, servers, or handhelds, the temperature variation is too small to cause problems.

For industrial or NetCom applications, the demands on the NAND increase dramatically, and the NAND design and supporting firmware must accommodate large temperature fluctuations. The wrong choice of flash product can cause multiple problems once the system has to operate under fluctuating temperature conditions.

3. Poor mechanical stability.

Have you ever heard of thermomechanical stress? It arises when temperature fluctuations act on structures that combine elements with different coefficients of thermal expansion, that is, some parts expand more than others for the same temperature change.

An SSD consists of a PCB with soldered flash packages, a controller, a connector, and small passives. All of them behave differently with changes in temperature. Since the packages are soldered to the PCB, the differing expansion causes mechanical stress, which ultimately leads to broken interconnects (Figs. 1 and 2).

This damage occurs after hundreds to thousands of temperature cycles and can take years to appear. It matters a great deal for industrial systems that stay in the field for a long time.
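The size of the expansion mismatch behind this stress can be estimated from the linear expansion relation (change in length = coefficient × length × temperature change). The Python sketch below is a rough illustration only: the package size, both expansion coefficients, and the temperature swing are assumed values, not measurements of any particular SSD.

```python
def expansion_mismatch_um(length_mm: float, cte_a_ppm: float,
                          cte_b_ppm: float, delta_t_k: float) -> float:
    """Difference in free thermal expansion, in micrometres, between two
    joined materials over the same length and temperature swing."""
    mismatch_ppm = abs(cte_a_ppm - cte_b_ppm)
    return length_mm * 1000.0 * mismatch_ppm * 1e-6 * delta_t_k


# Assumed, illustrative values:
PACKAGE_LENGTH_MM = 14.0   # edge length of a flash package
CTE_PCB_PPM = 16.0         # PCB expansion, ppm per kelvin
CTE_PACKAGE_PPM = 8.0      # effective package expansion, ppm per kelvin
DELTA_T_K = 125.0          # swing from -40 °C to +85 °C

mismatch = expansion_mismatch_um(PACKAGE_LENGTH_MM, CTE_PCB_PPM,
                                 CTE_PACKAGE_PPM, DELTA_T_K)
print(f"mismatch across the package: {mismatch:.1f} µm per temperature swing")
```

A mismatch of this order has to be absorbed by the solder joints on every cycle, which is consistent with damage that accumulates over hundreds to thousands of cycles.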

4. Poor robustness to power failure.

For a laptop that always shuts down smoothly, robustness to power outages isn’t an issue. For a medical device that is simply unplugged, or a NetCom router in an environment with an unstable power supply, a sudden power failure should not lead to a system failure.

Sudden power loss can occur at any time: during an external write to the SSD, during internal garbage collection, during firmware updates, even during recovery from a previous power loss. How well the firmware handles the power loss determines the severity of the data loss. In the best case, only the last data written (data in flight) is lost; in the worst case, the firmware is corrupted and the SSD no longer works. In many critical applications, losing even a few bits of data is simply not acceptable.
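The best case described above, where only the data in flight is lost, can be illustrated with a checksummed append-only log. The Python sketch below is a simplified illustration, not any vendor's firmware: the record format (length plus CRC32 header) and the function names are hypothetical.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def recover(log: bytes) -> list[bytes]:
    """Return every complete, checksum-valid record; stop at the first torn one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupted record: discard it and everything after
        records.append(payload)
        offset = start + length
    return records


log = encode_record(b"block A") + encode_record(b"block B")
torn = log + encode_record(b"block C")[:-3]  # power lost mid-write

print(recover(torn))  # [b'block A', b'block B']
```

The interrupted third record fails the length check and is dropped, while everything committed before the power loss survives. Real firmware must apply the same idea to its own mapping tables and metadata, which is considerably harder.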

Swissbit has tested SSDs commonly available on the market and has seen all of these types of failure occur during power-fail testing.

5. Wrong firmware architecture.

Speed matters, at least for mainstream drives. However, speed tests are usually performed when the drives are new, empty, and freshly formatted. What is often not taken into account is how much performance is left when the drive is 100% full, has been overwritten multiple times, or is running at high temperature. Many existing firmware architectures focus on headline performance specifications, not on maximum endurance, retention, or sustained performance over the entire operating range.

Choosing an SSD that is not optimized for long-term use can lead to unpleasant surprises once the drive is past its early life (Fig. 3).

Conclusion

Selecting the right SSD or NAND flash product depends on many criteria. Particularly for industrial use or demanding applications, the following aspects should be part of the decision-making process: selection of the right components, mechanical construction, firmware architecture, and robustness to power failures. Weighing all of them is the best way to find a reliable storage device that will store and retrieve data over a long service life.
