23

Years ago I had HDDs that failed, and Windows would warn me that the HDD had a serious problem and give me time to do something about it, because otherwise it wasn't guaranteed the HDD would work again after a reboot. That was 10+ years ago.

I've had an SSD for the past 6+ years and I've been using it nonstop. It's a 256 GB SSD and I've written over 170 terabytes to it so far. In Windows' disks and drives settings, I see it still has 54% of its lifetime remaining, which is amazing.

I want to know: how reliable is this lifetime number, exactly? I know that Windows setting uses S.M.A.R.T. data to estimate the remaining lifetime, but are SSDs like HDDs in that they fail all of a sudden just because of a bad sector or something like that? Or do they degrade gradually over time? I check the remaining lifetime every few months and it sometimes goes down by 1%.

[Screenshots: the Windows drive health page, including the link "What to do about a critical warning for a storage device"]

More details for my SSD:

[Screenshot: SMART details for the drive]

The TBW rating for my SSD is 160 TB, but I've already written 170 TB and SMART shows 54% lifetime remaining. It has almost always been running at about 50 °C.

XPG SX8000 PCIe Gen3x4 M.2 2280 Solid State Drive

  • Does this answer your question? [What happens when an SSD wears out?](https://superuser.com/questions/345997/what-happens-when-an-ssd-wears-out) – Giacomo1968 Jan 25 '23 at 03:15

6 Answers

31

You can never know when one specific drive is going to fail, or whether it will fail slowly enough to rescue data from, or fail suddenly and catastrophically.

SMART is a set of 'guesswork' algorithms, in effect. It can be a reliable predictor of slow decline, but it can never predict sudden total failure.

You always need a backup in place, and you need to periodically test that it works. Waiting for the warning is just not reliable. This becomes even more important if a drive is encrypted, as any failure may take the encryption keys down with it, meaning the data is lost immediately and totally.

My oldest SSD is now about 10 years old. It still shows '100% health' when I look at the figures. I have two independent apps that check the SMART data in the background every few hours.
So far, so good.
My in-house backup runs every hour, my off-site overnight each night. I also make periodic direct clones.
One day the drive will fail. At that point I will order a new one & be back up & running, with no more than an hour's work lost, within half an hour of the new drive arriving.
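
A minimal sketch of that kind of background check, assuming a Linux machine with smartmontools installed and a hypothetical device path (`/dev/nvme0`); this is not one of the apps mentioned above, just an illustration of polling SMART periodically:

```python
# Periodic SMART health poll (illustrative sketch; typically needs root).
# Assumes smartmontools is installed; the device path is a placeholder.
import subprocess
import time

DEVICE = "/dev/nvme0"        # hypothetical device path; adjust for your drive
CHECK_INTERVAL = 4 * 3600    # check every few hours

def smart_health_ok(device: str) -> bool:
    """Run 'smartctl -H' and report whether the overall self-assessment passed."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    # smartctl prints a line like "...self-assessment test result: PASSED"
    return "PASSED" in result.stdout

while True:
    if not smart_health_ok(DEVICE):
        print(f"WARNING: {DEVICE} no longer reports PASSED -- check the drive and your backups")
    time.sleep(CHECK_INTERVAL)
```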

I once had, due to total coincidence, two boot drives on two machines fail inside a few months. Both drives were relatively new, both from reliable manufacturers.
Nothing was lost in either case.

Tetsujin
  • 7
    It's never 100% predictable. The two drives I lost were both only about a year old. Both were reputable makes. Both died suddenly & totally. MTTF [mean time to failure] is also just an average. One will die the first day, another will run a decade or more, but the majority will be around the predicted time, in a bell curve. So yes, it's perfectly possible that a 54% drive can fail tomorrow. – Tetsujin Jan 22 '23 at 12:19
  • 2
    @Tetsujin It's not a bell curve, it's a constantly decreasing curve. That's why the MTTF has been abandoned and replaced by the AFR (annual failure rate) by most of the vendors, because the MTTF is not really pertinent to the actual failure curves. – PierU Jan 22 '23 at 13:43
  • You also seemed to completely miss my quote - https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics – Tetsujin Jan 22 '23 at 18:08
  • 1
    OK, let me do it with numbers, with AFR=1%, out of 1,000,000 drives. Year 1: 10,000 drives die, 990,000 left; Year 2: 1% of the remaining ones, that is 9,900, die, 980,100 left; Year 3: 1% of the remaining ones, that is 9,801, die, 970,299 left, etc... More drives died in year 1 than in year 2, more in year 2 than in year 3, etc... Still, in 3 years only 29,701 drives died out of 1,000,000, which is not a bad business (there's a sketch of this arithmetic after the comments). – PierU Jan 22 '23 at 19:08
  • 1
    @PierU - Hahahaha… I *love* statistics. Have you noticed that if you do it that way, the last drive lives forever? You have just tried to explain [Zeno's paradox](https://en.wikipedia.org/wiki/Zeno%27s_paradoxes) – Tetsujin Jan 22 '23 at 19:12
  • 1
    While the intro to this answer is correct, based on personal experience the particular SMART attribute in question tends to be pretty accurate on most good brands of SSD provided you’re not physically abusing the SSD or dealing with power issues. This is because the threshold set by most manufacturers is pretty conservative, the current value is trivial to compute (it’s literally just a persistent counter in firmware being compared against a fixed value), and most of the time most SSDs simply don’t fail catastrophically (they can, but it’s not hugely common). – Austin Hemmelgarn Jan 22 '23 at 20:03
  • 3
    @Tetsujin *" if you do it that way, the last drive lives forever"* --> nope. The last drive has still 1% chance of failure every year. The resulting life expectancy is finite, not infinite. That said, the constant AFR model is in practice valid only during a few years, and after that it tends to increase. – PierU Jan 22 '23 at 20:42
  • 1
    I was under the impression that it was most common to model hardware failures as a "bathtub curve", although I don't know if there's been research validating that this model works well for hard drives and SSDs. – James_pic Jan 23 '23 at 10:09
  • 2
    @James_pic You're correct, at least for HDDs, and this is confirmed by extensive studies made on thousands of drives by companies like Backblaze (who operate data centers for online backup/storage). But for some reason drive vendors rate their drives in terms of a constant AFR. – PierU Jan 23 '23 at 19:20
  • @Tetsujin I had the same experience as you. Usually, from my observations, an SSD just suddenly dies in 2 ways: either it stops being detected entirely (like how my Kingston V200 died in less than 2 years) or it is detected but has a weird name/shows the controller name (like how a Goodram CL100 died after 3 hours of use). But, if the wear-leveling is running out of cells, or a cell is damaged, it should go into read-only mode. Those are the only 2 SSDs that died on me, and I've worked with about 40-50 SSDs (replacing them in PCs, reinstalling Windows, and that fun stuff). – Ismael Miguel Jan 23 '23 at 22:28
  • 2
    @Tetsujin I think there's some misunderstanding on both sides here. PierU's saying that, if you take AFR at face value, you get exponential decay. Even though a model based purely on AFR would predict a decade-old drive to have the same odds of failing this year as a brand new one, since some have already failed over that decade, a plurality of drives fail during the first year (despite this being very rare). I think PierU is saying that this exponential decay model is standard among manufacturers, (1/2) – Radvylf Programs Jan 24 '23 at 04:32
  • 1
    (2/2) but you seem to be saying that a normal distribution (bell curve) is used instead, which could very well be true. You could still get an average AFR from a normal distribution, it just wouldn't be very meaningful, since as PierU's pointing out, that would imply exponential decay. I don't know who's right here, and it's too late at night for me to go digging into papers on what the best model for drive failure rates is, but PierU's definitely not trying to say that the majority of drives die early on, nor that in a finite sample of drives one will always last forever. – Radvylf Programs Jan 24 '23 at 04:47
  • 1
    A visualization of what an exponential decay model based on AFR would look like: https://www.desmos.com/calculator/zkbvi9ncvg (as you can see, only 10% of drives fail in the first year with an AFR of 10%, and only 10% of surviving drives fail in year 40, but there's still more failures in year 1 than any other simply because more drives are surviving to fail in the first place) – Radvylf Programs Jan 24 '23 at 04:49
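
A small sketch of the constant-AFR arithmetic discussed in these comments, reproducing PierU's numbers (1% AFR, 1,000,000 drives, three years): the yearly failure count shrinks only because fewer drives survive to fail, while the per-drive risk stays constant.

```python
# Constant annual failure rate (AFR) model from the comments above:
# every year, each surviving drive has the same chance of dying.
afr = 0.01            # 1% annual failure rate
drives = 1_000_000    # starting population
total_failed = 0.0

for year in range(1, 4):
    failed = drives * afr          # expected failures this year
    drives -= failed               # survivors carry over to the next year
    total_failed += failed
    print(f"Year {year}: {failed:,.0f} fail, {drives:,.0f} still running")

print(f"Total after 3 years: {total_failed:,.0f} failed")
# Year 1: 10,000; Year 2: 9,900; Year 3: 9,801 -> 29,701 in total,
# matching the figures in the comment thread.
```
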
22

SSD wear-out is mainly due to the cumulative amount of data written to it. So the vendors use accelerated tests and statistical models to quantify how much written data a particular model can withstand, and they rate the model in terms of TBW (terabytes written). The SMART "remaining lifetime" is based on this: if you have written 170 TB and have 54% remaining, your drive is probably rated around 370 TBW.
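
As a back-of-the-envelope check of that inferred rating (the ~370 TBW figure is an estimate from the SMART numbers, not a datasheet value):

```python
# Estimate the TBW rating implied by the SMART figures:
# 170 TB written corresponds to 46% of the rated endurance used (54% remaining).
written_tb = 170
remaining_fraction = 0.54

used_fraction = 1 - remaining_fraction
implied_tbw = written_tb / used_fraction
print(f"Implied TBW rating: ~{implied_tbw:.0f} TB")   # ~370 TB
```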

What happens when the drive reaches the given TBW and the remaining lifetime is 0%? Nothing... The TBW is just a statistical value, something like "after the TBW, 99% of the drives are still operating correctly" (I don't know if it's 99%, 90%, or 99.9%, but that's the idea, with a given threshold): so it's perfectly possible that your particular drive lasts twice the rated TBW (and it's also perfectly possible that it fails after half the TBW).

There are other SMART attributes that can better help predict a failure, such as the read error rate, the pending sector count, the reallocated sector count... When one of them starts increasing, you should worry about the drive. Note that one bad sector in itself, or even a few bad sectors, is not enough to say that the drive will fail soon.
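
A minimal sketch of watching those counters for increases between checks; how you obtain the raw values (smartctl, a vendor tool, ...) is up to you, so the input dict and the snapshot file name below are placeholders:

```python
# Sketch: flag SMART counters that have increased since the last check.
import json
from pathlib import Path

SNAPSHOT = Path("smart_snapshot.json")   # hypothetical state file

# Raw counter values read from your SMART tool of choice (example numbers only)
current = {
    "Raw_Read_Error_Rate": 0,
    "Current_Pending_Sector": 0,
    "Reallocated_Sector_Ct": 0,
}

if SNAPSHOT.exists():
    previous = json.loads(SNAPSHOT.read_text())
    for name, value in current.items():
        old = previous.get(name, 0)
        if value > old:
            # A count that keeps growing is the real warning sign, not a single bad sector
            print(f"WARNING: {name} went from {old} to {value} -- time to worry about this drive")

SNAPSHOT.write_text(json.dumps(current))
```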

And still, an SSD can also fail at any moment, without any warning, with all the SMART attributes looking OK. But that's no different from any other electronic or mechanical product.

PierU
  • 3
    TBW is about a specific failure mode: individual memory cells no longer reliably storing data. When the lifetime remaining hits 0, that means the manufacturer is no longer guaranteeing that written data will be readable. Other failure modes are independent of TBW. – Mark Jan 24 '23 at 01:02
9

but are SSDs like HDDs in that they fail all of a sudden just because of a bad sector or something like that? Or do they degrade gradually over time?

They most certainly degrade over time, which has to do with the finite number of program/erase (P/E) cycles, which is basically what the remaining-lifetime attributes are trying to measure. The controller will try to spread this wear evenly over the NAND.

It is also known that, for example, the ability of cells to retain data decreases as the number of P/E cycles for those cells increases. In other words, an SSD close to its predicted EOL is not the same SSD it was when you purchased it. So while those cells can still be programmed, they're in worse shape than they once were.

As a result, the SSD needs to do more maintenance, which itself contributes to wear: the decreased data-retention ability is countered by, for example, the SSD periodically refreshing data (patrolling), which involves reading the data and writing it to a different location, so this process also adds to the P/E cycle count.

But SSDs can and certainly will also fail all of a sudden due to, for example, firmware bugs, firmware corruption, cosmic rays, sudden loss of power, physical trauma, wear of SMD components and whatnot.

The recovery rate for SSDs at data recovery labs is considerably lower than that for conventional HDDs, so keeping backups is perhaps even more important (it's important anyway, but you get the point).

With regard to this particular case, the health score displayed by the SMART tool is based on the 05 attribute:

[Screenshot: SMART attribute table showing the raw values]

The 54% is based on a single raw value: attribute 05, 'Percentage Used', which is 0x2E (46 decimal); this value increases as the situation deteriorates. The reserved spare capacity is still 100% available (0x64); that value drops as the situation deteriorates.
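
In other words, the figures shown by the tool are simple arithmetic on those two raw values (a sketch using the numbers from the screenshot above):

```python
# How the displayed percentages relate to the raw values quoted above.
percentage_used = 0x2E     # 46: counts up as the rated endurance is consumed
available_spare = 0x64     # 100: counts down as reserve blocks are used up

remaining_life = 100 - percentage_used
print(f"Remaining life: {remaining_life}%")      # 54%
print(f"Available spare: {available_spare}%")    # 100%
```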

The TBW for my SSD is 160 TB, but I've already written 170 TB and SMART shows 54% lifetime remaining.

It's not uncommon for SSD manufacturers to change specs and switch components.

Joep van Steen
  • _This decreased ability is countered by for example periodically refreshing data, which involves reading the data and writing it to a different location_ – This is actually done automatically by the flash controller. It's called static wear leveling. – forest Jan 22 '23 at 22:21
  • No, wear leveling is not what I'm referring to. I am referring to the decreased data retention capability and measures to counter this. – Joep van Steen Jan 22 '23 at 22:27
  • You're thinking of dynamic wear leveling. Static wear leveling is a little different and is the process where data that has been "resting" for a long time gets read and written to a new area. It increases data retention capacity (as well as reducing wear). – forest Jan 22 '23 at 22:28
  • Ehm, no, I think I am not. Wear leveling is about what goes where; I am referring to refreshing data that is at risk due to read/write disturb and charge leakage. – Joep van Steen Jan 22 '23 at 22:53
  • But you're describing a process where the flash controller periodically moves data that has been sitting a long time, right? – forest Jan 22 '23 at 22:55
  • The controller may decide to move the data for different reasons, *one* potential reason being wear leveling. So static data may be moved for reasons of wear leveling, but that's not what I am referring to. – Joep van Steen Jan 22 '23 at 23:05
  • Oh, OK. I assumed that the static wear leveling process was used for both wear leveling _and_ to refresh data to counter leakage. – forest Jan 22 '23 at 23:05
4

are SSDs like HDDs in that they fail all of a sudden just because of a bad sector or something like that? Or do they degrade gradually over time?

I wouldn't say it is (always) that way round. An HDD often makes unusual noises before it fails, and you may even still be able to read it as long as it keeps running when it fails (do not turn it off then!). When an SSD of mine failed, it became completely unavailable from one second to the next, and the PC hasn't recognized it as a drive at all since then.

As always, different people have different experiences, and everyone has one manufacturer that works well for them and one they wouldn't buy again. But it seems that SSD controllers tend to shut everything off, while HDD controllers seem to keep trying their best (or are unable to tell whether it is a final failure or still only an upcoming one).

With modern wear leveling, SSDs should notice faulty cells early, as they can detect that a cell doesn't work when they write to another cell during wear leveling, and they usually have more spare sectors than HDDs. Of course, this also depends on the model and the firmware.

Estimates like "x TB to go" or "54% lifetime left" are just estimates. You may get a new drive under warranty as long as SMART still reports lifetime left, but that won't help you get lost data back. Make backups, get new drives from time to time, and make sure to monitor the other SMART values that may hint at a degrading drive.

allo
  • Two SSDs in a RAID 1 will have approximately the same number of blocks written and the same age. If they are the same model, they may still fail at the same time. For example, there were models with a firmware bug that [failed after 40,000 hours of use](https://www.cisco.com/c/en/us/support/docs/field-notices/705/fn70545.html). This applies not only to SSDs, but also to HDDs; if you have two of the same model, they may have the same faults, especially if they are from the same batch in production. – allo Jan 26 '23 at 18:51
1

Every major SSD made in the last decade has spare capacity. The exact amount varies, but it's roughly 10% when the drive is new. This is necessary because writing to an SSD is destructive. The SSD directs your writes to the spare capacity, and then returns the overwritten parts to the spare pool. Wear levelling ensures that all parts of an SSD are written equally often. If a write does fail because a sector has gone bad, that sector is dropped from the spare capacity and the write is retried on another spare sector.

This means that the SSD can compare the spare capacity with the bad sector list. If your SSD runs out of spare blocks, it can't write new data anymore. So this ratio is a simple and effective measure of expected lifetime.

But there are other parts of the SSD that can also fail, and these cannot be reliably measured. So this spare capacity-based lifetime is no excuse to skip making backups.

MSalters
  • While it's correct to say that a SSD has a spare capacity, it's incorrect to present it as a special area in the SSD. It's just that the reported capacity, as seen by the OS, is lower than the actual capacity of the drive: the difference is the spare capacity. In particular, the SSD doesn't *"direct your writes to the spare capacity"*, it just writes anywhere on the SSD (choosing free blocks according to the wear-levelling algorithm). Last, the remaining lifetime attribute is not based at all on the spare capacity left. – PierU Jan 23 '23 at 21:51
-2

SMART was almost useless for HDDs, mainly because it could not take into account most failure modes and the processes leading to them.

It did report some potentially useful data (hours powered on, number of power-on cycles, number of bad sectors, and a bunch of other numbers) that could be used for some basic predictions, but the model parameters varied between models and sometimes even between batches of disks.

For SSDs, it did not improve much.

The volume of data written is the most important number for an SSD, and in theory one can compare it to the promises in the datasheet, but, again, some disks keep going well past 10 times the rated write volume and others die before reaching it.


In short: use reputable brands, use RAID, use a UPS, make backups, and hope for the best.

fraxinus
  • 9
    SMART is _very_ useful for HDDs. Just because it can't detect all failure modes doesn't mean that it's not quite good at detecting many of them, especially when failure is imminent. – forest Jan 23 '23 at 06:50
  • 1
    [Google's hard-drive study](https://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf) found that SMART could be used to predict about 50% of all hard drive failures with a low rate of false positives, or about 64% with a substantially higher rate of false positives. – Mark Jan 25 '23 at 00:20