Thursday 17 September 2015

Day 335, Disk failure, moments before disk replacement



As is typical in the world of hindsight bias, it was completely predictable that one of my pair of RAID 0 hard drives would fail just as the replacement disks were ready to be installed.

These disks are RAID 0, i.e. the whole data partition (or D: drive in this case) is striped across the available hard disks.  This creates a single logical drive across the disks with no resilience or redundancy built in.  Losing any single disk means that all the data is gone.
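To make the striping concrete, here is a minimal sketch of how a logical block number could map onto a pair of striped disks. The function names and the stripe size of 128 blocks are assumptions for illustration only, not the settings of the array described here.

```python
def locate_block(logical_block, num_disks=2, stripe_blocks=128):
    """Map a logical block number to (disk index, block number on that disk).

    stripe_blocks is a hypothetical stripe size; real controllers vary.
    """
    # Which stripe the block falls in, and how far into that stripe it is.
    stripe_index, offset = divmod(logical_block, stripe_blocks)
    # Stripes are dealt out to the disks in turn, like cards.
    disk = stripe_index % num_disks
    block_on_disk = (stripe_index // num_disks) * stripe_blocks + offset
    return disk, block_on_disk


def disks_touched(first_block, block_count, num_disks=2, stripe_blocks=128):
    """Return the set of disks that a run of logical blocks lands on."""
    return {
        locate_block(block, num_disks, stripe_blocks)[0]
        for block in range(first_block, first_block + block_count)
    }
```

Anything longer than one stripe lands on both disks, which is why one dead disk takes every sizeable file with it.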

The disks are a pair of 1TB Seagates which created a roughly 2TB sized partition.  I say roughly as disk manufacturers count the number of bytes on a disk in decimal rather than binary, so a 1TB disk holds about 9% less than a binary terabyte, and an operating system that counts in binary reports it as roughly 931GB.
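The decimal versus binary gap is easy to work out. This is only a rough sketch of the arithmetic: the function names are made up for illustration, and it ignores whatever space the file system itself takes.

```python
TB = 10**12   # a terabyte as counted on the box (decimal)
TIB = 2**40   # a terabyte as counted in binary (a tebibyte)
GIB = 2**30   # a gigabyte as counted in binary (a gibibyte)


def advertised_to_binary(advertised_tb):
    """Convert an advertised decimal TB figure to (binary TB, binary GB)."""
    total_bytes = advertised_tb * TB
    return total_bytes / TIB, total_bytes / GIB


def shortfall_percent():
    """Percentage by which a decimal terabyte falls short of a binary one."""
    return (1 - TB / TIB) * 100
```

For a 1TB disk this gives about 0.909 binary terabytes, or about 931 binary gigabytes, and the striped pair comes to roughly 1.82 binary terabytes rather than 2.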

So the partition was no longer accessible even though one of the physical drives was in perfect health.  Goodbye 1.5TB of data stored on that logical drive.

But it's not all bad news.  I discovered last week that this physical disk was failing and immediately got a pair of replacements.

Why a pair of replacements when only one was failing?

Well, the failing drive and the working drive were both the same model, bought at the same time, and had the same 2 year warranty.  The drive failed two months after the expiry of the warranty.  The other drive may be just as vulnerable to whatever sudden problem struck the wounded drive.  It was also worthwhile getting larger drives as they are no longer that expensive, and these Western Digital drives have a 5 year warranty.

Hang on, what about all the data?

That's another part of the "not all bad news" paragraph.  Of course there are backups, taken every week, so the saved data will be at worst 6 days out of date.  In this case, as soon as the disk started to fail I made a fresh backup and then stopped writing to the disk.

But RAID 0?  Why would you want to do that?  You could lose your data at any time if any drive fails.

Yes, but RAID 0 improves the performance of the logical disk compared with a single physical disk, and as the data is always backed up it isn't a problem.  Yes, it's a faff if a disk fails, but they rarely fail in a data-losing way - this is the first personal data-losing failure of a consumer hard disk I've had in 20-odd years.

Normally my disks are replaced a long time before they fail.  Of course, as disks grow in capacity the type of failure experienced here becomes more likely, so I don't expect the next 20 years to be as happy-go-lucky.  The more sectors on the disk, the more there are that can go wrong or be exposed to problems with data reads and writes, so the bigger the disk the greater the risk.  And many of these problems may be lurking on your disk drives, waiting to bite, because unless you regularly run disk checks you may not touch the data in an affected area for many years.  And if you don't touch the data you won't know it is corrupt and that there is a problem.
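One crude way to "touch the data" on purpose is to read every file end to end and note which ones the operating system refuses to hand over. What follows is a minimal sketch rather than a proper disk check: the function name read_scan is made up, it only exercises sectors that currently hold files, and reads served from the operating system's cache never reach the disk at all.

```python
import os


def read_scan(root, chunk_size=1024 * 1024):
    """Read every file under root and collect any read errors.

    Returns (files_read, bytes_read, failures), where failures is a list
    of (path, error message) pairs for files that could not be read.
    """
    files_read = 0
    bytes_read = 0
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so a 60GB file doesn't need 60GB of RAM.
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)
                files_read += 1
            except OSError as exc:
                # A bad sector usually surfaces here as an I/O error.
                failures.append((path, str(exc)))
    return files_read, bytes_read, failures
```

Calling something like read_scan("D:\\") and looking at the failures list would show which files sit on unreadable parts of the disk.  Given that heavy reading can finish off a struggling drive, it is best run while the backups are still fresh.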

I was fortunate in that the error was discovered in a 60GB file, touching a lot of disk blocks and a lot of sectors.

Fortunate, a 60GB file?

Well, if I only occasionally looked at small files I'd probably never have touched these disk sectors, and the disk might have failed before the next backup.  The 60GB file is a virtual machine.  The final backup struggled with the affected file, which would have been a giveaway at least, but that might itself have triggered drive failure.  I actually did trigger disk failure by trying to recover the file in various ways: the drive went offline and only came back after a couple of reboots.

Here are the shiny new disks waiting to be installed.


Obviously my backup system is going to have to breathe in a bit if these get full.

Happy happy, joy joy.




