Friday 13 February 2009

The Electronic Charleston – backup thoughts 1

The dance we do when we apply old ideas to new technology.

A current task is investigating new backup methods for our systems and data, and looking at how we should approach maintaining backups for the purpose of recovering data. Data may need to be recovered after loss caused by a disaster affecting the machine rooms, individual machine disk failures or a major failure of the SAN.


Some questions that have been tossed around are:

Why are we backing data up?

Are we doing it this way because we have always done it this way?

Now that we have new technologies, should we be doing it differently?

Are there other technologies we could be using?


Some background.

The first line of protection is the resilience of the systems themselves. All machines are running some form of RAID, either RAID 1, 5 or 6, which protects against disk loss. With RAID 5 a single disk loss can be survived. With RAID 1 it may be a single disk or more, depending on the configuration and which disks in the RAID set fail. With dual-parity RAID 6 two disks can be lost.
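To make the disk-loss rules above concrete, here is a minimal sketch in Python. The function names and the mirror layout (two mirrored pairs) are my own illustrative assumptions, not a description of how our arrays are actually configured.

```python
def raid5_survives(failed_disks):
    """Single parity: the set survives the loss of at most one disk."""
    return len(set(failed_disks)) <= 1


def raid6_survives(failed_disks):
    """Dual parity: the set survives the loss of at most two disks."""
    return len(set(failed_disks)) <= 2


def mirror_pairs_survive(failed_disks, mirror_pairs):
    """Mirroring: data survives unless both disks of the same pair fail.

    mirror_pairs is a hypothetical layout, e.g. [(0, 1), (2, 3)].
    """
    failed = set(failed_disks)
    return all(not (a in failed and b in failed) for a, b in mirror_pairs)


pairs = [(0, 1), (2, 3)]                    # assumed layout, for illustration only
print(raid5_survives([2]))                  # one disk lost
print(raid6_survives([2, 5]))               # two disks lost
print(mirror_pairs_survive([0, 2], pairs))  # one disk from each pair
print(mirror_pairs_survive([0, 1], pairs))  # both disks of one pair
```

The mirrored case is the one where "which disks fail" matters: losing one disk from each pair is survivable, losing both halves of the same pair is not.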

In the case of RAID 1 and 5 the system is no longer resistant to failure once a disk has failed, and it will continue to be at heightened risk until the failed disk has been replaced and the data has been rebuilt on the replacement. This window of risk during disk failure and rebuilding is why many SAN systems use variants of RAID 6: a second disk failure during the risk period is a real possibility, and the more disks there are in the RAID array the greater that risk. Other components can fail too. An array controller may decide to invert all the bits being written, or add or subtract bits randomly, or play tunes using the disk read/write heads, all of which may leave the data in an unusable state. These things may randomly bite us on the backside; the hope is that such problems are discovered quickly.
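The claim that the rebuild-window risk grows with the number of disks can be sketched with a very simple model. This assumes disks fail independently at a constant rate, which is probably optimistic since disks from the same batch tend to age together. The 3% annual failure rate and 24 hour rebuild time are made-up figures for illustration, not measurements from our arrays.

```python
import math

HOURS_PER_YEAR = 8760


def second_failure_probability(disks_in_array, rebuild_hours, annual_failure_rate):
    """Chance that at least one surviving disk fails during the rebuild window.

    Assumes independent failures at a constant rate (an assumption, not a
    property of real hardware).
    """
    surviving = disks_in_array - 1
    # Convert the annual failure rate into a constant hourly failure rate.
    hourly_rate = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-surviving * hourly_rate * rebuild_hours)


for n in (4, 8, 16):
    p = second_failure_probability(n, rebuild_hours=24, annual_failure_rate=0.03)
    print(f"{n} disks: {p:.4%}")
```

Whatever figures are plugged in, the probability rises with both the number of disks and the length of the rebuild, which is the argument for dual parity on large arrays.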

The second line of protection is mirroring data between sites, typically only available to data stored on the SAN. If one of the machine rooms gets vapourised by an alien death ray we can mount the mirrored data on spare or less used hardware at the second site. Data is categorised with different priorities, and the higher priority data will be higher on the list to be returned to service.

Tape backup is not a level of protection in the sense above, but a method for storing data from a reasonably recent time that may need to be recovered if the first levels of protection fail, or if data has been deleted for whatever reason and is now needed. The current terminology for the acceptable amount of time since the last backup is the Recovery Point Objective (RPO), or how much data loss can be tolerated. In practice with backup tape this is usually the previous night's backup: if the data is lost at any point before the next backup is written we can roll back to last night's data from tape, so the RPO is 24 hours.
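The RPO arithmetic is simple enough to sketch. The function names and the timestamps below are hypothetical, chosen only to illustrate a nightly backup and a 24 hour objective.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup, time_of_loss):
    """How much work is lost if we roll back to the last good backup."""
    return time_of_loss - last_good_backup


def within_rpo(last_good_backup, time_of_loss, rpo=timedelta(hours=24)):
    """True if rolling back to the last good backup stays inside the RPO."""
    return data_loss_window(last_good_backup, time_of_loss) <= rpo


# Hypothetical example: backup written at 23:00, data lost at 15:00 next day.
last_night = datetime(2009, 2, 12, 23, 0)
loss = datetime(2009, 2, 13, 15, 0)
print(data_loss_window(last_night, loss))  # 16:00:00
print(within_rpo(last_night, loss))        # True

# If last night's backup turns out to be unusable, the night before is all we have.
night_before = datetime(2009, 2, 11, 23, 0)
print(within_rpo(night_before, loss))      # False
```

The last case is the uncomfortable one: a single bad nightly backup quietly turns a 24 hour RPO into something nearer 48.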

There is an assumption that the backup has been successful. Not just in the sense that the backup completion notice announces a success, but that the tape write head isn't misaligned and writing bogus data, that the tape drive was operating correctly, that the tape cassette is intact, that the tape hasn't been chewed by tape weevils or been stretched, that the tension is correct, that the tape has historically been stored upright rather than side on, that the fire safe has been kept at the correct temperature and humidity, and perhaps any number of other things that I am completely unaware of.
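One way to test that assumption, rather than trust the completion notice, is to restore a file and compare it with the original. The sketch below does this with SHA-256 checksums from Python's standard hashlib module. The function names are my own, and it is an illustration of the idea rather than a description of what our backup software does.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source_path, restored_path):
    """True if the restored copy is byte-for-byte identical to the source."""
    return sha256_of_file(source_path) == sha256_of_file(restored_path)
```

In practice the live file may have changed since the backup ran, so it would be better to record the checksum at backup time and compare the restored copy against that. A test restore like this only proves that one tape was readable on one day, but that is still more than the completion notice tells us.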

One option that moves away from direct backup to tape is to keep the tape backup concept but use virtual tapes and virtual tape drives configured on a disk array or Virtual Tape Library (VTL). There are benefits in that existing backup software can be used, but this is the idea that gives me the Electronic Charleston heebie-jeebies. There are better ways of backing up to disk than using a hammer to drive home a screw. But VTL won't be ruled out and it will be looked at fairly; my opinions may not be valid, which is not an unlikely occurrence.

