Last update: December 6, 2000

Secrets I learned... about Disaster Recovery

The 82nd floor
Perhaps the most significant contribution that you, as a Systems Administrator, will make to an organization is something that people don't like to talk about.

It's called Disaster Recovery.

I have a theory on why people don't like to talk about Disaster Recovery. People don't like to talk about things that are scary, particularly if they can avoid talking about them. And this is all about something that's quite scary.

You in this room are professionals. That means you care about what you do; you accept responsibility for what you do.

And what do you do? You manage and are responsible for the well-being of a lot of an organization's information. And I don't need to tell you that, in today's world, information is the principal asset of most organizations.

A disaster is going to happen. I promise. The only questions are when it is going to happen and what preparations you will have put in place in anticipation of that day.

In order to prepare, you have to rehearse. If you haven't rehearsed, you haven't seen the whole thing through. Do you know what the man who jumped off the Empire State Building said as he passed the 82nd floor? He said "so far, so good."

There are many different types of disaster. You have to be prepared for all of them. We'll talk a little about most of them, but at the top of the list is a disk failure.

You are...
You are a Systems Administrator at a hospital. The radiology department takes a lot of X-rays. Those X-rays aren't kept on big pieces of film anymore. They aren't even kept on micro-film. They are data, kept on disk.

The disk just crashed.
You can't get it back up again.
The doctor can't get at an X-ray.
Somebody could die.
Do I make myself clear?

You are...
You are the System Administrator hired by a small company. You have a staff of one. You're it.
Perhaps I just hired you. (Now you're in really big trouble) I manufacture gizmos. I've been doing it for years and I know all about gizmos. I've got great plans for the future.

Computers? Oh ya, I know all about a computer. You turn it on and it works.

  • The disk just crashed.
  • You can't get me back up again.
That disk is your baby. And I mean baby. Have you ever cared for a baby? That baby is an absolute miracle of modern technology. But let me tell you something about that baby.
[Figure: a dust particle and the cross-section of a human hair, compared with the distance between the disk surface and the disk read/write head.]
I may not fire you. I may not be able to afford to fire you. I may just close up the whole business.

Do I make myself clear?

You are...
You are being interviewed by a medium-sized business for a job as Systems Administrator. They are telling you about their great plans for the future. You're not hearing much about their Disaster Recovery plans.

You should be concerned.

I've seen a disk that's gone toast. And I mean toast. You could smell it.

So what happens if you can't get it back up and running?

  • They can't process any new orders.
  • If they were doing e-commerce, they can't even get any new orders.
  • They can't process any of their Accounts Receivable to collect money.
  • They can't provide the reports the bank wants for their line of credit.
  • They can't process any of their Payroll to pay any of their employees.
All of a sudden, they're in the same shape as that disk.

"Oh that's ridiculous", you say. "It's not that fragile."

Disks are hermetically sealed. I'll tell you a secret. It's a lie.

How fast does a disk spin? About 10,000 rpm? Think about it. That means that, every second, that hunk of metal is turning around about 167 times.

There are a bunch of round ball bearings in there. Let me tell you another secret. There's no such thing as a perfectly round ball bearing.

Remember how close that hunk of metal was to its read/write head. I sure hope that disk doesn't develop a wobble.

Do I make myself clear?

You are...
You are a student, grappling with the sizable task of learning about a lot of things in anticipation of your future role as a Systems Administrator. Every day, to and from school, you carry around your hard disk. How much do you think about it?

I have one too.

  • I have a special case for it.
  • Inside the case is plastic bubble wrap.
  • Inside the plastic bubble wrap is a cardboard box lined with foam.
  • Inside the foam is the disk.
  • When I come to class on a cold day I make sure the disk has had time to warm up to room temperature before I put it in the drive.
  • Did you also know that I bought two of them?
  • Did you also know that any important work I'm doing is backed up to floppy or to another machine?
It IS going to happen.
Two Aspects to Disaster Planning
  • The first is what to do in anticipation of a disaster.
  • The second is what to do WHEN the disaster happens.
In terms of disk data, which is probably the only thing that makes your site different from any other site, we hear the term "Backup and Restore". Usually that involves some method of copying data to tape or another disk.

People say "Backup and Restore".
Listen carefully.
Are they really saying "BACKUP (and restore)"?

Certainly, if you have no backup, you cannot restore.
However, even if you have a backup, you may still be unable to restore.
If that's the case, what's the value of the Backup?

You tell me that you plan to implement a backup of your system every day. After three years you will have done about 1,000 backups. Fine. That's wonderful.

Now I ask you this question. During those 3 years, how many full dress rehearsals will you have done of a complete restore? How many times will you have, at the very least, removed the disk from your system and completely re-created your system from one of those backups?

If your answer is silence, I'm not going to hire you.

Those 1,000 backups are pretty close to useless if you've never done a complete restore.

Panic
Now, when things are quiet, now when people are calm and collected, now is the time to deal with "panic". Not when all hell breaks loose and the panic really happens.

When the panic really happens, you're going to be worried and upset. People are going to be milling around you saying intelligent things like "Is it fixed yet?". Your daughter's school play is tonight. You're probably going to miss it.

So this is NOT the time to be asking yourself

"hmmmm... how do I do this?"
End of pitch
I get an "A" if I've been able to make you feel uneasy. That's a strange thing to say... After all, recognizing the problem is only the first step. So I'll tell you why.

You are not a dummy. You know that, if you have to find out how to do something, then you can find out how to do that something.

I will have succeeded if I've convinced you that this is serious business and that you have to address the two aspects of Disaster Planning. That's my pitch. That's the important part of what I want to say. I've done my best to strike the fear of the Almighty into you. Anything after this is just detail.

For the record, here is some of that detail.

Your Lifeboat
Sitting in your lifeboat, watching the Titanic disappear beneath the waters, you'll need your folder of all the information about what that great ship once looked like.
  • It contains a complete inventory listing of all the hardware that's in the machine that just went down. It contains instructions on how you go about replacing every single piece of that machine. (Panic time is not the time to go out shopping.)
  • It's labelled in 3-inch high letters: "Don't Panic".
  • It ought to have step by step recorded instructions on exactly what you'll need to do. I suggest you make those instructions as clear and straightforward as possible. Remember: when the time comes to rely on those instructions, a bunch of people may be in panic mode. You'll probably be upset. Give yourself all the help you possibly can.
  • It contains instructions on where you'll find an appropriate blank hard disk. If the disaster was a disk crash, you may only need a new disk.
  • That information needs to be kept up to date.
  • If you keep that folder in the computer room, it won't help you much if the computer room burns down. You need at least two copies.
  • You're going to have to boot this new machine. You need a floppy prepared to do so. Again, you need at least two copies of that floppy.
You can't do this alone.
You can't do this alone. This is going to take time and money. You need co-operation. You need budget and project approval.

After you've dreamed up every possible disaster scenario you can imagine, you need to ask someone else to do the same. See if you've missed anything.

You need a dress rehearsal.

You absolutely need to simulate a real disaster.

Actually, you need to do two dress rehearsals. After you've done the first one, you need to have someone else follow your written instructions and do a second one. (You'll need this second test if you ever want the flexibility of attending your daughter's school play.)
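
On a small scale, you can practise the idea of a dress rehearsal on a single directory tree before you try it on a whole machine. What follows is only a sketch, assuming Python's standard tarfile and filecmp modules; the names rehearse_restore and trees_match are my own for illustration. It backs up a tree, restores it into a scratch directory, and checks that what came back is what went in.

```python
import filecmp
import tarfile
import tempfile
from pathlib import Path


def trees_match(left, right):
    """Return True when two directory trees hold the same names and file contents."""
    comparison = filecmp.dircmp(left, right, ignore=[])
    if comparison.left_only or comparison.right_only or comparison.common_funny:
        return False
    # Compare file contents byte for byte rather than trusting size and timestamp.
    _, mismatches, errors = filecmp.cmpfiles(
        left, right, comparison.common_files, shallow=False
    )
    if mismatches or errors:
        return False
    return all(
        trees_match(Path(left) / name, Path(right) / name)
        for name in comparison.common_dirs
    )


def rehearse_restore(source_dir, archive_path):
    """Back up source_dir, restore into a scratch directory, and compare the two.

    archive_path is assumed to live outside source_dir.
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        for child in sorted(source.iterdir()):
            archive.add(child, arcname=child.name)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(scratch)
        return trees_match(source, scratch)
```

A True result only tells you that this one tree survived the round trip. It is no substitute for pulling the disk and rebuilding the system.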

Personal tip to students
You're going to be having some job interviews. This is my own suggestion. You don't have to follow it. It's only my personal suggestion.

You know that time in the interview when you get asked "Do you have any questions?"

Depending on what has or hasn't been said, I'd ask:

"What are your Disaster Plans?"
And listen carefully to the answer.
If you feel uneasy about the answer you get, make that concern known. If you can't get the co-operation and commitment you need to rehearse a disaster -- a disk crash at least -- then you have to ask yourself,
"Do I really want to work for these people?"
Two Types of Backups
There are 2 different reasons for backups. One is in anticipation of a complete system restoration. That's what we're talking about here.

The other is in anticipation of a particular user saying something like "I screwed up a document I was working on. Can you restore it from last night's backup for me please?"

This second reason requires different considerations. It may be sufficient and practical to back up selected directories or directory trees for this purpose. In fact, it may not even be necessary to back these up to tape. You might tar, or perhaps tar and gzip, them to some other place on the same disk.
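
Here is a rough sketch of that second kind of backup, written with Python's standard tarfile module rather than the tar and gzip commands themselves. The function names and the docs/report.txt path in the usage note are made up for illustration. One function writes selected directory trees into a gzip-compressed tar archive; the other pulls a single file back out for the user who screwed up a document.

```python
import tarfile
from pathlib import Path


def backup_trees(trees, archive_path):
    """Write the listed directory trees into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for tree in trees:
            tree = Path(tree)
            # Store each tree under its own directory name, not its absolute path.
            archive.add(tree, arcname=tree.name)


def restore_one_file(archive_path, member_name, dest_dir):
    """Extract a single named file into dest_dir, leaving the live copy alone."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extract(member_name, path=dest_dir)
    return Path(dest_dir) / member_name
```

Restoring into a separate directory, for example restore_one_file(archive, "docs/report.txt", "restored"), lets the user look at last night's version before deciding to overwrite today's.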

My focus here is on the complete system scenario. And, by the way, THE book to get is O'Reilly's "Unix Backup and Recovery".

So many scenarios
There are many ways to address the backup issue because there are so many very different sets of needs.

Do you have one computer or one hundred?

In a many-system scenario, do some have different backup needs than others? Are some mission critical and others not? Are some Unix and some Windows and some Macs? Are they networked together? Is the network IP based?

It may make sense, both administratively and economically, to have a central machine whose only role is to back up the others.

There are too many scenarios to discuss here. Each deserves a full discussion. So let's start at the beginning: just one machine, running Linux.

Image vs File
The first issue is whether to do an image backup or a file backup. At first blush, the image backup seems easier.
  1. To do the backup, you make an exact byte for byte copy of the entire disk to a tape.
  2. To do the restore, you copy that tape, byte for byte back to a disk.
But that may not be a good idea.
  • In order to do an image backup, you need to make sure that no changes might be happening to the disk while you're creating that image. That means no one else can be using it. That may be impractical. Partitions have to be mounted "read only".
  • It's going to take more time. It's going to require more tape. Remember that when you're backing up an entire disk, you're also backing up those parts of it that aren't really being used at all.
  • You're backing up the MBR. The MBR points to specific disk locations, i.e., cylinder, head, sector. That means that the new disk to which you want to restore had better be an exact replica.
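
To make the "more time, more tape" point concrete, here is a minimal sketch of a byte-for-byte copy, with ordinary files standing in for the disk and the tape. It is not how any particular tool is implemented, and image_copy and block_size are names I've picked for illustration. Notice that it has no idea which bytes are in use: it copies all of them.

```python
def image_copy(source_path, dest_path, block_size=1024 * 1024):
    """Copy every byte of source_path to dest_path; return the number of bytes copied."""
    copied = 0
    with open(source_path, "rb") as source, open(dest_path, "wb") as dest:
        while True:
            block = source.read(block_size)
            if not block:
                break  # end of the source reached
            dest.write(block)
            copied += len(block)
    return copied
```

Run against a 256 KB image holding four bytes of real data, it still copies the full 256 KB.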
File-by-File recommended
Consider then the file-by-file alternative. Be advised, however, that some applications, such as an Oracle database, may not use the file structure! I'm not going to talk about that now, but it's something you should keep in mind.

As always in Unix, there is more than one way to do it. (I'd make an acronym out of that, but someone has already beaten me to it.)

For example, consider the issue of User Account Administration.

  • Various files and directories are relevant to this task, such as /etc/passwd and /home. The task can actually be accomplished with little more than manipulating them.
  • Helpful utilities are available to assist you, such as useradd.
  • Fancier front ends that are comprehensive and more friendly such as Red Hat's linuxconf are also available. They may, in fact, do more than "ordinary" User Account administration.
Although these front-ends may make things easier to do, a knowledge of the underlying things is just as important.

The same is true of backups.

Unfortunately, unlike User Account Maintenance, there are many more issues to consider here. But the principle is the same. And, all the more so, you need to know what's going on.

Meta Data
Meta Data is "data about data". Suppose you have a very small partition and copy all its files to a diskette. Could you restore that partition slice onto a hard disk from the data on that diskette?

No.

Although all the files are there, the information about the partition itself is not. What's missing is "the data about that data".

The command fdisk -l lists partitions, start and end cylinders, block sizes, ids and types. That's meta data. But that information is nowhere to be found on the floppy you just made.

Print it out. Put a copy in each of your (minimum) two folders.
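
If you want that printout produced the same way every time, something along these lines might do. It is a sketch, assuming Python's standard subprocess module; record_metadata is a name I made up, and which commands you hand it is up to you.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def record_metadata(commands, report_path):
    """Run each command and save its output, labelled, in one printable text file."""
    sections = [f"Recorded {datetime.now():%Y-%m-%d %H:%M}"]
    for argv in commands:
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
            output = result.stdout + result.stderr
        except OSError as error:
            # The command is missing or not executable; note that and carry on.
            output = f"could not run: {error}\n"
        sections.append("$ " + " ".join(argv) + "\n" + output)
    Path(report_path).write_text("\n".join(sections))
    return Path(report_path)
```

Called as record_metadata([["fdisk", "-l"]], "partition-metadata.txt"), normally as root, it would capture the fdisk listing mentioned above. The file name is only an example. A copy of the result still belongs on paper, in both folders.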

No Recipe
There is no recipe on how to do a backup and restore.
The details are too varied.
In particular, it's very dependent on your precise needs and situation. Here are some ideas.
Install and install on top of install
One approach is to re-install from the distribution CDROMs and then to restore your backed-up data on top of that. A problem with this approach is that you have to exactly duplicate all the options you chose when you did that install in the first place.

Red Hat provides a utility called "kickstart" which records the details of your installation.

A friend of mine swears by kickstart.

  1. Install the mkkickstart rpm.
  2. Run mkkickstart >/tmp/ks.cfg
  3. Fix up the ks.cfg file and put it on a floppy.
  4. Boot the install CD, and say "linux ks=floppy" on the first screen.
  5. Watch it install everything, no questions to answer.
He claims that:
"If you are set up with kickstart, you don't need to do a full restore. You just have to install your user files, customize a few things in /etc, and add the rpms that you did not get from the redhat cds."

I don't recommend such a strategy. In any but the simplest installation, how would you know those "few things" he mentions? To be safe, you'd have to back up everything. Thus, not only would you be doing the re-install, but you'd then be going through a process of overwriting everything you'd just installed.

The whole 9 yards
I'd sooner recommend the creation of a bootable floppy with which you essentially have all the standard Unix you really need, together with all your specific data on tape.

Tom's root boot really squeezes just about everything you need onto a single floppy. In fact, he cheats a little by taking advantage of the capabilities of relatively current hardware which is capable of storing more than 1.44MB on a floppy.

The CDROM that comes with O'Reilly's "Unix Backup and Recovery" by W. Curtis Preston includes a copy. It can also be freely downloaded.

Bare Metal tools
It can be argued that all the tools you need to properly do a backup (that's restorable) are available already in Unix.
dd
As you know, dd can be used to create a bootable floppy. You're going to need one, since you're going to be starting with a hard disk that has nothing on it.

If your data is not kept in a file-system, you may need to use dd to copy it as an entire device.

cpio
cpio is perhaps the grand-daddy of utilities used for this purpose. Recently, however, tar has gained more general community acceptance. Versions of cpio from system to system have a different syntax to accomplish the same thing. Tar is more consistent.

Tar is also more portable. It's even understandable by Windows systems.

tar
There are a few things that cpio can do that tar can't, but that's becoming untrue. The GNU version of tar pretty much closes the gap entirely. Some of these issues are subtle, such as the ability to re-establish initial creation-time, last modification-time and last access-time on a restored file.
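
Whatever tool you settle on, you can test the timestamp question rather than take anyone's word for it. The sketch below uses Python's standard tarfile module as a stand-in, not GNU tar or cpio, and surviving_times is a name of my own. It archives one file, restores it elsewhere, and reports which times came back.

```python
import tarfile
import tempfile
from pathlib import Path


def surviving_times(file_path):
    """Archive one file, restore it elsewhere, and report which timestamps came back."""
    original = Path(file_path)
    before = original.stat()
    with tempfile.TemporaryDirectory() as work:
        archive_path = Path(work) / "one-file.tar"
        with tarfile.open(archive_path, "w") as archive:
            archive.add(original, arcname=original.name)
        restore_dir = Path(work) / "restored"
        with tarfile.open(archive_path, "r") as archive:
            archive.extractall(restore_dir)
        after = (restore_dir / original.name).stat()
    # Compare whole seconds: some tar formats do not keep fractions of a second.
    return {
        "modification": int(after.st_mtime) == int(before.st_mtime),
        "access": int(after.st_atime) == int(before.st_atime),
    }
```

With this particular library I'd expect the modification time to come back and the access time generally not to. Run the same kind of check against the tool and options you actually plan to use.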

tar (or cpio) is a good choice where non-entire-system backups are concerned.

dump
Dump and restore represent a higher level tool.

With dump, we introduce the notion of "backup levels". A level 0 backup means the whole thing. Period. But a level 1 backup means "things that have changed since the last time a level 0 was done". A level 1 backup would take less time. A level 2 backup means "things since the last time a level 1 was done". And so on.

The text-editable dumpdates file contains one line for each partition and dump level. (Thus, another reason you should have multiple slices.) For each partition, dumpdates records the level and date of the last time you backed it up.
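
The level rule is easier to see worked through. This is a toy model, not dump itself: reference_time and files_to_dump are invented names, and the history list merely plays the part of dumpdates for one partition. It follows the rule as the dump manual page states it, that a level above 0 copies files new or modified since the last dump of a lower level, which matches the description above when the levels are run in order.

```python
from datetime import datetime


def reference_time(history, level):
    """Return the time of the most recent dump at a lower level, or None for 'everything'.

    history is a list of (level, datetime) pairs for one partition.
    """
    if level == 0:
        return None  # a level 0 dump takes the whole thing
    lower = [when for past_level, when in history if past_level < level]
    return max(lower) if lower else None


def files_to_dump(files, history, level):
    """List the files a dump at this level would pick up.

    files maps each file name to its last-modified datetime.
    """
    since = reference_time(history, level)
    return sorted(
        name for name, modified in files.items() if since is None or modified > since
    )


history = [(0, datetime(2000, 12, 1)), (1, datetime(2000, 12, 3))]
files = {
    "a": datetime(2000, 11, 30),
    "b": datetime(2000, 12, 2),
    "c": datetime(2000, 12, 4),
}
print(files_to_dump(files, history, 1))  # ['b', 'c']
print(files_to_dump(files, history, 2))  # ['c']
```

With a level 0 on December 1 and a level 1 on December 3, a new level 1 picks up everything changed since December 1, while a level 2 picks up only what changed since December 3.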

Before dump creates its output, it does a lot of work, first determining the files it intends to access. Having computed this, its first output is a "table of contents". When you later want to restore something, this table of contents is readily available to restore so it can show you what it's got. Restore allows you to do such things as cd and ls. This is THE major difference between dump and cpio/tar.

This two-pass process can create a problem. During its initial phase, dump makes note of each file it intends to copy during its next phase. But there may be considerable time between the two phases. (Copying to tape is slow.) By the time dump gets around to actually doing the copy, that file may have changed.
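
One way to see how exposed you are is to measure it. The following is a sketch of the idea only, not of what dump does internally, and take_snapshot and changed_since are hypothetical names. Note each file's size and modification time in a first pass, then, after the slow copy, list the files that no longer match.

```python
import os


def take_snapshot(paths):
    """First pass: note the size and modification time of every file to be copied."""
    snapshot = {}
    for path in paths:
        status = os.stat(path)
        snapshot[path] = (status.st_size, status.st_mtime_ns)
    return snapshot


def changed_since(snapshot):
    """Second pass: list files whose size or modification time no longer matches."""
    changed = []
    for path, noted in snapshot.items():
        try:
            status = os.stat(path)
            current = (status.st_size, status.st_mtime_ns)
        except FileNotFoundError:
            current = None  # the file was deleted between the two passes
        if current != noted:
            changed.append(path)
    return sorted(changed)
```

Anything this reports after a backup is a file whose copy on tape you shouldn't trust without a second look.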

One of the features of dump is its concern for the times (creation, access, modification) that are associated with each file. Dump doesn't access files in the "normal" way, so it can avoid causing those file times to be changed.

There is still, even with the other sources listed below, going to be some chance of file corruption on an active (i.e., mounted for write) filesystem that's being backed up.

Tape access takes more time than disk access, and the chance of corruption grows with the time the backup takes. If you have the disk space, consider directing the output of the backup to the disk itself, which will be faster. Then you can copy this temporary file to the tape.

Why bother with other sources
There are quite a number of 3rd party packages. They are not just prettier. Indeed, once you've got your tape backup procedure to "work", you might think you're safe. And you may not be.

Unlike disks, the tape read/write mechanism comes into physical contact with the tape. Wear and tear become a factor. So what works today may fail tomorrow. Some of these packages go a step or two further in terms of tape verification and bad-spot recovery than the normal bare metal utilities.
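
The simplest form of verification is to read back what you wrote and compare. Here is a minimal sketch, assuming Python's standard hashlib module. It says nothing about how any of the packages below do their verification, and digest_of and copy_is_intact are my own names.

```python
import hashlib


def digest_of(path, block_size=1024 * 1024):
    """Read a file in blocks and return its SHA-256 digest as hex text."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_is_intact(original_path, copy_path):
    """Return True when the copy reads back identical to the original."""
    return digest_of(original_path) == digest_of(copy_path)
```

A check like this tells you a copy has gone bad. It does nothing to get the data back, which is where bad-spot recovery comes in.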

AMANDA
AMANDA, the Advanced Maryland Automatic Network Disk Archiver, is a freely available utility. It's quite sophisticated (especially considering it's free) and has a substantial following.

AMANDA addresses the multiple-machine challenge, allowing you to set up a master backup server to back up many other machines. "Unix Backup and Recovery" discusses AMANDA in detail. Read what Preston has to say about it.

AMANDA users are a self-help group, with a sizable mailing list. Don't expect to get much help from them until you know what you're talking about and have, at the least, read Preston's book.

BRU
BRU, Backup and Recovery Utility, is the chief product of the company named Enhanced Software Technologies. (The name of the product is known better than the name of the company.)

EST initially developed BRU for Unix in 1985 and BRU for Linux in 1994. Operating out of Phoenix, Arizona, they are a friendly group and are anxious to help. For the novice (to say the least) I'd recommend that you give them serious consideration.

On December 4, 2000, I received an email from Larry Bernstein, EST Sales Manager demonstrating the company's attitude. I had chatted with him regarding BRU and the Linux Systems Administrator course at Seneca. You can read that letter (which I html-ized) and see something of that attitude.

More than just words, Larry is shipping a "Not For Resale" copy of the full BRU package for use at Seneca. Please ask about it in January.

From their web site, the document entitled BRU vs. Common Unix Utilities notes (in particular) that BRU performs verification and bad-spot recovery beyond that of bare metal Unix tools.

Parting notes
There cannot be any "conclusions" arising from my writing of this document. Hopefully, however, there will be some conclusions to be drawn from reading it. The principal points that I attempted to cover were:
  1. This is very serious business. Will you take it seriously?
  2. Having adopted an appropriate attitude, what are some of the steps you should follow next?
  3. What are (quite briefly) some of the technical bare bones details?
  4. What are some additional sources to pursue?

Sources