Last update: December 6, 2000
Secrets I learned...
about Disaster Recovery
The 82nd floor
Perhaps the most significant contribution that you, as a Systems
Administrator, will make to an organization is something that people
don't like to talk about.
It's called Disaster Recovery.
I have a theory on why people don't like to talk about Disaster
Recovery. People don't like to talk about things that are scary,
particularly if they can avoid talking about them. And this is all about
something that's quite scary.
You in this room are professionals. That means you care about what
you do; you accept responsibility for what you do.
And what do you do? You manage and are responsible for the well-being
of a lot of an organization's information. And I don't need to tell
you that, in today's world, information is the principal asset of
most organizations.
A disaster is going to happen. I promise. The only questions are
when it is going to happen and what preparations you will have
put in place in anticipation of that day.
In order to prepare, you have to rehearse.
If you haven't rehearsed, you haven't seen the whole thing through.
Do you know what the man who jumped off the Empire State Building
said as he passed the 82nd floor? He said "so far, so good."
There are many different types of disaster. You have to be prepared
for all of them. We'll talk a little about most of them, but at
the top of the list is a disk failure.
You are...
You are a Systems Administrator at a hospital. The radiology department
takes a lot of X-rays. Those X-rays aren't kept on big pieces of film
anymore. They aren't even kept on micro-film. They are data, kept on
disk.
The disk just crashed.
You can't get it back up again.
The doctor can't get at an X-ray.
Somebody could die.
Do I make myself clear?
You are...
You are the System Administrator hired by a small company.
You have a staff of one. You're it.
Perhaps I just hired you. (Now you're in really big
trouble) I manufacture gizmos. I've been doing it for years and I know
all about gizmos. I've got great plans for the future.
Computers? Oh ya, I know all about a computer. You turn it on and it
works.
- The disk just crashed.
- You can't get me back up again.
That disk is your baby. And I mean baby. Have you ever cared for a
baby? That baby is an absolute miracle of modern technology. But
let me tell you something about that baby.
[Figure: dust particle; human hair cross-section; distance between disk surface and disk read/write head; disk surface]
I may not fire you. I may not be able to afford to fire you.
I may just close up the whole business.
Do I make myself clear?
You are...
You are being interviewed by a medium-sized business for a job as
Systems Administrator. They are telling you about their great plans
for the future. You're not hearing much about their Disaster Recovery
plans.
You should be concerned.
I've seen a disk that's gone toast. And I mean toast. You could smell
it.
So what happens if you can't get it back up and running?
- They can't process any new orders.
- If they were doing e-commerce, they can't even get any new
orders.
- They can't process any of their Accounts Receivable to collect
money.
- They can't provide the reports the bank wants for their line of
credit.
- They can't process any of their Payroll to pay any of their
employees.
All of a sudden, they're in the same shape as that disk.
"Oh that's ridiculous", you say. "It's not that fragile."
Disks are hermetically sealed. I'll tell you a secret. It's a lie.
How fast does a disk spin? About 10,000 rpm? Think about it. That means
that, every second, that hunk of metal is turning around about 167
times.
There are a bunch of round ball bearings in there. Let me tell you
another secret. There's no such thing as a perfectly round ball
bearing.
Remember how close that hunk of metal was to its read/write head. I
sure hope that disk doesn't develop a wobble.
Do I make myself clear?
You are...
You are a student, grappling with the sizable task of learning about
a lot of things in anticipation of your future role as a Systems
Administrator. Every day, to and from school, you carry around your
hard disk. How much do you think about it?
I have one too.
- I have a special case for it.
- Inside the case is plastic bubble wrap.
- Inside the plastic bubble wrap is a cardboard box lined with foam.
- Inside the foam is the disk.
- When I come to class on a cold day, I make sure the disk has had time
to warm up to room temperature before I put it in the drive.
- Did you also know that I bought two of them?
- Did you also know that any important work I'm doing is backed up
to floppy or to another machine?
It IS going to happen.
Two Aspects to Disaster Planning
- The first is what to do in anticipation of a disaster.
- The second is what to do WHEN the disaster happens.
In terms of disk data, which is probably the only thing that makes
your site different from any other site, we hear the term "Backup and
Restore". Usually that involves some method of copying data to tape
or another disk.
People say "Backup and Restore".
Listen carefully.
Are they really saying "BACKUP (and restore)"?
Certainly, if you have no backup, you cannot restore.
However, even if you have a backup, you may still be unable to
restore.
If that's the case, what's the value of the Backup?
You tell me that you plan to implement a backup of your system
every day. After three years you will have done about 1,000
backups. Fine. That's wonderful.
Now I ask you this question. During those 3 years, how many full
dress rehearsals will you have done of a complete restore? How
many times will you have, at the very least, removed the disk from your
system and completely re-created your system from one of those
backups?
If your answer is silence, I'm not going to hire you.
Those 1,000 backups are pretty close to useless if you've never done
a complete restore.
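
Part of a rehearsal can be checked mechanically. Here is a minimal sketch in Python, under stated assumptions: the backup is a tar archive, and the function names (make_backup, checksum_tree, rehearse_restore) are made up for illustration. It restores the archive into a scratch directory and reports every file that is missing or differs from the original. It does not replace a real dress rehearsal on a blank disk.

```python
import hashlib
import os
import tarfile
import tempfile


def make_backup(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to source_dir.
        tar.add(source_dir, arcname=".")


def checksum_tree(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            with open(full, "rb") as fh:
                sums[os.path.relpath(full, root)] = hashlib.sha256(fh.read()).hexdigest()
    return sums


def rehearse_restore(source_dir, archive_path):
    """Restore the archive into a scratch directory and return the sorted
    relative paths that are missing from it or differ from source_dir."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(scratch)
        restored = checksum_tree(scratch)
    original = checksum_tree(source_dir)
    return sorted(p for p in original if restored.get(p) != original[p])
```

An empty list means every file came back intact. Anything else is a file you would have lost.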
Panic
Now, when things are quiet, now when people are calm and collected, now
is the time to deal with "panic". Not when all hell breaks loose
and the panic really happens.
When the panic really happens, you're going to be worried and
upset. People are going to be milling around you saying intelligent
things like "Is it fixed yet?". Your daughter's school play is tonight.
You're probably going to miss it.
So this is NOT the time to be asking yourself
"hmmmm... how do I do this?"
End of pitch
I get an "A" if I've been able to make you feel uneasy.
That's a strange thing to say... After all, recognizing the
problem is only the first step. So I'll tell you why.
You are not a dummy. You know that, if you have to find out how
to do something, then you can find out how to do that something.
I will have succeeded if I've convinced you that this is serious
business and that you have to address the two aspects of Disaster
Planning.
That's my pitch. That's the important part of what I want to say.
I've done my best to strike the fear of the Almighty into you.
Anything after this is just detail.
For the record, here is some of that detail.
Your Lifeboat
Sitting in your lifeboat, watching the Titanic disappear beneath the
waters, you'll need your folder of all the information about what that
great ship once looked like.
- It contains a complete inventory listing of all the hardware that's in
the machine that just went down. It contains instructions on how you
go about replacing every single piece of that machine.
(Panic time is not the time to go out shopping.)
- It's labelled in 3-inch high letters: "Don't Panic".
- It ought to have step by step recorded instructions on exactly what
you'll need to do. I suggest you make those instructions as clear and
straightforward as possible. Remember: when the time comes to rely
on those instructions, a bunch of people may be in panic mode.
You'll probably be upset. Give yourself all the help you possibly can.
- It contains instructions on where you'll find an appropriate blank
hard disk. If the disaster was a disk crash, you may only need a new
disk.
- That information needs to be kept up to date.
- If you keep that folder in the computer room, it won't help you
much if the computer room burns down. You need at least two copies.
- You're going to have to boot this new machine. You need a floppy
prepared to do so. Again, you need at least two copies of that floppy.
You can't do this alone.
You can't do this alone. This is going to take time and money.
You need co-operation. You need budget and project approval.
After you've dreamed up every possible disaster scenario you can
imagine, you need to ask someone else to do the same. See if you've
missed anything.
You need a dress rehearsal.
You absolutely need to simulate a real disaster.
Actually, you need to do two dress rehearsals. After you've done
the first one, you need to have someone else follow your written
instructions and do a second one. (You'll need this second test if
you ever want the flexibility of attending your daughter's school play.)
Personal tip to students
You're going to be having some job interviews. This is my own
suggestion. You don't have to follow it. It's only my personal
suggestion.
You know that time in the interview when you get asked "Do you have
any questions?"
Depending on what has or hasn't been said, I'd ask:
"What are your Disaster Plans?"
And listen carefully to the answer.
If you feel uneasy about the answer you get, make that concern known.
If you can't get the co-operation and commitment you need to rehearse
a disaster -- a disk crash at least -- then you have to ask yourself,
"Do I really want to work for these people?"
Two Types of Backups
There are 2 different reasons for backups. One is in anticipation of
a complete system restoration. That's what we're talking about here.
The other is in anticipation of a particular user saying something
like "I screwed up a document I was working on. Can you restore it
from last night's backup for me please?"
This second reason requires different considerations. It may be
sufficient and practical to back up selected directories or
directory trees for this purpose. In fact, it may not even be necessary
to back these up to tape. You might tar or perhaps
tar and gzip them to some other place on the same
disk.
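
As a rough illustration of that second kind of backup, here is a minimal Python sketch that does the equivalent of "tar and gzip" for a list of directory trees. The function names, the label, and the date-stamped file name are assumptions of mine, not an established convention.

```python
import os
import tarfile
import time


def archive_trees(trees, dest_dir, label="userfiles"):
    """Write the given directory trees into one gzip-compressed tar file
    under dest_dir and return the path of the archive that was written."""
    os.makedirs(dest_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(dest_dir, f"{label}-{stamp}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        for tree in trees:
            # Store paths without the leading "/" so the archive can be
            # unpacked anywhere, not only on top of the original files.
            tar.add(tree, arcname=os.path.abspath(tree).lstrip(os.sep))
    return archive_path


def list_archive(archive_path):
    """Return the member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

A call such as archive_trees(["/home"], "/var/backups") (both paths hypothetical) would be enough for the "I screwed up a document" case. Since the archive sits on the same disk, it is no help at all when that disk crashes.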
My focus here is on the complete system scenario. And, by the way,
THE book to get is O'Reilly's "Unix Backup and Recovery".
So many scenarios
There are many ways to address the backup issue because there are
so many very different sets of needs.
Do you have one computer or one hundred?
In a many-system scenario, do some have different backup needs than
others? Are some mission critical and others not? Are some Unix and
some Windows and some Macs? Are they networked together? Is the
network IP based?
It may make sense, both administratively and economically, to have a
central machine whose only role is to back up the others.
There are too many scenarios to discuss here. Each deserves a
full discussion. So let's start at the beginning: just one machine,
running Linux.
Image vs File
The first issue is whether to do an image backup or a file backup.
At first blush, the image backup seems easier.
- To do the backup, you make an exact byte for byte copy of the entire
disk to a tape.
- To do the restore, you copy that tape, byte for byte back to a disk.
But that may not be a good idea.
- In order to do an image backup, you need to make sure that no
changes are happening to the disk while you're creating that
image. That means no one else can be using it, which may be impractical.
Partitions have to be mounted "read only".
- It's going to take more time. It's going to require more tape. Remember
that when you're backing up an entire disk, you're also backing
up those parts of it that aren't really being used at all.
- You're backing up the MBR. The MBR points to specific disk locations,
i.e. cylinder, head, sector. That means that the new disk to which you
want to restore had better be an exact replica.
File-by-File recommended
Consider then the file-by-file alternative. Be advised, however, that some
applications, such as an Oracle database, may not use the file structure!
I'm not going to talk about that now, but it's something you should
keep in mind.
As always in Unix, there is more than one way to do it. (I'd make
an acronym out of that, but someone has already beaten me to it.)
For example, consider the issue of User Account Administration.
- Various files and directories are relevant to this task,
such as /etc/passwd and /home. The task can
actually be accomplished with little more than manipulating them.
- Helpful utilities are available to assist you, such as
useradd.
- Fancier front ends that are comprehensive and more friendly such as
Red Hat's linuxconf are also available. They may, in fact, do more
than "ordinary" User Account administration.
Although these front-ends may make things
easier to do, a knowledge of the underlying things is just as important.
The same is true of backups.
Unfortunately, unlike User Account Maintenance, there are many more
issues to consider here. But the principle is the same. And, all the
more so, you need to know what's going on.
Meta Data
Meta Data is "data about data". Suppose you have a very small
partition and copy all its files to a diskette. Could you restore
that partition slice onto a hard disk from the data on that diskette?
No.
Although all the files are there, the information about the partition
itself is not. What's missing is "the data about that data".
The command fdisk -l lists partitions, start and end cylinders,
block sizes, ids and types. That's meta data. But that information is
nowhere to be found on the floppy you just made.
Print it out. Put a copy in each of your (minimum) two folders.
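
To keep that printout current, the capture itself can be scripted. The following is only a sketch: capture_metadata is a name I made up, and it simply runs whatever command you hand it and saves the output with a dated header line.

```python
import subprocess
import time


def capture_metadata(command, out_path):
    """Run command (a list of arguments), save its output to out_path
    under a dated header line, and return the captured text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    header = "# " + " ".join(command) + " captured " + time.strftime("%Y-%m-%d %H:%M") + "\n"
    with open(out_path, "w") as fh:
        fh.write(header + result.stdout)
    return result.stdout
```

Run as root, a call like capture_metadata(["fdisk", "-l"], "/root/partitions.txt") (the output path is hypothetical) would produce the file to print. The saved file lives on the very disk you expect to lose, so the paper copies in the folders are still the point.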
No Recipe
There is no recipe on how to do a backup and restore.
The details are too varied.
In particular, it's very dependent on your precise needs and
situation. Here are some ideas.
Install and install on top of install
One approach is to re-install from the distribution CDROMs and then
to restore your backed-up data on top of that. A problem with this
approach is that you have to exactly duplicate all the options you
chose when you did that install in the first place.
Red Hat provides a utility called "kickstart" which records the
details of your installation.
A friend of mine swears by kickstart.
- Install the mkkickstart rpm.
- Run mkkickstart >/tmp/ks.cfg
- Fix up the ks.cfg file and put it on a floppy.
- Boot the install CD, and say "linux ks=floppy" on the first screen.
- Watch it install everything, no questions to answer.
He claims that:
"If you are set up with kickstart, you don't need to do a full restore.
You just have to install your user files, customize a few things in
/etc, and add the rpms that you did not get from the redhat cds."
I don't recommend such a strategy. In any but the simplest installation,
how would you know those "few things" he mentions? To be safe,
you'd have to back up everything. Thus, not only would you be doing the
re-install, but you'd then be going through a process of overwriting
everything you'd just installed.
The whole 9 yards
I'd sooner recommend creating a bootable floppy that gives you
essentially all the standard Unix you really need, together with
all your specific data on tape.
Tom's root boot really squeezes just about everything you need onto
a single floppy. In fact, he cheats a little by taking advantage of
the capabilities of relatively current hardware which is capable of
storing more than 1.44MB on a floppy.
The CDROM that comes with O'Reilly's "Unix Backup and Recovery" by
W. Curtis Preston includes a copy. It can also be freely downloaded.
Bare Metal tools
It can be argued that all the tools you need to properly do a
backup (that's restorable) are available already in Unix.
dd
As you know, dd can be used to create a bootable floppy. You're
going to need to do that because you're going to have a hard disk
with nothing on it.
If your data is not kept in a file-system, you may need to use
dd to copy it as an entire device.
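
For those who want to see what "copy it as an entire device" amounts to, here is a minimal Python sketch of a dd-style block copy. It is an illustration, not a replacement for dd. The function name and the default block size are my own choices, and the checksum is an addition so that the copy can be compared against the source later.

```python
import hashlib


def copy_image(src_path, dst_path, block_size=64 * 1024):
    """Copy src_path to dst_path block by block, as dd does, and return
    the number of bytes copied and the SHA-256 digest of the data."""
    digest = hashlib.sha256()
    total = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break  # end of the file or device
            dst.write(block)
            digest.update(block)
            total += len(block)
    return total, digest.hexdigest()
```

The copy neither knows nor cares whether the bytes are a file system, a database, or empty space. That is both the strength and the weakness of the image approach.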
cpio
cpio is perhaps the grand-daddy of utilities used for this purpose.
Recently, however, tar has gained more general community acceptance.
Versions of cpio differ in syntax from system to system, even to
accomplish the same thing. Tar is more consistent.
Tar is also more portable. It's even understandable by Windows
systems.
tar
There are a few things that cpio can do that tar can't, but that's
becoming less true. The GNU version of tar pretty much closes the gap
entirely. Some of these issues are subtle, such as the ability
to re-establish initial creation-time, last modification-time and
last access-time on a restored file.
tar (or cpio) is a good choice where non-entire-system backups are
concerned.
dump
Dump and restore represent a higher level tool.
With dump, we introduce the notion of "backup levels". A level 0
backup means the whole thing. Period. But a level 1 backup means
"things that have changed since the last time a level 0 was done".
A level 1 backup would take less time. A level 2 backup means
"things that have changed since the last time a level 1 was done". And so on.
The text editable dumpdates file contains one line for each partition
you have. (Thus, another reason you should have multiple slices.)
For each partition, dumpdates records the level and dates of the last
time you backed it up.
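
The level rule can be sketched in a few lines of Python. This is only an illustration of the selection logic as I understand it: a dump at a given level copies what has changed since the most recent dump at any lower level. The function names and the history list are made up for the example and are not dump's actual data structures.

```python
from datetime import datetime


def reference_dump(level, history):
    """Return the (level, date) of the most recent dump at a lower level
    than `level`, or None for a level 0 dump, which copies everything.

    history is a list of (level, datetime) pairs, one per completed dump,
    similar in spirit to what the dumpdates file records.
    """
    if level == 0:
        return None
    lower = [entry for entry in history if entry[0] < level]
    if not lower:
        raise ValueError("no lower-level dump recorded; do a level 0 first")
    return max(lower, key=lambda entry: entry[1])


def files_to_dump(level, history, file_mtimes):
    """Return the sorted names in file_mtimes (name -> datetime) that a
    dump at `level` would copy."""
    ref = reference_dump(level, history)
    if ref is None:
        return sorted(file_mtimes)
    return sorted(name for name, mtime in file_mtimes.items() if mtime > ref[1])
```

With a level 0 on December 1 and a level 1 on December 3, a level 2 run picks up only what changed after December 3, while another level 1 run would go all the way back to December 1.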
Before dump creates its output, it does a lot of work, first
determining the files it intends to access. Having computed this,
its first output is a "table of contents". When you later want to
restore something, this table of contents is readily available to
restore so it can show you what it's got. Restore allows you to do such
things as cd and ls.
This is THE major difference between dump and cpio/tar.
This two-pass process can create a problem.
During its initial phase, dump makes note of each file it
intends to copy during its next phase. But there may be considerable
time between the two phases. (Copying to tape is slow.) By the time
dump gets around to actually doing the copy, that file may have
changed.
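
The window between the two phases can be made visible with a small Python sketch. To be clear, this is not how dump works internally. It only illustrates the idea of noting each file's state in one pass and checking it again in a later one, and the function names are invented for the example.

```python
import os


def scan(paths):
    """Pass one: note the size and modification time of each file."""
    snapshot = {}
    for path in paths:
        st = os.stat(path)
        snapshot[path] = (st.st_size, st.st_mtime_ns)
    return snapshot


def changed_since_scan(snapshot):
    """Pass two: return the sorted paths whose size or modification time no
    longer matches the snapshot, or which have disappeared."""
    changed = []
    for path, noted in snapshot.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            changed.append(path)
            continue
        if (st.st_size, st.st_mtime_ns) != noted:
            changed.append(path)
    return sorted(changed)
```

Any path this reports is a file whose backed-up copy may not match what was noted in the first pass.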
One of the features of dump is its concern regarding the
file times (change, access, modification) that are associated
with each file. Dump doesn't access files in the "normal"
way, so it can avoid causing those file times to be changed.
Even with the other sources listed below, there is still
going to be some chance of file corruption on an active
(i.e. mounted for write) filesystem that's being backed up.
Tape access takes more time than disk access. The chance of corruption
grows with the time the backup takes. If you have the disk space, one
consideration is to direct the output of the backup to the disk
itself, which will be faster. Then you can copy this temporary disk
file to the tape.
Why bother with other sources
There are quite a number of 3rd party packages. They are not
just prettier. Indeed, once you've got your tape backup procedure
to "work", you might think you're safe. And you may not be.
Unlike disks, the tape read/write mechanism comes into physical contact
with the tape. Wear and tear become a factor. So what works today
may fail tomorrow. Some of these packages go a step or two further
in terms of tape verification and bad-spot recovery than the normal
bare metal utilities.
AMANDA
AMANDA, the Advanced Maryland Automatic Network Disk Archiver, is
a freely available utility. It's quite sophisticated (especially
considering it's free) and has a substantial following.
AMANDA addresses the multiple-machine challenge, allowing you to
set up a master backup server to back up many other machines.
"Unix Backup and Recovery" discusses AMANDA in detail. Read what
Preston has to say about it.
AMANDA users are a self-help group, with a sizable mailing list.
Don't expect to get much help from them until you know what you're
talking about and have read Preston's book at the least.
BRU
BRU, Backup and Recovery Utility, is the chief product of the company
named Enhanced Software Technologies. (The name of the product is
known better than the name of the company.)
EST initially developed BRU for Unix in 1985 and BRU for Linux in 1994.
Operating out of Phoenix, Arizona, they are a friendly group and are
anxious to help. For the novice (to say the least) I'd recommend that
you give them serious consideration.
On December 4, 2000, I received an email from
Larry Bernstein, EST Sales Manager
demonstrating the company's attitude. I had chatted with him
regarding BRU and the Linux Systems Administrator course at Seneca.
You can read that letter (which I html-ized)
and see something of that attitude.
More than just words, Larry is shipping a "Not For Resale" copy of the
full BRU package for use at Seneca. Please ask about it in January.
From their web site, the document entitled "BRU vs. Common Unix
Utilities" notes (in particular) that BRU performs verification
and bad-spot recovery beyond that of bare metal Unix tools.
Parting notes
There cannot be any "conclusions" arising from my writing of this
document. Hopefully, however, there will be some conclusions to be
drawn from reading it.
The principal points that I attempted to cover were:
- This is very serious business. Will you take it seriously?
- Having adopted an appropriate attitude, what are some of the steps
you should follow next?
- What are (quite briefly) some of the technical bare bones details?
- What are some additional sources to pursue?
Sources