Ten Key Steps for Business Continuity/Disaster Recovery Planning

No matter how many different scenarios you test your systems and network administrators on during dry runs, when disaster strikes you can be sure it will happen in a way nobody foresaw. Your applications, servers, networks, and storage are monitored continuously by sophisticated management tools that alert operators when certain limits are exceeded: hard disk storage capacity, network bandwidth, or UPS battery life, for example. However, while a wide range of risks can be measured and monitored, effective response depends on well-designed plans and procedures, as well as – on some occasions – the knowledge of a skilled and experienced sysadmin and their “personal” toolkit.
Standard IS management methodologies such as ITIL recommend documenting every change to a system and backing this up with regular inspections and testing. The system evolves in a virtuous circle of improvements, an approach tried and tested in manufacturing and service industry quality circles as well as in physical access security implementations.
This approach is efficient and predictable, but business continuity and disaster recovery planning must also be ready to respond to a wide range of unpredictable events. Businesses, from board level to end users, depend on their IT systems. An effective disaster recovery plan allows the IS team to commit to restoring service within a given timescale, giving end users and management confidence that IT is working with them to support corporate business goals.
There are ten key stages in rolling out an effective disaster recovery plan – see the table below. Three points are worth looking at a little more closely.
When major organizational changes occur (line 3 in the table), you'll need to review business-critical data and applications. The business impact of any interruption must be evaluated, and the risk ranked according to severity and probability. Only with this information can failover processes be defined to ensure the most critical physical and virtual servers are prioritized.
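As an illustration of ranking risks by severity and probability, here is a minimal Python sketch. The scoring rule (probability multiplied by severity), the 1 to 5 severity scale, and the example risks and figures are all assumptions made for illustration; they are not taken from any particular methodology, and a real assessment would use the organization's own scales.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # assumed: estimated chance of occurring in a year, 0.0 to 1.0
    severity: int       # assumed: business impact from 1 (minor) to 5 (critical)

    @property
    def score(self) -> float:
        # Simple illustrative score; other weightings are equally valid.
        return self.probability * self.severity


def rank_risks(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical figures, for illustration only.
risks = [
    Risk("Primary site power failure", 0.10, 5),
    Risk("Network link outage", 0.30, 3),
    Risk("Single disk failure", 0.60, 1),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: {risk.score:.2f}")
```

The ordered list can then drive the failover sequence, with the servers exposed to the highest-scoring risks handled first.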
When major technical changes occur (line 7 in the table), it's essential to review the main site and backup site infrastructure – it's often necessary to update certain admin parameters – to ensure successful failover when it's needed. Allocating an appropriate budget to monitoring tools is key to success, even if open source solutions such as Nagios® help reduce outlays. Large corporations with many legacy systems to support frequently have multiple systems management platforms running side by side. This can lead to interoperability issues when failover occurs, and full replication of critical servers may be the optimum solution.
A final point: the full disaster recovery process needs to be tested thoroughly once or twice a year in a simulated incident, including the (temporary) transfer of operators and administrators to the backup facility.
# | Description | Secondary site? | Frequency | Notes
1 | Define the business needs and objectives | No | Once or twice a year | List risks, probability, severity
2 | Define priorities according to risk probability and impact severity | Possibly | Once or twice a year | Measure the risks
3 | Identify business-critical applications and data | No | Review after any major organizational change | Limit the impact of key risks by measuring outcomes
4 | Define maximum allowable data loss target: one hour, one day, other? | No | Once or twice a year | Set data backup objectives
5 | Define maximum allowable downtime | Possibly | Once or twice a year | Set recovery time objective
6 | Document disaster recovery/business continuity plan requirements | Yes | Once or twice a year | Set goals for improvement
7 | Select the appropriate failover infrastructure and management solutions | Yes | Review after any major organizational change | Don't overlook legacy systems; define individual roles and responsibilities (eg, principal contacts)
8 | Implement training plans and exercises for disaster recovery staff | Possibly | Once or twice a year | Get users involved
9 | Regularly run complete system failover and recovery tests | Yes | Once or twice a year | Use simulated “real life” incidents
10 | Keep the Disaster Recovery/Business Continuity Plan up to date | Yes | Review after any major organizational change | If necessary, restart the entire planning process from step 1.

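Step 4 in the table sets a maximum allowable data loss target. One way to check a backup schedule against such a target is sketched below in Python. The function names, the backup timestamps, and the one-hour and one-day targets are hypothetical; the sketch only assumes that completed backup times are available as a list.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, now):
    """Longest stretch of time not covered by a more recent backup.

    If a failure struck at the end of that stretch, this is how much
    data (measured in time) would be lost.
    """
    if not backup_times:
        raise ValueError("at least one backup time is required")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_data_loss_target(backup_times, now, target):
    """True if the worst-case loss stays within the allowed target."""
    return worst_case_data_loss(backup_times, now) <= target


# Hypothetical backup history, for illustration only.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 3, 30),
]
now = datetime(2024, 1, 1, 4, 0)

print(worst_case_data_loss(backups, now))
print(meets_data_loss_target(backups, now, timedelta(hours=1)))
print(meets_data_loss_target(backups, now, timedelta(days=1)))
```

With this history the schedule would satisfy a one-day target but not a one-hour target, because of the gap between the second and third backups.
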
Automatic failover is essential
The ultimate validation for a disaster recovery plan is a “real life” test. A network outage that would result in a “split brain” situation can be tested by breaking the network link between the primary and secondary servers and verifying that no information has been lost. Equally important, server power failure can be tested by cutting power to the server and measuring the time until the server and applications are back online. This shouldn't take more than a few minutes if automatic failover with full server replication has been implemented (though it may seem longer for the system administrator). An enterprise that hasn't invested in the right solution could find its systems are down for several hours: the time needed to bring up the backup server and restore the data from backups. If multiple applications need to be reinitialized, downtime can extend into a second day.
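The power-failure test described above comes down to timing how long the service takes to respond again. The following Python sketch shows one possible way to measure that. It assumes the application accepts TCP connections on a known port; the host name in the usage comment, the polling interval, and the ten-minute limit are hypothetical values, not recommendations.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_recovery(is_up, poll_interval=1.0, max_wait=600.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_up() until it returns True.

    Returns the elapsed time in seconds, or None if the service has
    not come back within max_wait seconds. The clock and sleep
    parameters exist so the function can be exercised without waiting.
    """
    start = clock()
    while True:
        if is_up():
            return clock() - start
        if clock() - start >= max_wait:
            return None
        sleep(poll_interval)


# Example use during a failover test (hypothetical host and port):
# start this immediately after cutting power to the primary server.
#
#   elapsed = measure_recovery(lambda: tcp_probe("app.example.internal", 443))
#   print("not recovered" if elapsed is None else f"recovered in {elapsed:.0f} s")
```

A successful TCP connection only shows that something is listening; an application-level check would give a more faithful measure of when users can actually work again.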