How Can We Optimize BC/DR KPIs? (Part III)

To conclude the discussion about BC/DR KPI optimization, let’s talk about Recovery Window and System Redundancy.

Both depend heavily on how many resources you can allocate to your BC/DR solution, so KPI optimization essentially boils down to optimizing the value you get for the price.
This became even more critical with the advent of the Cloud, since every gigabyte you allocate is multiplied by the number of downstream environments and will eventually bloat the Cloud bill significantly.

The general guidelines here are:
  • Be very cautious with auto-scaling technologies. Remember, sometimes what is allocated once can’t be taken back.
  • Use incremental backups. Full backups are good for decreasing RTO but can drastically increase space usage.
  • Consider implementing compression to save storage space.
  • Use cheaper, archive-class storage for older data that you don’t need immediate access to. Leverage Cloud offerings (like AWS Glacier) or tape technologies.
  • Don’t forget system housekeeping, including optimizing your database space usage: make sure the gigabytes allocated are actually used.
  • Apply the resiliency techniques I mentioned in the previous posts, such as cold standby and multiplexing, to minimize resiliency cost.
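
To get a feel for how backup strategy and the number of environments drive storage, here is a rough Python sketch. The function name, the 500 GB database, the 10% daily change rate and the three environments are all illustrative assumptions, not measurements from a real system.

```python
def backup_storage_gb(db_size_gb, daily_change_rate, retention_days,
                      full_every_days, environments=1):
    """Estimate backup storage for one retention period.

    Assumes one backup per day, that each incremental holds one day of
    changed data, and that every environment keeps its own backup set.
    """
    total = 0.0
    for day in range(retention_days):
        if day % full_every_days == 0:
            total += db_size_gb                      # full backup
        else:
            total += db_size_gb * daily_change_rate  # incremental backup
    return total * environments


# Daily full backups vs. a weekly full plus daily incrementals, 28 days kept.
daily_fulls = backup_storage_gb(500, 0.10, 28, full_every_days=1, environments=3)
weekly_fulls = backup_storage_gb(500, 0.10, 28, full_every_days=7, environments=3)
```

Under these assumptions the daily-full strategy needs several times more space than the weekly-full one, and the gap grows with every extra environment. Compression and archive-tier pricing are left out to keep the sketch short.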

Key Components of a Business Continuity Plan

The success of many companies across the globe depends on the stability of their mission-critical IT systems. When these systems go down, companies are often left struggling to deal with a variety of operational issues.

The loss of productivity, money and time associated with downtime can lead to more than just an inconvenience. Some business disasters are capable of driving a company out of business altogether.

Stats tell us that the average cost of one minute of system downtime is about $5,600. The hourly cost of IT downtime can range from $140,000 to $540,000. This is why it’s extremely important for each of your mission-critical systems to have a comprehensive, robust business continuity plan.
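
As a quick sanity check on these figures, the arithmetic can be sketched in a few lines of Python. The per-minute figure is the one quoted above; the function name is only an illustration, and your own systems will have a different cost per minute.

```python
COST_PER_MINUTE_USD = 5_600  # average figure quoted above, not your actual cost


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Return the estimated cost of an outage of the given length."""
    return minutes * cost_per_minute


one_hour = downtime_cost(60)         # 336,000 USD
one_weekend = downtime_cost(48 * 60)  # outage lasting a full weekend
```

At the quoted average, one hour of downtime costs $336,000, which sits inside the hourly range mentioned above.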

Unfortunately, most companies either don’t have such a plan or have an inefficient one that is incapable of producing the desired results. The following are some of the key components of a proper BCP program:

  • System resilience solution: This part is to ensure the system’s ability to withstand outages without being forced to perform disaster recovery. A resilient cluster configuration can be a good example here.
  • Backup solution: It is about storing backup copies of critical files and data in case of any possible system outage. A backup ensures that you are able to restore at all.
  • Disaster Recovery solution: A technical solution for recovering from any possible disaster in a fast and effective way. DR makes sure that you are able to restore promptly and adhere to your KPIs’ target values. A geographically distant replica of your DB is a viable example here.
  • Monitoring: This part is also critical. How are you going to start disaster recovery without identifying the issue, or knowing what exactly has caused the system failure? If your system fails on Saturday at midnight and no one is able to restore it till Monday, you will obviously not meet that 1-hour RTO target.
  • Support: In addition to the previous one — how can you recover a mission-critical system that runs 24/7 without having 24/7 technical support and resources?
  • Documentation: This one is pretty obvious. You need to have a good reference to your system’s layout to be able to effectively work and promptly recover.
  • Automation: Fewer manual processes mean fewer chances for human error and systems that get back online faster.

We will talk more about these essential BCP components in the coming posts.

Disaster Recovery Planning and Implementation

Do you have a disaster recovery plan in place to effectively deal with pandemics, floods, fires, cybersecurity attacks, etc.? Multiple things can go wrong when incidents unfold in an unexpected way.

For example, if a cyberattack halts your business operations, how would you quickly restore your IT processes and ensure business continuity?

A well-thought-out Business Continuity plan helps you maintain business functions or revive your business in the event of major disruptions. It outlines processes that you must follow when something has a devastating impact on your business.

Whether you’re a small business owner or a large corporation, your success largely depends on your ability to stay competitive during adverse circumstances.

Increased customer confidence, a stronger reputation, and higher market value are three of the major benefits of designing an effective business continuity plan.

 

What Should be the First Step?

Work on fundamentals. Ask yourself the following questions to bring clarity to your vision and define the BCP implementation project:

1) What are the systems we are running?
2) How does each system’s downtime impact our business operations?
3) How much money will our company lose if a given system goes down?

And then, using the answers to the previous three questions, answer:

4) What are the KPI values our solution needs to meet?
5) What are the essential tools we need to support our strategy?

 

Thoroughly answer these questions as your first step toward designing a spot-on business continuity plan.
From the anticipated money loss, you’ll be able to figure out the DR project budget.

To define your criteria of success, you need to identify your KPIs. Some of the notable KPIs include:

  • Recovery Point Objective
  • Recovery Time Objective
  • Recovery Window
  • System Redundancy

Then, from the KPI values, you’ll be able to decide on the tools you want to use.

So, as you probably noticed, KPIs are the essential element here. If you haven’t yet documented KPIs for your disaster recovery solution, it’s high time to do so. I’ll talk more about tools and processes soon.

KPIs for Disaster Recovery and Business Continuity

I briefly talked about designing a Business Continuity (BC) Plan in my previous post. Let’s now talk about Business Continuity and Disaster Recovery KPIs.

We understand that data is king when it comes to continuity and recovery. Reporting on the right metrics is one of the ways to know whether your solution is working or not, and to figure out what you are trying to build in the first place.

However, it can be a challenge for business continuity and DR managers to implement strong KPIs that clearly articulate the value of their actions.

Here are four metrics that I usually recommend to define and measure the completeness and performance of a BC/DR program:

1) Recovery Point Objective (RPO) 

It is defined as the maximum amount of data you can afford to lose when recovering from a disaster. It can range from days to absolutely no loss. Depending on the solution, RPO can determine how frequently you need to back up your data, or even which BC/DR solution you should choose.

An average Oracle DB system can reach an RPO of «several minutes» fairly easily; going lower than that requires more effort and investment.

2) Recovery Time Objective (RTO)

RTO defines the maximum amount of time you’re allowed to spend recovering, that is, the time within which your business must be restored after any possible disaster.

Depending on the solution size and system performance, a typical Oracle system almost effortlessly achieves a value of «several hours», while reaching something closer to «seconds» requires considerable investment.

3) Recovery Window 

How far back in time you might need to retrieve data from. This value can be dictated by regulatory rules, by prior experience, or by prior user requests for data.
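
In practice, the Recovery Window becomes a retention rule: backups inside the window are kept, older ones become candidates for deletion or archiving. Here is a minimal Python sketch of that rule; the function name and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def split_by_recovery_window(backup_dates, recovery_window_days, today):
    """Split backups into those inside the recovery window and those past it."""
    cutoff = today - timedelta(days=recovery_window_days)
    keep = sorted(d for d in backup_dates if d >= cutoff)
    expired = sorted(d for d in backup_dates if d < cutoff)
    return keep, expired


backups = [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1)]
keep, expired = split_by_recovery_window(backups, 45, today=date(2024, 3, 10))
```

Note that this sketch treats every backup as independent. With incremental backups, an "expired" full backup may still be needed to restore the incrementals that depend on it, so a real retention job has to account for that.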

4) System Redundancy 

How many copies of data/software/hardware do you want to have? This can scale from two to infinity, with robustness scaling alongside. However, the more copies, the higher the costs.

A good practice is the popular «3-2-1 rule»: keep three copies of your data, on two different types of storage media, with one copy offsite. This is not a silver bullet, but something to start with.
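
The 3-2-1 rule, as it is commonly stated (at least three copies, on at least two types of media, with at least one copy offsite), is easy to check against an inventory of your copies. The sketch below is only an illustration; the DataCopy type and the sample inventory are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool


def satisfies_3_2_1(copies):
    """Check: at least 3 copies, at least 2 media types, at least 1 offsite."""
    return (len(copies) >= 3
            and len({c.media for c in copies}) >= 2
            and any(c.offsite for c in copies))


inventory = [
    DataCopy("primary database", "disk", offsite=False),
    DataCopy("local backup", "disk", offsite=False),
    DataCopy("archive", "tape", offsite=True),
]
compliant = satisfies_3_2_1(inventory)
```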

How Can We Optimize BC/DR KPIs? (Part I — RPO)

In the last few posts, we talked about how to choose the right Business Continuity (BC) and Disaster Recovery (DR) KPIs. Let’s take a quick look at how we can optimize those KPIs to meet the business goals.

We understand that metrics are how we define success and failure, and that different metrics demand different action plans.

In this post, I am going to talk about Recovery Point Objective (RPO) and how to optimize it.

Let’s get started!

1) Backup often

How frequently do you back up your data? If you back up daily, your worst-case RPO is 24 hours, meaning you can lose up to 24 hours of data. The easiest way to improve this is simply to back up more often; just stay reasonable here.
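
The relation between backup schedule and RPO can be sketched in Python: with backups as the only protection, the worst-case data loss is the longest gap between two consecutive backups. The function name and the sample schedule are assumptions made for the example.

```python
from datetime import datetime, timedelta


def worst_case_rpo(backup_times):
    """Return the longest gap between consecutive backups."""
    ordered = sorted(backup_times)
    if len(ordered) < 2:
        raise ValueError("need at least two backups to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


start = datetime(2024, 1, 1)
# Backups at 00:00, 06:00, 12:00, then nothing until the next midnight.
schedule = [start + timedelta(hours=h) for h in (0, 6, 12, 24)]
rpo = worst_case_rpo(schedule)
```

Here the skipped evening backup sets the worst case to 12 hours even though most backups are 6 hours apart. The sketch ignores backup duration and any log shipping or replication, both of which change the real number.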

2) Utilize replication and re-synchronization technologies

Copying data over a network helps you keep multiple, up-to-date copies of your critical data. This way you can make RPO even lower without the need to back up unreasonably often. Therefore, make sure to utilize replication and re-synchronization technologies, and configure them to match the desired KPI value.

3) Multiplex your critical files

All systems have files that are more critical than others. For instance, Oracle Database has Control Files and Redo Logs, and losing them would greatly compromise system recoverability. The solution here is multiplexing: keep several physical copies of such files in case one is corrupted or lost. This can be done in multiple ways, from built-in features (like with Oracle) to RAID arrays.
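
To illustrate the idea only (this is not how Oracle implements multiplexing internally), here is a small Python sketch that writes the same bytes to several locations and later picks the first copy whose checksum still matches. The function names are hypothetical, and in a real setup the target paths should live on different physical devices.

```python
import hashlib
from pathlib import Path


def multiplex_write(data, targets):
    """Write the same bytes to every target path and verify each copy."""
    expected = hashlib.sha256(data).hexdigest()
    for target in map(Path, targets):
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        if hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            raise OSError(f"copy at {target} failed verification")
    return expected


def first_healthy_copy(targets, expected):
    """Return the first target whose content still matches the checksum."""
    for target in map(Path, targets):
        if (target.is_file()
                and hashlib.sha256(target.read_bytes()).hexdigest() == expected):
            return target
    return None  # every copy is missing or corrupted
```

If one copy is lost or corrupted, first_healthy_copy falls through to the next one, which is the whole point of multiplexing.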

4) Take advantage of using cold standby systems

You should consider having one system as a backup for another, identical primary system. A «cold» system (compared to a conventional cluster/standby) means there are no services running on the standby, allowing you to avoid excess costs such as compute, memory, and licenses. Just make sure you are able to allocate all those resources during the standby promotion. Such an approach helps you optimize both on-prem and cloud DR solutions, as well as minimize license costs.

I’ll be publishing parts 2 and 3 of this post soon to discuss optimizing the other KPIs. So, stay tuned!

About Upgrade Consulting:
Upgrade Consulting is a smart team of certified engineers and IT professionals who manage and optimize critical systems. Take a look at our services!

How Can We Optimize BC/DR KPIs? (Part II — RTO)

Let’s continue the discussion about optimizing the DR KPIs, which serve as benchmarks against which you measure your disaster readiness.

In part one, I talked about how you can improve RPO. In this post, I am going to identify strategies to optimize Recovery Time Objective (RTO). Consider the following strategies to improve your RTO:

1) Document everything 

It’s important to document every aspect of both your system and restore/recovery process to embrace a more proactive approach. Your engineers shouldn’t be forced to figure out the actions on the go. Instead, they should be able to simply copy-paste the required commands from the DR procedure document.

2) Implement a monitoring system and have a support team

How can you start recovery, if you don’t even know when something has failed?
How can you start recovery if you have no technical resources to perform it?
As I mentioned in one of my previous posts, having a monitoring and support system is essential to build the recovery processes. So, be sure to have a system in place that can keep track of everything, and a team/person who can do the actual work.
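
The reasoning can be made explicit with a little arithmetic: the outage clock starts when the system fails, not when someone notices, so detection and response time count against the RTO just like the restore itself. A minimal Python sketch, with made-up durations:

```python
from datetime import timedelta


def total_outage(detection, response, restore):
    """Outage length = time to notice + time to start work + time to restore."""
    return detection + response + restore


def meets_rto(detection, response, restore, rto):
    return total_outage(detection, response, restore) <= rto


rto = timedelta(hours=1)
restore = timedelta(minutes=40)

# Monitored system with on-call support: noticed in 2 minutes, work starts in 10.
monitored = meets_rto(timedelta(minutes=2), timedelta(minutes=10), restore, rto)

# Same restore procedure, but the failure goes unnoticed over a weekend.
unmonitored = meets_rto(timedelta(hours=48), timedelta(minutes=10), restore, rto)
```

The restore procedure is identical in both cases; only the monitored one stays within the one-hour target.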

3) Utilize replication and re-synchronization technologies 

The same as I mentioned in the previous post: if you can just switch over to a working backup infrastructure, you don’t need to spend time restoring/recovering from a backup.

4) Automate

Automation is one of the best strategies to improve your operational efficiency and business continuity efforts. Identify all the processes that can be automated with a tool.

Use Infrastructure-as-Code and Configuration Management tools wherever feasible. They will drastically decrease the chances of human error and speed up your operations.

When implemented the right way, automation solutions can improve RTO by up to 50% and optimize DR workflows.

If applicable, automate the switchover and make it dependent on system metrics. Why? Because smart automation technologies outperform humans in various business aspects.
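
One simple way to make a switchover depend on metrics is to trigger it only after several consecutive failed health checks, so a single blip does not cause a failover. The Python sketch below shows just that decision logic; the class name and the threshold of three are assumptions, and the actual health probe and promotion steps depend on your stack.

```python
class SwitchoverTrigger:
    """Signal a switchover after N consecutive failed health checks."""

    def __init__(self, failure_threshold=3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one health check; return True when switchover is due."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


trigger = SwitchoverTrigger(failure_threshold=3)
checks = [True, False, False, True, False, False, False]
decisions = [trigger.record(ok) for ok in checks]
```

In this run, only the last check (the third failure in a row) returns True. The threshold trades speed for safety: a lower value shortens detection time, a higher one reduces false switchovers.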

5) Test your BC/DR solutions regularly 

This one is the most important of all. Make sure your business continuity and disaster recovery systems are up and running throughout the year. It’s recommended to test them twice a year, if not quarterly. If something goes wrong during the DR execution, it will increase the RTO significantly. You can’t be sure about your solution until it’s tested.
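
A DR test is also the only place where you can measure your real recovery time. As a sketch, a drill can be wrapped in a small timer that compares the measured duration with the RTO target; run_drill and the restore_procedure callable are hypothetical names standing in for whatever your DR runbook executes.

```python
import time
from datetime import timedelta


def run_drill(restore_procedure, rto):
    """Run a restore procedure and report (elapsed time, RTO met?)."""
    start = time.monotonic()
    restore_procedure()  # any exception here means the drill itself failed
    elapsed = timedelta(seconds=time.monotonic() - start)
    return elapsed, elapsed <= rto
```

Keeping the measured durations from each drill gives you a trend to watch, so an RTO regression shows up in a test rather than in a real disaster.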
