
Tag Archives: Crisis Management

A few days before British Airways' (BA) recent catastrophic IT failure I was in Kuala Lumpur, Malaysia, giving a talk entitled “Building a Robust ITDR Plan” at the second ASEAN Business Continuity Conference.

The main thrust of the talk was that, as IT is at the heart of every organisation, IT Disaster Recovery (ITDR) is at the heart of Business Continuity, and that it is up to the organisation’s top management to ensure that its ITDR plans both meet the needs of the organisation and are known to work.

It appears that BA’s ITDR plans did not work, and although we don’t know whether the plans were appropriate for BA, it is quite possible that they weren’t. In any event, the failure certainly came as a nasty surprise to BA’s top management.

I was asked to provide a closing thought to my talk on “Building a Robust ITDR Plan”, and I used a quote from Georges Clemenceau, the Prime Minister of France in the First World War, to sum up my ideas. For those of you who aren’t that aware of the catastrophe suffered by France in that war, it lost a generation of young men. Out of 8 million men conscripted, 4 million were wounded and 1 in 6 killed.

Georges Clemenceau said “War is too serious a matter to entrust to military men.”

I said “ITDR is too serious a matter to entrust to technologists.”

BA will have learnt that lesson, as France did, the hard way.


Cyber and terrorist attacks currently appear to dominate Business Continuity (BC) thinking, but over the weekend we had a classic example of a good old-fashioned failure of a critical IT system causing major disruption, and some resulting poor incident management that compounded the problem. The company involved was British Airways (BA), and I say poor incident management because that is what the public perceived and what BA customers experienced. No doubt there will be an internal BA investigation into what went wrong, but as a BC professional I’d love to know about three aspects of the incident and BA’s response:

  1. How long did it take from the initial failure of the system for the IT support technicians to realise that they were dealing with a major incident? Who did they escalate the incident to (if anyone)? Were the people designated to handle major incidents contactable, and was the problem compounded by the fact that BA’s IT had been outsourced to India?
  2. The system that failed is so critical to BA’s operations that it must have had a Recovery Time Objective (RTO) of minutes or, at worst, a couple of hours. To achieve this, BA should have put in place a duplicate live version of the system (Active/Active). Either BA did not have such a recovery option in place (I’m guessing that they had a replica – Active/Passive), which implies that they failed to understand the need for very short downtime on the system, or it had not been properly tested and failed when required (see the illustrative sketch after this list).
  3. Why were the communications with customers (people who were booked on BA flights) handled so badly? BA must have a plan to communicate with passengers, but was this dependent on the very system that failed?
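
To make the difference between these recovery options concrete, here is a minimal, purely illustrative sketch of the kind of health-check and failover monitoring that sits behind an Active/Passive arrangement (item 2 above). The hostnames, intervals and promotion step are hypothetical placeholders, not a description of BA’s systems. In a genuine Active/Active arrangement both instances already serve live traffic, so there is no promotion step to fail at the moment it is needed, which is precisely why it supports an RTO measured in minutes.

```python
# Illustrative only: a minimal health-check and failover monitor for an
# Active/Passive pair. The hostnames, intervals and the promote/notify
# mechanics are hypothetical placeholders, not BA's actual setup.

import time
import urllib.request

PRIMARY = "https://primary.example.internal/health"   # hypothetical endpoint
STANDBY = "https://standby.example.internal/health"   # hypothetical endpoint
CHECK_INTERVAL_SECONDS = 30
FAILURES_BEFORE_FAILOVER = 3    # don't fail over on a single blip
RTO_SECONDS = 2 * 60 * 60       # the "couple of hours at worst" from the post


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False


def promote_standby() -> None:
    """Placeholder for the real promotion step (DNS switch, load balancer
    re-point, database promotion) and for alerting the people named in the
    incident management plan."""
    print("FAILOVER: promoting standby and notifying the incident team")


def monitor() -> None:
    consecutive_failures = 0
    outage_started = None

    while True:
        if is_healthy(PRIMARY):
            consecutive_failures = 0
            outage_started = None
        else:
            consecutive_failures += 1
            outage_started = outage_started or time.time()

            if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                if is_healthy(STANDBY):
                    promote_standby()
                    return
                # Standby is also down: the untested-recovery scenario.
                if time.time() - outage_started > RTO_SECONDS:
                    print("RTO breached: escalate as a major incident")

        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    monitor()
```

Even a sketch like this shows where things can go wrong: the failover logic itself is something that has to be exercised, and the escalation step only works if the people named in the plan are contactable.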

For me, even before the inquest takes place, the major lesson to be learned is that the effectiveness of an organisation’s BC and incident response plans can only be assured by actually using the plans and responding to incidents. If you don’t want to find this out in response to a real incident, then you need to run realistic and regular exercises so that every aspect of your response is tested and the people involved know what to do. It doesn’t matter how good your Business Continuity Management (BCM) process is, how closely aligned to ISO 22301 it is, how good the result of the latest BC audit was, or how much documentation you have. It’s your ability to respond effectively and recover in time that matters.

BA have suffered damage to their reputation; how much is yet to be seen. They will have suffered financial damage, and when the London Stock Market opens for trading we’ll see how much it has affected their share price. Maybe BA do run realistic and regular exercises. If they do, they should have identified the issues with the systems and incident response that were encountered over the weekend and acted on the lessons learned.

I have been further convinced of the need for the Business Continuity (BC) profession to get back to its fundamentals by the juxtaposition of two things: the publication by the Business Continuity Institute (BCI) of a comprehensive list of legislation, regulations, standards and guidelines in the field of Business Continuity Management (BCM), and the experience of the many businesses that were affected by the recent floods in the north-west of England.

Some small businesses, mainly those that operate and serve very local markets, have temporarily closed until their premises can be refurbished, but others are up and running and continuing to trade even though their premises were badly flooded. The businesses that are back up and running had implemented BC, but not in the way envisaged by the BC profession through its standards and guidelines.

These businesses had taken steps to ensure that they could recover from incidents like the recent flooding by doing such things as backing up their data, implementing cloud computing, knowing where they could obtain replacement premises and equipment, being able to redirect their telephones, and having adequate insurance cover. They are also managed by people who know how to respond to incidents, are committed to the continued success of their business, and know what needs to be recovered by when without having to read a plan.

None of these businesses had implemented a formal BCM programme, none of them had followed any guidelines, and none of them had implemented a Business Continuity Management System (BCMS) or been certified to a BCM standard.

The publication by the BCI of a comprehensive list of BCM legislation, regulations, standards and guidelines is very useful, and I’m not decrying it. But, and it is a very big but, the purpose of BC is to enable organisations to be resilient to incidents that affect their ability to operate. The people who own and run businesses in the north-west of England that had taken steps to ensure that they could recover from the recent flooding are practising the fundamentals of BC, and by and large they have never even heard of BCM legislation, regulations, standards and guidelines.

Don’t get me wrong, there’s nothing wrong with BCM legislation, regulations, standards and guidelines, but they are not an end in themselves. I sometimes think that BC professionals lose sight of this.

As most people are only too well aware, the way that we find and use information is going through a radical and fundamental change, driven by the Internet. What doesn’t seem to have permeated the world of Business Continuity, though, is that this change is revolutionising the Business Continuity Plan.

Not too many years ago, in our house, we used to keep a telephone directory and a combined bus and train timetable near our front door, close to where we had our telephone. Today, we have neither of those things, and if we want to find a telephone number or the time of a bus or train we’ll simply use the Internet, and rapidly find what we’re looking for without wading through pages and pages of small print trying to decipher how the directory or timetable is organised before getting to the information that we want. We also had the depressing problem of finding out later on that we’d looked up the information in a document that was out of date, because one of the family had inadvertently thrown away the new version and kept the old one.

Telephone directories and timetables are just two examples of documents that are being used by fewer and fewer people, and most of those are older people who find it hard to change a lifetime’s habits. Using printed documents to find information is becoming a thing of the past, as anyone who mixes with youngsters will confirm. Why, then, do we persist with documents in the world of Business Continuity? What’s wrong with just finding the information that we need from the Internet?

The problems of document-based Business Continuity Plans are only too well known. Unfortunately, more often than not, they are difficult to use in a crisis, contain unnecessary information, and are out of date. What we really need is something that is simple to use, delivers exactly what is required, and provides the latest information. That is an App.

An App, short for Application, is quite simply a piece of software designed to fulfil a particular purpose, downloaded by a user to a computing device from which it can be used. Apps can be used to obtain information, and when designed to provide the information required to respond to an incident, they are an ideal and powerful tool.

Don’t make the mistake of thinking that holding a Business Continuity Plan as a PDF document and making it available on the Internet via an App is the same thing as an App designed to enable someone to respond to an incident; it’s not. You don’t look up the time of a train on the Internet by opening up a PDF document and searching through it, do you?
A Business Continuity App can provide responders with clear, action-orientated and time-based direction, while allowing quick access to relevant and up-to-date support information. Exactly what we want to achieve.
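
As a purely illustrative sketch of the difference, assuming a hypothetical plan service and API (the URL, JSON fields and roles below are invented for illustration), rather than opening a document, a responder’s App might pull the current, role-specific, time-based action list like this:

```python
# Sketch of the idea behind a Business Continuity App: responders fetch the
# latest action checklist for their role from a central plan service instead
# of searching a static PDF. The URL, fields and roles are hypothetical
# placeholders, not a real product or API.

import json
import urllib.request

PLAN_SERVICE = "https://bcplan.example.com/api/checklists"  # hypothetical


def fetch_checklist(role: str, incident_type: str) -> list[dict]:
    """Fetch the current action list for one role and one incident type."""
    url = f"{PLAN_SERVICE}?role={role}&incident={incident_type}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def show_actions(actions: list[dict]) -> None:
    """Print each action with its target time, so direction is time-based."""
    for action in sorted(actions, key=lambda a: a["due_minutes"]):
        print(f"[within {action['due_minutes']} min] {action['task']}")


if __name__ == "__main__":
    # e.g. the communications lead responding to an IT outage
    show_actions(fetch_checklist(role="comms-lead", incident_type="it-outage"))
```

The point is not the code itself but its shape: the actions are fetched at the moment they are needed, so they are always the latest version, already filtered to the responder’s role, and ordered by when they are due.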

This revolution has profound consequences for the world of Business Continuity, and if you’d like to find out what these are, then come and listen to me present at the BCI World Conference and Exhibition in November. The Business Continuity Plan, as a document, is dead, long live the Business Continuity App.

The Bank of England has just been heavily criticised in a report by Deloitte into the unprecedented day-long collapse of its Real-Time Gross Settlement (RTGS) system last October. Deloitte found that the Bank’s officials had never rehearsed what would happen in the event of the platform going down for any length of time, and to compound the problem, Deloitte also discovered that the three Bank of England executives with responsibility for the system were all out of the country on the day the outage happened. Not only did the system fail, but the Bank had virtually no crisis management plans in place to deal with the incident.

Unfortunately, in my experience of providing Business Continuity services to a wide variety of organisations over many years, one of the constant themes that I come across is the failure to exercise recovery plans. It’s not a point-blank refusal to run an exercise that’s the problem; instead, it’s the constant postponement that eventually results in the failure to exercise a recovery plan.

All sorts of good reasons are given for postponing an exercise, from the understandable fact that everyone is just too busy at the present time to the ludicrous idea that the recovery shouldn’t be exercised until it is known to work (which came first, the chicken or the egg?). And so it goes on, month after month, year after year, with everyone saying that they intend to run an exercise, but with nobody committing to a date or time.

Don’t get me wrong, I do have clients that do exercise their recovery plans, but they are in a minority and they don’t exercise every plan as often as they should. I’ve tried all sorts of ideas to overcome this problem, but none of them seems to have worked. Is this just a fact of life, or can something really be done to make sure that recovery plans are exercised on a regular basis?

The RBS systems failure should become a case study in Business Continuity, but I doubt that it will, as the bank won’t want to advertise how it managed not only to get something seriously wrong, but also how long it took to fix and what it really cost. Every Business Continuity professional should be interested in this so that they can learn from any mistakes that were made, and see how Business Continuity Plans were used in response to a real disruption.

The first thing that I’m interested in though, is whether or not RBS activated its strategic level Business Continuity Plan, which may be known as an Incident or Crisis Management Plan. Presuming that RBS has such a plan, was it used, or did a group of senior executives just get together and decide what to do without reference to the plan?

Secondly, did the person who first identified that a software upgrade had gone wrong just try and fix it, or did they also escalate the issue up the management chain of command? If so, did it get to the top quickly, or did it stay hidden until the effect of the problem became widely known?

Being a UK taxpayer, I’m a shareholder in RBS. As a shareholder, I’d like RBS to undertake a thorough post incident review and publish the results so that we can all learn from what went wrong.