Tag Archives: lessons learned

A few days before the recent British Airways (BA) catastrophic IT failure I was in Kuala Lumpur, Malaysia, giving a talk entitled “Building a Robust ITDR Plan” at the second ASEAN Business Continuity Conference.

The main thrust of this talk was that as IT is at the heart of every organisation, ITDR is at the heart of Business Continuity, and that it is up to the organisation’s top management to ensure that its ITDR plans both meet the needs of the organisation and are known to work.

It appears that BA’s ITDR plans did not work, and although we don’t know whether the plans were appropriate for BA, the possibility is that they weren’t. In any event, the failure certainly came as a nasty surprise to BA’s top management.

I was asked to provide a closing thought to my talk on “Building a Robust ITDR Plan”, and I used a quote from Georges Clemenceau, the Prime Minister of France in the First World War, to sum up my ideas. For those of you who aren’t that aware of the catastrophe suffered by France in that war, it lost a generation of young men. Out of 8 million men conscripted, 4 million were wounded and 1 in 6 killed.

Georges Clemenceau said “War is too serious a matter to entrust to military men.”

I said “ITDR is too serious a matter to entrust to technologists.”

BA will have learnt that lesson, as France did, the hard way.

Cyber and terrorist attacks currently appear to dominate Business Continuity (BC) thinking, but over the weekend we had a classic example of a good old fashioned failure of a critical IT system causing major disruption and some resulting poor incident management that compounded the problem. The company involved was British Airways (BA), and I say poor incident management because this is what the public has perceived and what BA customers experienced. No doubt there will be an internal BA investigation into what went wrong, but as a BC professional I’d love to know about three aspects of the incident and BA’s response:

  1. How long did it take from the initial failure of the system for the IT support technicians to realise that they were dealing with a major incident? Who did they escalate the incident to (if anyone), were the people designated to handle major incidents contactable, and was the problem compounded by the fact that BA’s IT had been outsourced to India?
  2. The system that failed is so critical to BA’s operations that it must have had a Recovery Time Objective (RTO) of minutes, or at worst a couple of hours. To achieve this, BA should have had a duplicate live version of the system running (Active/Active). Either BA did not have such a recovery option in place (my guess is that they had a replica, i.e. Active/Passive), which implies that they failed to understand the need for a very short downtime on the system, or it had not been properly tested and failed when required (a simplified sketch of the difference between the two patterns follows this list).
  3. Why were the communications with customers (people who were booked on BA flights) handled so badly? BA must have a plan to communicate with passengers, but was this dependent on the very system that failed?
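
As a purely illustrative aside, the sketch below (in Python, and entirely hypothetical; it has nothing to do with BA’s actual systems) shows why the two patterns mentioned in point 2 have such different recovery times: in Active/Active every node is already serving live traffic, so losing one barely registers, while in Active/Passive the RTO is however long the failover switch-over takes, which is precisely what regular testing has to prove.

```python
import random
import time


def is_healthy(node: str) -> bool:
    """Stand-in health check; a real system would probe the service itself."""
    return random.random() > 0.1  # pretend each node fails ~10% of checks


def route_active_active(nodes: list[str]) -> str:
    """Active/Active: every node already serves live traffic, so losing one
    simply removes it from the pool and the effective RTO approaches zero."""
    live = [n for n in nodes if is_healthy(n)]
    if not live:
        raise RuntimeError("All nodes are down")
    return random.choice(live)


def route_active_passive(primary: str, standby: str) -> str:
    """Active/Passive: the standby only takes traffic once failover has been
    declared and completed, so the RTO is however long that switch takes."""
    if is_healthy(primary):
        return primary
    # Failover: in reality this means promotion, DNS/VIP changes, data
    # catch-up and human decisions; this is where the minutes (or hours) go.
    time.sleep(0.5)  # token stand-in for the switch-over delay
    if is_healthy(standby):
        return standby
    raise RuntimeError("Primary and standby are both down")


if __name__ == "__main__":
    print("Active/Active served from:", route_active_active(["dc1", "dc2"]))
    print("Active/Passive served from:", route_active_passive("dc1", "dc2"))
```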

For me, even before the inquest takes place, the major lesson to be learned is that the effectiveness of an organisation’s BC and incident response plans can only be assured by actually using the plans and responding to incidents. If you don’t want to find this out in response to a real incident, then you need to run realistic and regular exercises so that every aspect of your response is tested and the people involved know what to do. It doesn’t matter how good your Business Continuity Management (BCM) process is, how closely aligned to ISO 22301 it is, how good the result of your latest BC audit was, or how much documentation you have. It’s your ability to respond effectively and recover in time that matters.

BA have suffered damage to their reputation; how much is yet to be seen. They will also have suffered financial damage, and when the London Stock Market opens for trading we’ll see how much it has affected their share price. Maybe BA do run realistic and regular exercises. If they do, they should have identified the issues with the systems and incident response that were encountered over the weekend and acted on the lessons learned.

In a survey about the experience of handling major losses undertaken by Vericlaim and Alarm, more than half of respondents “rated the practical assistance offered by a BCP (Business Continuity Plan) following a major incident as one or two out of a possible score of five”. In other words, the BC Plans of the organisations responding to the survey were found to be not particularly helpful when responding to a major loss!

This finding seems to have been rather under-reported by the BC community, who are usually so forward in explaining the importance of having a BC Plan and extolling the virtues of BC in improving resilience. Personally, I find it a damning indictment of the BC profession.

One of the things that constantly both amuses and horrifies me is how far most BC Plans are from the description given in the Business Continuity Institute’s (BCI’s) Good Practice Guidelines. This states that a BC Plan should be “…focused, specific and easy to use…”, and that the important characteristics of an effective BC Plan are that it is direct, adaptable, concise, and relevant.

Over the years I have had the pleasure of seeing hundreds, if not thousands, of BC Plans from a wide variety of organisations, and I can safely say that more than 90% of these plans do not fit this description. They tend to contain lots of information that is irrelevant to the purpose of responding to a major incident and seem to be written more for the benefit of the organisation’s auditors than for use by the people who need to take action to reduce the impact of an incident on the organisation.

As a BC consultant, I keep trying my best to improve BC Plans, but I’m constantly being knocked back by people who tell me that all sorts of things need to be put into their BC Plans, more often than not because of an audit or review undertaken by a third party.

For far too long this situation has been allowed to continue unchallenged. It cannot continue much longer without the BC profession losing credibility.

Reading about one of the causes of the catastrophic failures at Mid Staffordshire NHS Trust, which led to more than 1,200 patient deaths, reminded me of a similar issue that plagues many implementations of Business Continuity Management (BCM) programmes. This was the Trust’s concentration on achieving targets that would enable it to get a good rating from the NHS auditors, rather than on the most important objective, which was to ensure that patients left hospital in a better state of health than when they were admitted.

The issue in many BCM implementations is that organisations are looking to get a good rating from their auditors by doing all the things that a standard states they should do, rather than working to achieve the most important objective, which is to improve the organisation’s resilience.

Setting targets based on readily measurable things is straightforward, and allows auditors to identify whether or not an outcome has been achieved, or how close it is to being achieved. Setting targets for things that are difficult to measure is problematic, and gives auditors a major problem when making an assessment. Unfortunately, the trend in many sectors over the past 20 years has been to rely more and more on these measurable targets when assessing performance, and to ignore the most important one. BCM has been no exception: achieving compliance with BS 25999 or ISO 22301 is commonly seen as the main objective, not becoming more resilient.

Hopefully, what has happened at Mid Staffordshire NHS Trust will be the beginning of the end of relying on peripheral, measurable targets, and the world will move back to looking at how well an organisation is achieving its critical objectives. Don’t bet the house on it, though.

The RBS systems failure should become a case study in Business Continuity, but I doubt that it will, as the bank won’t want to advertise not only how it managed to get something seriously wrong, but also how long the problem took to fix and what it really cost. Every Business Continuity professional should be interested in this so that they can learn from any mistakes that were made, and see how Business Continuity Plans were used in response to a real disruption.

The first thing that I’m interested in though, is whether or not RBS activated its strategic level Business Continuity Plan, which may be known as an Incident or Crisis Management Plan. Presuming that RBS has such a plan, was it used, or did a group of senior executives just get together and decide what to do without reference to the plan?

Secondly, did the person who first identified that a software upgrade had gone wrong just try and fix it, or did they also escalate the issue up the management chain of command? If so, did it get to the top quickly, or did it stay hidden until the effect of the problem became widely known?

Being a UK taxpayer, I’m a shareholder in RBS. As a shareholder, I’d like RBS to undertake a thorough post incident review and publish the results so that we can all learn from what went wrong.