
Tag Archives: single point of failure

A few days before the recent British Airways (BA) catastrophic IT failure I was in Kuala Lumpur, Malaysia, giving a talk entitled “Building a Robust ITDR Plan” at the second ASEAN Business Continuity Conference.

The main thrust of this talk was that as IT is at the heart of every organisation, ITDR is at the heart of Business Continuity, and that it is up to the organisation’s top management to ensure that its ITDR plans both meet the needs of the organisation and are known to work.

It appears that BA’s ITDR plans did not work, and although we don’t know whether the plans were appropriate for BA, it is quite possible that they weren’t. In any event, the failure certainly came as a nasty surprise to BA’s top management.

I was asked to provide a closing thought to my talk on “Building a Robust ITDR Plan”, and I used a quote from Georges Clemenceau, the Prime Minister of France in the First World War, to sum up my ideas. For those of you who aren’t that aware of the catastrophe suffered by France in that war, it lost a generation of young men. Out of 8 million men conscripted, 4 million were wounded and 1 in 6 killed.

Georges Clemenceau said “War is too serious a matter to entrust to military men.”

I said “ITDR is too serious a matter to entrust to technologists.”

BA will have learnt that lesson, as France did, the hard way.

Cyber and terrorist attacks currently appear to dominate Business Continuity (BC) thinking, but over the weekend we had a classic example of a good old-fashioned failure of a critical IT system causing major disruption, and some resulting poor incident management that compounded the problem. The company involved was British Airways (BA), and I say poor incident management because this is what the public perceived and what BA customers experienced. No doubt there will be an internal BA investigation into what went wrong, but as a BC professional I’d love to know about three aspects of the incident and BA’s response:

  1. How long did it take from the initial failure of the system for the IT support technicians to realise that they were dealing with a major incident, to whom did they escalate the incident (if anyone), were the people designated to handle a major incident contactable, and was the problem compounded by the fact that BA’s IT had been outsourced to India?
  2. The system that failed is so critical to BA’s operations that it must have had a Recovery Time Objective (RTO) of minutes, or at worst, a couple of hours. To achieve this, BA should have put in place a duplicate live version of the system (Active/Active; see the sketch after this list). Either BA did not have such a recovery option in place (my guess is that they had a replica – Active/Passive), which implies that they failed to understand the need for a very short downtime on the system, or it had not been properly tested and failed when required.
  3. Why were the communications with customers (people who were booked on BA flights) handled so badly? BA must have a plan to communicate with passengers, but was this dependent on the very system that failed?
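
We don’t know how BA’s recovery was actually architected, but the Active/Active versus Active/Passive distinction in point 2 can be illustrated with a minimal Python sketch. Everything in it – the node names, the health flags, the promote step – is a hypothetical stand-in, not a description of BA’s systems.

```python
# Minimal sketch (hypothetical, not BA's actual architecture) contrasting
# Active/Active and Active/Passive recovery for a critical system.
import random


class Node:
    """A single instance of the critical system."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} processed {request}"


def active_active(nodes, request):
    """Both nodes serve live traffic; a failed node is simply skipped,
    so the effective RTO is close to zero."""
    for node in sorted(nodes, key=lambda _: random.random()):  # crude load spread
        if node.healthy:
            return node.handle(request)
    raise RuntimeError("total outage: no healthy node")


def active_passive(primary, standby, request):
    """Only the primary serves traffic; on failure the standby must be
    promoted, and the RTO depends on how quickly (and whether) that
    fail-over actually works when invoked in anger."""
    if primary.healthy:
        return primary.handle(request)
    promote(standby)  # this is the step that regular exercises prove out
    return standby.handle(request)


def promote(standby):
    """Stand-in for the real promotion work: restoring data, redirecting
    traffic, warming caches. If it has never been tested, it may fail."""
    standby.healthy = True


if __name__ == "__main__":
    a, b = Node("datacentre-A"), Node("datacentre-B")
    a.healthy = False  # simulate the primary failing
    print(active_active([a, b], "check-in #1"))   # carries on via datacentre-B
    print(active_passive(a, b, "check-in #2"))    # works only if promotion works
```

The point of the sketch is simply that in the Active/Passive case there is a promotion step standing between failure and recovery, and that step is exactly what untested plans tend to get wrong.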

For me, even before the inquest takes place, the major lesson to be learned is that the effectiveness of an organisation’s BC and incident response plans can only be assured by actually using the plans and responding to incidents. If you don’t want to find this out in response to a real incident, then you need to run realistic and regular exercises so that every aspect of your response is tested and the people involved know what to do. It doesn’t matter how good your Business Continuity Management (BCM) process is, how closely aligned to ISO 22301 it is, how good the result of the latest BC audit was, or how much documentation you have. It’s your ability to respond effectively and recover in time that matters.

BA have suffered damage to their reputation; how much is yet to be seen. They will have suffered financial damage, and when the London Stock Market opens for trading we’ll see how much it has affected their share price. Maybe BA do run realistic and regular exercises. If they do, they should have identified the issues with the systems and incident response that were encountered over the weekend and acted on the lessons learned.

 

 

I was in the process of undertaking a risk assessment exercise for a client of mine when RBS suffered their systems failure the other day. By an amazing coincidence, I was working on the risks to the most urgent activities undertaken by their Finance department, and one of those activities is to make payments via BACS. I had identified that this activity was dependent on their bank, RBS, and was confronted with a very real problem: how do I assess the risk of RBS not being able to process payments because of a failure of their systems?

The likelihood of RBS not being able to process payments because of a failure of their systems would normally be rated as very low, but today it’s a racing certainty! Had I undertaken the risk assessment a week ago I would have rated the risk as very low and not worth taking any mitigating measures. Now the risk is very high (certainty multiplied by the impact on my client), so they should consider mitigating action – such as having an alternative bank.
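
The arithmetic behind that judgement is the usual likelihood-times-impact score. Here is a minimal sketch of it; the 1–5 scales, the ratings and the treatment threshold are my own illustrative assumptions, not the client’s actual risk criteria.

```python
# Minimal sketch of the likelihood x impact scoring described above.
# The 1-5 scales, ratings and threshold are illustrative assumptions only.

def risk_score(likelihood, impact):
    """Score a risk on simple 1-5 likelihood and 1-5 impact scales."""
    return likelihood * impact


TREAT_THRESHOLD = 10  # scores at or above this warrant mitigating action

impact = 5  # BACS payments not going out would hit the client hard

# A week ago: a bank systems failure judged very unlikely.
before = risk_score(likelihood=1, impact=impact)   # 5  -> accept the risk

# Today, with RBS actually down: the likelihood is a "racing certainty".
after = risk_score(likelihood=5, impact=impact)    # 25 -> mitigate, e.g. a second bank

for label, score in (("before", before), ("after", after)):
    action = "mitigate" if score >= TREAT_THRESHOLD else "accept"
    print(f"{label}: score {score} -> {action}")
```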

The question is, what does this say about the value of undertaking such risk assessments?

One of the things that I always look for when helping a client to implement Business Continuity is single points of failure. My latest client has managed to provide me with the best example of one yet, and the name of the single point of failure is Malcolm.

The client will remain nameless, to protect the innocent, but quite by chance it was revealed that one of their most critical and urgent activities is totally dependent on a single person working for an outsource supplier, and his name is Malcolm. If Malcolm is not available to do an activity that is on the critical path to enable one of the client’s most important services to be delivered, the client’s reputation will be destroyed. This service is delivered just once a year, and is vital to thousands of my client’s stakeholders.

I’ve never met Malcolm, but apparently he’s been undertaking this activity for many years, and it’s never failed. I can’t help wondering how old Malcolm is, or whether or not he’s in good health. I know who he works for, but again, I must protect the innocent.

I nearly missed this single point of failure, so from now on I’m going to redouble my efforts to find the Malcolms of this world.

 

Most people that I talk to agree, in theory, that having a single point of failure is not a good idea. However, these very same people appear to accept that it is reasonable for their own organisation to have many single points of failure if they are a fundamental part of the way that the organisation has been set up.

An example of such an organisation would be a government regulator with a single office in the centre of a country’s capital city. Having a single office containing all the staff, the organisation’s records, and its computer system, is a single point of failure. Any suggestion that they might like to make their organisation less vulnerable is met with solid resistance. If they were a manufacturing company I could better understand the reluctance, as duplicating manufacturing sites can often result in the company becoming noncompetitive. But an office-based regulator?

I suppose that this is what’s called risk appetite, but it’s rarely consistent across all the risks that such an organisation faces.