
Tag Archives: resilience

A few days before the recent catastrophic IT failure at British Airways (BA), I was in Kuala Lumpur, Malaysia, giving a talk entitled “Building a Robust ITDR Plan” at the second ASEAN Business Continuity Conference. (ITDR stands for IT Disaster Recovery.)

The main thrust of this talk was that as IT is at the heart of every organisation, ITDR is at the heart of Business Continuity, and that it is up to the organisation’s top management to ensure that its ITDR plans both meet the needs of the organisation and are known to work.

It appears that BA’s ITDR plans did not work, and although we don’t know whether the plans were appropriate for BA, it is quite possible that they weren’t. In any event, the failure certainly came as a nasty surprise to BA’s top management.

I was asked to provide a closing thought to my talk on “Building a Robust ITDR Plan”, and I used a quote from Georges Clemenceau, the Prime Minister of France in the First World War, to sum up my ideas. For those of you who aren’t that aware of the catastrophe suffered by France in that war, it lost a generation of young men. Out of 8 million men conscripted, 4 million were wounded and 1 in 6 killed.

Georges Clemenceau said “War is too serious a matter to entrust to military men.”

I said “ITDR is too serious a matter to entrust to technologists.”

BA will have learnt that lesson, as France did, the hard way.


Cyber and terrorist attacks currently appear to dominate Business Continuity (BC) thinking, but over the weekend we had a classic example of a good old-fashioned failure of a critical IT system causing major disruption, followed by poor incident management that compounded the problem. The company involved was British Airways (BA), and I say poor incident management because this is what the public has perceived and what BA customers experienced. No doubt there will be an internal BA investigation into what went wrong, but as a BC professional I’d love to know about three aspects of the incident and BA’s response:

  1. How long did it take from the initial failure of the system for the IT support technicians to realise that they were dealing with a major incident? Who did they escalate the incident to (if anyone)? Were the people designated to handle major incidents contactable? And was the problem compounded by the fact that BA’s IT had been outsourced to India?
  2. The system that failed is so critical to BA’s operations that it must have had a Recovery Time Objective (RTO) of minutes, or at worst, a couple of hours. To achieve this, BA should have put in place a duplicate live version of the system (Active/Active). Either BA did not have such a recovery option in place (I’m guessing that they had a replica – Active/Passive), which implies that they failed to understand the need to have a very short downtime on the system, or it had not been properly tested and failed when required.
  3. Why were the communications with customers (people who were booked on BA flights) handled so badly? BA must have a plan to communicate with passengers, but was this dependent on the very system that failed?
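
The reasoning in point 2 can be sketched as a simple check: compare how long each recovery option is expected to take with the RTO. The failover times below are purely illustrative assumptions, not BA's figures, and the function name is hypothetical. In practice the only numbers worth trusting are those measured in a real test.

```python
from datetime import timedelta

# Illustrative failover times only (assumptions, not measured values).
ASSUMED_FAILOVER = {
    "active/active": timedelta(minutes=5),
    "active/passive": timedelta(hours=4),
    "restore from backup": timedelta(hours=24),
}

def options_meeting_rto(rto, failover_times=ASSUMED_FAILOVER):
    """Return the recovery options whose assumed failover time fits within the RTO."""
    return sorted(name for name, t in failover_times.items() if t <= rto)

# With an RTO of two hours, list the options that would still be acceptable.
print(options_meeting_rto(timedelta(hours=2)))
```

Under these assumed figures, an RTO of a couple of hours rules out everything except Active/Active, which is the point being made above.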

For me, even before the inquest takes place, the major lesson to be learned is that the effectiveness of an organisation’s BC and incident response plans can only be assured by actually using the plans and responding to incidents. If you don’t want to find this out in response to a real incident, then you need to run realistic and regular exercises so that every aspect of your response is tested and the people involved know what to do. It doesn’t matter how good your Business Continuity Management (BCM) process is, how closely aligned to ISO 22301 it is, how good the result of the latest BC audit, or how much documentation you have. It’s your ability to respond effectively and recover in time that matters.

BA have suffered damage to their reputation; how much is yet to be seen. They will have suffered financial damage, and when the London Stock Market opens for trading we’ll see how much it has affected their share price. Maybe BA do run realistic and regular exercises. If they do, they should have identified the issues with the systems and incident response that were encountered over the weekend and acted on the lessons learned.

 

 

Finally, at long last, there appears to be some real evidence that Business Continuity (BC) works. After years of effort trying to debunk the 80% myth (80% of organisations that don’t have a BC plan fail within 18 months of suffering from a major incident – or something similar), I’ve now seen some real research that demonstrates that BC does, in fact, have a beneficial impact.

The research takes the form of a study from IBM Security (conducted by the Ponemon Institute), which analyses the financial impact of data breaches. According to the study, having an incident response team was the single biggest factor associated with reducing the cost of a data breach, saving companies nearly $400,000 on average (or $16 per record). The study also found that the longer it takes to detect and contain a data breach, the more costly it becomes to resolve.

Admittedly, the study covers only cyber security, but at least it’s a start. It confirms the long-held assumption in BC circles that being able to quickly and effectively activate a response team to handle an incident is one of the most effective ways of reducing the impact of the incident on the organisation.

Now all we need is for someone to widen the research to cover all disruptive incidents. Anyone want to do a PhD in BC?

The report can be downloaded at http://www-03.ibm.com/security/data-breach/index.html.

I have just attended a very good Business Continuity (BC) conference held in Malaysia by GRC Consulting Services in conjunction with the Business Continuity Institute (BCI), but I couldn’t help being concerned about the fact that the standards industry is producing more and more management systems standards in and around the subject of BC.

Why is this happening? Well, to my mind, there seem to be two drivers behind this trend, neither of which is good for BC.

The first one, which an increasing number of people seem to be talking about, is that the main bodies behind the development of all these standards have discovered a rich source of revenue and are now exploiting this for all that it’s worth. These bodies claim to be “not for profit”, but like many such organisations there are large numbers of people engaged in standards activities who derive considerable profit from the work that they do. The more standards they produce, the more these people profit.

This driver is simply the age-old story of people making a profit when they can, and is not too dangerous as it will eventually come to an end when the people buying and using the standards come to realise what’s going on. The second driver, though, is much more dangerous, as it strikes at the heart of BC and has the capacity to cause enormous damage.

This second driver is the desire to make something that is difficult, complex, and demanding, and which requires considerable skill and experience, simple to implement through a process that can be implemented by a management system. To see what I mean, you need look no further than BS 65000, the recently published Guidance for Organizational Resilience, which, to quote the body that produced it – “This landmark standard provides an overview of resilience, describing the foundations required and explaining how to build resilience.”

Organizational Resilience is something that every company continuously tries to achieve. It is nothing new, and has been an essential goal ever since the first company was founded. Few manage it over the long term, and the life of most companies is very short as the products and services that they produce become outdated and overtaken by new trends, ideas, and inventions. If explaining how to build resilience can be described in a short pamphlet and implemented by anyone with the capability to read and follow a set of procedures, then how come it was missed by so many millions of people involved in the running of the hundreds of thousands of companies that have failed?

The international standard for Organizational Resilience (ISO 22316) is due to be published in 2016, which must be a great relief for all those organisations that are struggling to survive in the ever more competitive markets in which they operate. All they now have to do is implement the standard, be audited for compliance, and get the certificate. So much easier than researching and developing new products, finding new markets, producing the products and services at competitive cost, controlling cash flow, hiring and maintaining the right people with the right skills, complying with ever increasing legislation, developing and enhancing reputation, etc.

 

Finally, there is real concrete evidence that an organisation’s ability to recover is central to its immediate survival. Not its ability to recover after an incident, but its ability to demonstrate its recovery capability as perceived by others before any incident occurs. Business Continuity is now firmly centre stage.

According to The Times, senior UK government officials “want the Co-operative Bank to be sold to a bigger player that could stabilise its IT system, which is feared to be so precarious that the bank could not cope with a serious problem.” For years I’ve been telling senior executives that not being able to demonstrate the existence of credible and tested Business Continuity arrangements could mean the difference between survival and failure, and now I can point to a real example. Business Continuity is not just for use in response to an incident – it must be demonstrable to interested parties well before any incident takes place.

Apparently, in the risk factors disclosed in its annual report, the Co-operative Bank has stated that “whilst a basic level of resilience to a significant data outage is in place, the bank does not currently have a proven end-to-end disaster recovery capability”. How many organisations can really, hand on heart, state that they have a proven end-to-end disaster recovery capability? Not that many.

Business Continuity has been practised in the banking industry for more than 25 years, and many of today’s accepted Business Continuity ideas and practices started in banking. Where banking leads in Business Continuity, other industries follow.

How long will it be before organisations in other industries are put at risk because they do not have a proven end-to-end disaster recovery capability?

Resiliency, or rather Business Resilience, seems to be the flavour of the month in the Business Continuity and Risk industries. Apparently, businesses are moving away from having separate silos for Security, Risk, Health & Safety, Business Continuity, etc., and are bringing all these related disciplines under the heading of resiliency and appointing a Head of Resilience.

This all sounds quite good, and is for once a piece of joined-up thinking, except that the idea of Resiliency goes beyond these operational areas to the idea of ensuring that the business itself is resilient, which takes the discipline into the areas of leadership, reputation, innovation, product development, marketing, etc. In other words, it seems to be about everything that the business does, with a single manager appointed to ensure that the business remains resilient in the changing environment in which it operates.

Now, tell me if I’m wrong, but I thought that this was actually the point of a Board of Directors. One of the prime responsibilities of a Director of a company according to UK law is to “try to make the company a success, using your skills, experience and judgement”. In other words it is the responsibility of every Director of a company to ensure that the company is resilient – it should not be delegated to a manager as Head of Resilience.

The Business Continuity and Risk industries should either start talking about Operational Resilience, or stop talking about Resiliency.

Reading about one of the causes of the catastrophic failures at Mid Staffordshire NHS Trust, which led to more than 1,200 patient deaths, reminded me of a similar issue that plagues many implementations of Business Continuity Management (BCM) programmes. This was the Trust’s concentration on achieving targets that would enable them to get a good rating from the NHS auditors rather than the most important objective, which was to ensure that patients left hospital in a better state of health than when they were admitted.

The issue in many BCM implementations is that organisations are looking to get a good rating from their auditors by doing all the things that a standard states they should do, rather than working to achieve the most important objective, which is to improve the organisation’s resilience.

Setting targets based on readily measurable things is straightforward, and allows auditors to identify whether or not an outcome has been achieved, or how close it is to being achieved. Setting targets for things that are difficult to measure is problematic, and gives auditors a major problem when making an assessment. Unfortunately, the trend in many sectors over the past 20 years has been to rely more and more on these measurable targets when assessing performance, and to ignore the most important target. BCM has been no exception – achieving compliance against BS 25999 or ISO 22301 is commonly seen as the main objective, not becoming more resilient.

Hopefully, what has happened at Mid Staffordshire NHS Trust will be the start of the end of relying on peripheral, measurable targets, and the world will move back to looking at how well an organisation is achieving its critical objectives. Don’t bet the house on it though.

I’m very pleased that I’ve managed to get my latest client, a small electronics company that actually decided by themselves to implement Business Continuity Management (BCM) rather than being told to, to think about the maximum scale of incident that it wants to plan to survive. Many organisations shy away from this issue, which makes it difficult when advising on safe separation distances for backups and recovery sites, but my client’s management team understands the issues and will be coming up with an answer.

I think that the factor that will determine the answer is the geographic spread of their staff. If there is some kind of natural or man-made disaster that affects the homes and families of most of the staff then it is unlikely that they will want to come to work to help out their employer, particularly if their employer is asking them to work a significant distance from their families, who may have been evacuated.

If this is the case then we’re probably talking of their surviving an incident that has an effective radius of about 30km. Such an incident would take quite a catastrophic and unlikely event given that the client is nowhere near a nuclear or chemical facility, well away from the coast, and not in an earthquake zone or near an active volcano. The most likely widespread event is a river flood, but that doesn’t usually last more than a few weeks in the UK.
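
Once a planning radius like this has been agreed, checking whether a backup or recovery site sits safely outside it is simple arithmetic. Here is a minimal sketch, assuming the sites are described by latitude and longitude and that straight-line (great-circle) distance is a good enough measure. The coordinates and function names are hypothetical and have nothing to do with my client's real locations.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def outside_incident_radius(primary, candidate, radius_km=30.0):
    """True if the candidate site lies beyond the planned incident radius."""
    return distance_km(*primary, *candidate) > radius_km

# Hypothetical coordinates for illustration only.
primary_site = (52.00, -1.00)
recovery_site = (52.00, -0.40)
print(round(distance_km(*primary_site, *recovery_site), 1))
print(outside_incident_radius(primary_site, recovery_site))
```

The same check could be run against staff home locations to see how many people would fall inside the incident radius, which is the factor I expect to drive the client's answer.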

One of the things that I always look for when helping a client to implement Business Continuity is single points of failure. My latest client has managed to provide me with the best example of one yet, and the name of the single point of failure is Malcolm.

The client will remain nameless, to protect the innocent, but quite by chance it was revealed that one of their most critical and urgent activities is totally dependent on a single person working for an outsource supplier, and his name is Malcolm. If Malcolm is not available to do an activity that is on the critical path to enable one of the client’s most important services to be delivered, the client’s reputation will be destroyed. This service is delivered just once a year, and is vital to thousands of my client’s stakeholders.

I’ve never met Malcolm, but apparently he’s been undertaking this activity for many years, and it’s never failed. I can’t help wondering how old Malcolm is, or whether or not he’s in good health. I know who he works for, but again, I must protect the innocent.

I nearly missed this single point of failure, so from now on I’m going to redouble my efforts to find the Malcolms of this world.

 

One of the most frustrating things about being a Business Continuity practitioner is the lack of interest shown by so many people who should be more engaged.

A symptom of this can be seen on one of the Business Continuity discussion groups on LinkedIn, where someone is asking why it’s so hard to get people to complete their Business Continuity Plans. The simple answer is that, by and large, people aren’t interested and feel that they have much better things to do with their time.

To my mind, overcoming this view is one of the most significant challenges that a Business Continuity practitioner faces. Well done if you’ve managed to get most people in your organisation to take an active interest in the subject, but don’t despair if you haven’t. You need to be resilient, just like the organisation that you’re trying to help.