A closer look at the CrowdStrike and Microsoft incident
Posted by Bethan Timmins . Jul 24.24
Over the weekend, we all had a front-row seat to what has been called ‘the largest IT outage in history’. Crucial infrastructure was brought to a standstill all over the world, with airlines, banks and our essential services all impacted.
An estimated 8.5 million Microsoft Windows machines were struck down by a faulty update pushed out by cybersecurity firm CrowdStrike.
41KB of code. That is all it took to bring down the world. So, how did a simple change create such carnage?
First, the fault occurred during a routine update to security software that runs deep within the Windows operating system, which is often the technology of choice across large organisations because of its resilience. The same issue did not manifest in other operating systems such as Linux or macOS.
Secondly, there may be a ‘butterfly effect’ in play, where tiny variations in inputs can produce comparatively large output variations, leading to instability and unpredictable system behaviours. That doesn’t mean we can’t design for resilience and recovery, however.
I am not an insider, so I have no internal information beyond what has been made public. But I will talk you through the process of ensuring high resilience and ask what could have been done to reduce not only the impact but also the likelihood of the fault arising in the first place.
It is also important to note that this was not a security breach. A bug was inadvertently released into code, and that is what caused the outage. It wasn't malicious, and mistakes happen. For it to get this far, there must have been systemic failures across multiple organisations, not just the fault of a single engineer having the worst day of their career!
To explain how checks and balances can fail, resulting in an outage, here's a step-by-step breakdown of the potential issues that could have been encountered at each stage of the software deployment and operations process:
- Inadequate testing: The faulty code that initially triggered the outage was not adequately tested against the operating systems it was meant to support. CrowdStrike's oversight allowed a critical bug to pass through initial quality checks undetected. In modern software engineering, many layers of test automation can be applied to reduce the likelihood of bugs reaching production (a minimal sketch of one such check follows this list). Some organisations even take this as far as intentionally breaking their own systems in production, a practice known as chaos engineering, to ensure they can respond effectively.
- Change management: The process for accepting changes and moving them into production was flawed. The typical practice of decoupling the release to production from the actual launch, often implemented with feature flags (see the sketch after this list), allows for additional checks and balances that were seemingly bypassed or ineffective in this case.
- Quality and canary testing environments: These environments are designed to catch potential failures before they affect all users. However, it appears that the canary testing – deploying the change to a small segment of the environment first – did not function as intended, or the results were not adequately monitored and acted upon.
- Global rollout strategy: The rollout plan did not incorporate a phased, regional approach. Instead, the update was deployed globally all at once, which amplified the impact of the outage. A staggered rollout could have contained the problem regionally, significantly reducing overall disruption (a combined canary, monitoring and rollback sketch follows this list).
- Delayed diagnosis: Diagnosing the problem took an unusually long time, indicating potential gaps in monitoring and alerting systems. Effective diagnostic tools and processes should have identified and isolated the issue, and alerted the relevant parties, much sooner, preventing global impact.
- Lack of rollback strategy: No clear rollback strategy was in place, making it impossible to quickly revert to a previous stable version of the software. This lack of a safety net prolonged the outage and complicated recovery efforts.
- Manual update requirement for fixes: The fix that was eventually developed required manual intervention to update devices rather than being pushed automatically. This approach slowed down the resolution process and increased the burden on both individual users and organisational IT teams.
- Centralised operations model: CrowdStrike's operating model, whether it was overly centralised or not adequately equipped for high availability, likely contributed to the problem. In high-stakes environments, operations should ideally be decentralised (following a ‘You Build It, You Run It’ model) to enhance responsiveness and accountability.
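To make the testing point concrete, here is a minimal sketch of the kind of guard that lives in a layered, automated test suite: a check that refuses to release a content update that is empty, zero-filled or structurally malformed. The validate_content_update function, the field count and the record format are all invented for illustration; they are not CrowdStrike's actual mechanism or root cause.

```python
# A minimal sketch of one automated pre-release check, using pytest-style
# tests. All names, formats and thresholds are hypothetical.

EXPECTED_FIELD_COUNT = 21  # assumed schema size for this illustration


def validate_content_update(record: bytes) -> bool:
    """Reject content records that are empty, all zeroes, or malformed."""
    if not record or set(record) == {0}:
        return False
    fields = record.split(b",")
    return len(fields) == EXPECTED_FIELD_COUNT


def test_rejects_all_zero_record():
    assert validate_content_update(b"\x00" * 41_000) is False


def test_rejects_wrong_field_count():
    assert validate_content_update(b"a,b,c") is False


def test_accepts_well_formed_record():
    record = b",".join(b"field%d" % i for i in range(EXPECTED_FIELD_COUNT))
    assert validate_content_update(record) is True
```

In a layered suite, a check like this sits alongside integration tests against each supported operating system version and, further out, chaos-style experiments in production.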
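Decoupling release from launch is commonly implemented with feature flags: the code ships to production switched off, and turning it on is a separate, reversible configuration change. A hedged sketch, with the flag store and all names invented for this example:

```python
# Illustration of decoupling release (deploying code) from launch
# (enabling it). The in-memory flag store is a stand-in for whatever
# flag service or config system you actually use.

FLAGS = {"new_detection_logic": {"enabled": False, "rollout_percent": 0}}


def is_enabled(flag_name: str, machine_id: int) -> bool:
    """Deployed code checks the flag at runtime; launch is a config change."""
    flag = FLAGS.get(flag_name, {"enabled": False, "rollout_percent": 0})
    if not flag["enabled"]:
        return False
    # Deterministic percentage rollout: the same machine always gets the
    # same answer, so a partial launch is stable and observable.
    return machine_id % 100 < flag["rollout_percent"]


def process_event(event: dict, machine_id: int) -> str:
    if is_enabled("new_detection_logic", machine_id):
        return "handled-by-new-logic"
    return "handled-by-existing-logic"
```

Because launch is just a configuration change, switching the flag off again is also the fastest rollback available.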
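Canary testing, a phased regional rollout, monitoring and rollback can be combined into a single guarded pipeline. The sketch below assumes placeholder deploy, crash_rate and rollback functions standing in for whatever delivery and observability tooling an organisation actually has; the regions, soak time and threshold are likewise illustrative.

```python
# A sketch of a phased, region-by-region rollout with a canary gate and
# automatic rollback. The deploy/crash_rate/rollback functions are
# placeholders to be wired to real tooling.
import time

ROLLOUT_WAVES = [["canary-fleet"], ["eu-west"], ["us-east", "us-west"], ["apac"]]
MAX_CRASH_RATE = 0.001   # abort if more than 0.1% of hosts report crashes
SOAK_SECONDS = 1800      # observe each wave before expanding


def crash_rate(region: str) -> float:
    """Placeholder: query your monitoring system for the region's crash rate."""
    raise NotImplementedError


def deploy(version: str, region: str) -> None:
    """Placeholder: push the update to one region only."""
    raise NotImplementedError


def rollback(previous_version: str, region: str) -> None:
    """Placeholder: revert a region to the last known-good version."""
    raise NotImplementedError


def phased_rollout(version: str, previous_version: str) -> bool:
    completed = []
    for wave in ROLLOUT_WAVES:
        for region in wave:
            deploy(version, region)
        time.sleep(SOAK_SECONDS)  # let monitoring catch regressions
        if any(crash_rate(region) > MAX_CRASH_RATE for region in wave):
            # Halt the rollout and revert every region touched so far.
            for region in wave + completed:
                rollback(previous_version, region)
            return False
        completed.extend(wave)
    return True
```

The essential property is that the blast radius at any moment is a single wave, and breaching the health threshold both stops further expansion and reverts what has already shipped.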
Each of these steps illustrates a failure in the typical safeguards that should prevent such widespread software outages. This worst-case scenario highlights how crucial robust testing, monitoring and operational procedures are in software development and deployment.
CrowdStrike now faces severe financial, reputational and potential legal repercussions from this mishap, including a staggering $84 billion drop in valuation and widespread service disruptions impacting critical government infrastructure, transportation and millions of customers. Other organisations can learn crucial lessons from it to prevent similar disasters. The incident also led to a 0.74% decrease in Microsoft's value and caused billions in lost revenue for numerous companies, underscoring the far-reaching consequences of software failures.
What proactive steps can your business take to mitigate the risk of software failures?
- Establish a robust delivery mechanism: Develop a product delivery system that allows for the rapid deployment of new features and ensures their safe delivery and resilience in the event of failures.
- Integrate operations into development: Ensure that operational considerations are integral to your development pipeline. Choose the right operating model, be it a 'You Build It, You Run It' (YBIYRI) model or centralised operations, to suit your service needs.
- Evaluate dependencies on major platforms: If your operations heavily rely on major platforms like Microsoft, assess the resilience of your systems against potential vulnerabilities within these platforms and their associated networks.
- Work with an experienced partner: To reduce the risk of your organisation making the front page for all the wrong reasons, our Floodlight (discovery) service helps you understand your current capabilities and where you need to improve. You can find out more about Floodlight here.
By addressing these points, your business can enhance its readiness against software disruptions, ensuring continuous service and reliability and mitigating the risks of making unfortunate headlines around the world.