Lessons from Today’s CrowdStrike Outage: How to Navigate Software Release Challenges

Josh Thorngren

July 19, 2024

Today brought what may be one of the biggest global IT outages to date, with systems around the world going offline because of a bug introduced in a software update from CrowdStrike. Emergency services, aviation, banking: no industry was left untouched.

From what we know today, this outage was caused by a bug in CrowdStrike’s Falcon platform, which was triggered by a content update pushed to customers overnight. For customers running Falcon on Windows, this bug then caused system crashes and blue screens of death.

It may be some time before the full impact of this outage is determined. What’s clear is that the magnitude of the outage is unprecedented. So, instead of diving into ‘what happened’, let’s talk about the challenges of delivering software at global scale and what engineering teams (and company leaders) can do to improve reliability from a people, process, and tools perspective.

First off, building software is complicated. Doing this globally, even more so. Building software that’s free of all possible bugs and vulnerabilities is (all but) impossible, especially if you want to actually release that software to customers. So there’s no simple answer of “just catch and fix bugs”. Instead, there’s a multi-layered approach to reliability in how teams plan, develop, and release software updates.

In Planning: Prioritize Tech Debt Next To Features (Not Below)

Organizations need to prioritize technical debt. Engineers often find that fixing tech debt takes second priority to feature development. Features drive revenue and are usually more enjoyable to work on, while fixes that customers may never notice often don’t get the same attention.

However, outages like today’s, along with the fact that over 75% of ransomware attacks target known but unpatched vulnerabilities, clearly indicate that maintaining and updating existing codebases is just as crucial as adding new features.

Organizations must treat technical debt and security updates as a roadmap priority, continuously working on patches, updates, and bug fixes alongside new feature development. Technical debt should have a multi-sprint roadmap that outlines how teams are investing in ongoing quality and security improvements. This roadmap should be shared both internally and with customers so the work is visible and carries real weight. Most importantly, executive leadership has to commit to protecting this work and ensuring that engineering teams aren’t continually asked to set it aside in favor of new features.

In Development: Perform Rigorous Functional Testing

For a platform like CrowdStrike’s Falcon, regular updates to both the product and its definition files (which help it detect threats) are standard practice. These types of routine updates often follow different testing processes than larger feature updates, where developers exhaustively test the behavior of new code. This type of functional testing helps ensure that applications behave as expected, without introducing bugs or crashes.

Functional testing must be conducted for every type of update, even minor changes. Many safety-critical industries, including automotive companies and medical device makers, mandate this type of testing for their software to prevent incidents like today’s. Whenever code changes, applications should be tested to ensure they behave correctly in both expected and unexpected conditions.
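
To make that concrete, here is a minimal sketch in Python of what testing "expected and unexpected conditions" might look like for a routine content update. Everything in it is hypothetical and for illustration only: the `name:severity` record format, the `parse_rule` function, and the `RuleFormatError` type are assumptions, not anything taken from CrowdStrike's Falcon or from Mayhem. The idea is simply that well-formed input should parse correctly, and malformed input should be rejected cleanly instead of crashing the program.

```python
import random


class RuleFormatError(ValueError):
    """Raised when a content record is malformed (hypothetical error type)."""


def parse_rule(record: bytes) -> dict:
    """Parse a hypothetical 'name:severity' content record.

    Malformed input is rejected with RuleFormatError instead of being
    allowed to surface as an unexpected crash.
    """
    try:
        text = record.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise RuleFormatError("record is not valid UTF-8") from exc

    name, sep, severity = text.partition(":")
    # Require a separator, a non-empty name, and an ASCII-digit severity.
    if not sep or not name or not (severity.isascii() and severity.isdigit()):
        raise RuleFormatError(f"malformed record: {text!r}")
    return {"name": name, "severity": int(severity)}


def run_functional_checks(iterations: int = 1000, seed: int = 0) -> int:
    """Exercise expected and unexpected inputs.

    Returns how many random inputs were cleanly rejected. Any exception
    other than RuleFormatError propagates and fails the check.
    """
    # Expected condition: a well-formed record parses to the right values.
    assert parse_rule(b"suspicious-pipe:3") == {
        "name": "suspicious-pipe",
        "severity": 3,
    }

    # Unexpected conditions: random bytes must never cause an unhandled error.
    rng = random.Random(seed)
    rejected = 0
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(32)))
        try:
            parse_rule(blob)
        except RuleFormatError:
            rejected += 1
    return rejected


if __name__ == "__main__":
    print("cleanly rejected inputs:", run_functional_checks())
```

Random bytes are a crude stand-in for the far more targeted inputs that fuzzing and symbolic execution tools generate, but the pass criterion is the same: no input should produce an unhandled failure.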

Fortunately, technologies like symbolic execution and generative AI have made it easier for engineering teams to automate the creation and execution of functional tests on a large scale. This automation helps identify and resolve potential issues before they can cause disruptions.

(In case you’re wondering, Mayhem uses AI, symbolic execution, fuzz testing, binary analysis, and more to create and run functional tests specific to your application, automating this step for engineering teams. Get in touch if you want to learn more.)

At Release: Stagger Updates vs. All-at-Once

Many software vendors automate updates to ensure their customers always have the latest features and security patches. This automation is crucial not only for delivering new functionality but also for addressing vulnerabilities, such as zero-day threats, by patching vulnerable systems without delay.

However, this same automation can cause widespread issues when a new bug is introduced, as it affects all customer systems simultaneously, leading to significant outages like today’s. The challenge is to reap the benefits of automation while mitigating its risks.

To achieve this balance, IT and DevOps teams should adopt a “canary in the coal mine” approach to releases. This means initially deploying updates to a small group (or groups) of systems before a full rollout. By monitoring how the update behaves within this smaller group, teams can identify and resolve issues early.

Once the update is verified and deemed stable within the initial group, a broader (automated) rollout can proceed to all customers.

This approach allows organizations to enjoy the advantages of automation while incorporating additional controls and checks to minimize disruptions if problems arise.
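
As a rough illustration of the canary approach described above, the Python sketch below deploys to progressively larger groups and stops at the first stage that fails its health check. It is a simplified sketch under stated assumptions, not a description of how any particular vendor ships updates: the `staged_rollout` function, the `deploy` and `is_healthy` callbacks, and the 1% / 10% / 100% stage sizes are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy in growing stages and halt at the first unhealthy stage.

    `deploy` and `is_healthy` are hypothetical callbacks supplied by the
    caller. Returns the hosts that received the update.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        if not batch:
            continue

        for host in batch:
            deploy(host)
        updated.extend(batch)

        # Check the canary group before widening the rollout.
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            break  # Halt here; the remaining hosts never get the update.
    return updated


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    # Simulate a bad update: every host reports unhealthy after deploying.
    reached = staged_rollout(fleet, deploy=lambda h: None, is_healthy=lambda h: False)
    print(f"update reached {len(reached)} of {len(fleet)} hosts")
```

With a fleet of 100 hosts, the first stage covers a single host, so an update that fails its health check stops there instead of reaching the other 99. A real pipeline would also wait between stages and roll back the hosts that were already updated, which this sketch leaves out.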

Navigating Future Software Challenges

Today’s CrowdStrike outage highlights the challenges of managing software updates at scale. While it’s important to figure out exactly what caused the outage, it’s equally important to learn from today’s incident. Prioritizing technical debt alongside feature development, performing rigorous functional testing, and adopting a staggered release schedule are all critical steps toward improving software reliability and reducing the risk of major disruptions in the future.

Prevent Disruptions With Mayhem’s Advanced Testing

Integrating security testing solutions into your software development process is essential to preventing disruptions like the CrowdStrike outage. Mayhem is an advanced security testing solution that uses AI, symbolic execution, and fuzz testing to automatically generate and run thousands of tests to identify defects in your software.

This automated approach not only improves test coverage but also ensures that potential defects are identified and addressed before they impact your systems. Mayhem’s automated regression testing ensures that updates don’t break critical functionality. By adopting Mayhem, you can significantly improve your software’s reliability and resilience, making your updates more secure and reducing the risk of major outages.
