
It felt like a cyberattack, but it wasn’t.

On July 19, 2024, many airports stood still, their computer systems completely inoperable.  Car rental agencies and other companies that run their businesses on Microsoft’s operating system faced a similar problem.  The affected companies all used CrowdStrike’s Falcon to help secure their systems, and it was that combination that caused the outages, some of which spanned many days.

What Happened

It was just an update, and automatic software updates happen all the time.  You likely have your home computers set up to accept them.  Your phones, too.

The problem is, the update that slowed parts of the world to a pace worse than pre-internet business computing had a bug in it.  That bug was severe enough to crash computers and render them unusable until someone made a manual change to EACH affected machine. Each.

The result?  8.5 million Windows computers and servers stopped working. It wasn’t a cyberattack; it was a bug.

How’d It Happen?

Companies that develop the software we use typically operate with a development team (the people writing the software), a project manager who “owns” each project’s scope and its success or failure, and a quality assurance team that runs manual and automated tests against the software and its changes.

Adding to the complexity, programs often read external configuration from settings files.  So, if you and I are running the same version of the same program but our configuration files differ, we may get different results from it.
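To make that concrete, here is a minimal sketch in Python (a made-up file scanner, not CrowdStrike’s code; the file names and settings are invented for illustration) of how the same program can give different answers depending only on which configuration file it reads:

```python
import json

def scan(file_names, config_path):
    """Flag files whose extensions appear in the configuration file."""
    with open(config_path) as f:
        config = json.load(f)
    flagged = set(config.get("flag_extensions", []))
    return [name for name in file_names if name.rsplit(".", 1)[-1] in flagged]

files = ["report.docx", "setup.exe", "notes.txt"]

# Same program, same files, two different configuration files, two different answers.
# config_a.json contains: {"flag_extensions": ["exe"]}
# config_b.json contains: {"flag_extensions": ["exe", "docx"]}
print(scan(files, "config_a.json"))  # ['setup.exe']
print(scan(files, "config_b.json"))  # ['setup.exe', 'report.docx']
```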

On July 19, CrowdStrike’s Falcon EDR (endpoint detection and response—think of a very powerful, more robust antivirus program) software sent out an update that affected interactions with the kernel, the very core of the computing system.  It took down Microsoft Windows. Test teams had presumably blessed the update, likely because their test configurations didn’t trigger the problem and weren’t in sync with, oh, the rest of the world.  At least that’s what they told Congress.

From a Dark Reading article: “Essentially, the update caused Falcon Sensor to try and follow a threat detection configuration for which there were no corresponding rules on what to do. ‘If you think about a chessboard [and] trying to move a chess piece to some place where there’s no square,’ [Adam] Meyers said. ‘That’s effectively what happened inside the sensor. This was kind of a perfect storm of issues.’”
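As a greatly simplified analogy only (this is hypothetical Python, not Falcon’s actual kernel-mode code), the failure pattern looks like a program following a configuration that points at a rule it doesn’t actually have, so the lookup fails and the whole program crashes:

```python
# Rules the program actually knows how to run.
detection_rules = {
    "rule_1": "check process behavior",
    "rule_2": "check network activity",
}

# The new configuration references a rule that has no corresponding entry,
# like trying to move a chess piece to a square that isn't on the board.
channel_config = ["rule_1", "rule_2", "rule_3"]

for rule_id in channel_config:
    action = detection_rules[rule_id]  # KeyError on "rule_3": the program crashes here
    print(f"{rule_id}: {action}")
```

In ordinary software, a crash like this takes out one program.  Because Falcon interacts with the kernel, the crash took Windows down with it.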

Oops!

How Can We Avoid the Same Problem?

Two things can help us avoid a similar problem.  One needs to be implemented by software vendors, and the second by the people who purchase the software—at least in a business scenario. Consumers were unlikely to be using CrowdStrike Falcon, but if you were affected at home, I’d love to hear about it.

First, software vendors need to fully test using the actual configurations their customers run. If they need to work with customers to obtain those, okay! Considering the 8.5 million affected computers, those configurations must have been fairly standardized.

Second, companies need to interrupt the cycle of automated updates by testing updates as they come through.  It’s difficult, time-consuming, and otherwise a pain, and it even carries some risk, but companies with robust update handling that did interrupt the cycle were able to stop the CrowdStrike Falcon automated update.
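One common way to build that interrupt is a staged rollout.  Here is a minimal sketch in Python (the ring names and waiting periods are invented for illustration, not taken from any vendor’s tooling): only a small test group receives an update on day one, and everyone else waits until those machines prove healthy.

```python
from datetime import date

# Hypothetical staging policy: which machines get an update, and when.
rollout_rings = [
    {"name": "IT test lab",   "delay_days": 0},  # gets the update immediately
    {"name": "pilot users",   "delay_days": 2},  # after the lab verifies health
    {"name": "everyone else", "delay_days": 7},  # only once earlier rings are stable
]

def rings_eligible_today(release_date: date, today: date):
    """Return the rings allowed to install an update released on release_date."""
    days_since_release = (today - release_date).days
    return [r["name"] for r in rollout_rings if days_since_release >= r["delay_days"]]

print(rings_eligible_today(date(2024, 7, 19), date(2024, 7, 19)))  # ['IT test lab']
print(rings_eligible_today(date(2024, 7, 19), date(2024, 7, 26)))  # all three rings
```

The specific numbers don’t matter; what matters is that someone gets a chance to say “stop” before an update reaches every machine.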

What Can We Learn?

Things are going to happen, and every organization, from the smallest business or non-profit on up, needs to understand what might happen if its computer systems were suddenly compromised.  How would it operate?  What manual processes would it revert to, and for how long?

In the cybersecurity business, we call this an incident response plan that leads to a business continuity plan. Organizations need to conduct what are called “tabletop exercises” where they talk through scenarios like this one.  What would they do?  The best experience I can share regarding something like this was in the ’90s when I was working at KFC through college and the computerized registers went down.  We had to take customers’ orders on paper and manually figure tax by using the pricing on the menu boards.  And then when everything returned to normal, we had to reconcile what happened.  It was double the work.

Computers are amazing and truly help automate much of what used to be manual.  We can’t lose the knowledge of how to operate in an offline world, and we need to be intentional about testing changes in a formal or semi-formal change management process.

I hope no one’s travel was impacted in July, but I recognize some of you may have been. Losing computer systems temporarily is a lot like losing power to your house.

It’s workable if you have a plan.

Heather Noggle

Owner, Codistac
hnoggle@codistac.com
https://www.codistac.com
https://www.linkedin.com/in/heathernoggle/
