
The global tech outage showed how we’re just one mistake away from chaos

In case last week’s events — an assassination attempt, a new Republican vice presidential nominee, the sitting president contracting Covid before dropping his reelection bid — didn’t leave you feeling sufficiently anxious about the fragility of the global order, let’s not forget that a...

Friday's global CrowdStrike outage was bad. It could have been much worse.


While you might not have known the name CrowdStrike before, it’s unlikely you’ll forget it soon. With a single bug in a routine software update, the company triggered what was likely the biggest computer outage in history — creating the kind of tech meltdown that its products are designed to prevent.

While CrowdStrike said the flawed update had been rolled back, the problems it caused can’t be solved with the old “turn it off and turn it back on” fix most of us are accustomed to. As my colleague Brian Fung reported, the bug that put Windows computers into Blue Screen of Death mode is fixable. But in many cases, it requires painstaking work by a human being.

Now might be a good moment to buy your IT staff some good coffee and a bagel spread because each and every affected device — for some organizations, we’re talking thousands — will likely have to be assessed by an admin and rebooted into safe mode, and then the offending file can be deleted by hand.

“You can’t automate that,” said Kevin Beaumont, a security researcher and former Microsoft threat analyst, in a post on X. “So this is going to be incredibly painful for CrowdStrike customers.”

And even if your business had nothing to do with CrowdStrike, the outage still might have ruined your day.

Think of a cafe that uses third-party online reservation services, contracts out its delivery orders and accepts credit and debit cards through its point of sale, which is connected to payment processor back-end systems. You didn’t have to be a CrowdStrike customer to get screwed by the company’s mistake, and that’s what made Friday’s outage so frustrating.

We’ve had scary outages before, and we will certainly have them again. But the scale of the CrowdStrike outage is once again underscoring just how interconnected the world has become through a network almost none of us understands and which is largely self-regulating.

“There are organizations that we’re heavily dependent upon that we don’t even realize how dependent we are until they stop functioning,” said Stuart Madnick, a professor of information technology at the MIT Sloan School of Management.

Microsoft estimated the CrowdStrike outage affected some 8.5 million Windows devices. Airlines canceled 5,000 flights around the world Friday, while delays persisted through the weekend and into Monday. Hospitals and government services were throttled, and in some areas 911 communications stopped working.

It’d be easy to put all the blame on CrowdStrike for its sloppy system update, or the airlines for not building robust backup protocols, or even Microsoft for dominating the personal computing market. But IT experts told me there are broader systemic problems at play here.

The centralized nature of cybersecurity companies means that we now have “a few big failure points,” said Anil Khurana, executive director of the Baratta Center for Global Business at Georgetown’s McDonough Business School. “That by itself is not bad, because proliferation actually makes diagnostics even more difficult.”

But companies need “a better model of operational redundancy and back-ups,” Khurana said. “Our tech platforms have a mix of legacy systems coupled with modern systems, which means that the weakest link determines the overall system performance. I call it a ‘house of cards’ model.”

Right now, there are safeguards in place, but regulators around the world have been snoozing on cybersecurity risk management. IT systems are truly critical infrastructure, Khurana said, which suggests they “ought to go through the same kind of rigor, testing and oversight that we see for the likes of Boeing or JPMorgan.”

I asked Madnick whether the world should expect more mass outages.

“This was pretty bad as it is,” he said. “Could it get worse? The answer is yes, it could.”

As onerous and time-consuming as the manual reboot of millions of devices is, Friday’s outage was ultimately a one-off mistake by a company that moved quickly to fix it.

A bad actor looking to do serious damage, Madnick said, could use software to “make computers or other equipment blow up, catch fire, burn — in which case, you don’t just reboot it, it’s destroyed.”

OK, so there’s one nightmare scenario to make us all yearn to go live in a cave. But before you start stockpiling canned goods, Madnick has another way to look at our modern predicament.

“There are a lot of benefits that these technologies give us that really pay off, 99% of the time,” he said. The most important thing is to prepare for the 1% of the time when things go wrong.

Even businesses that were not CrowdStrike customers were affected by the company's bug, because many relied on third-party services tied to the affected systems. The incident underscores how interconnected modern digital infrastructure is, and why businesses need robust backup protocols and operational redundancy.
