
Unravelling the CrowdStrike Software Update

On Friday, July 19th, CrowdStrike (a cyber security company known for providing endpoint security, threat intelligence, and cyber attack response services) released a problematic update that caused nearly 8.5 million Windows PCs to become stuck in a boot loop, resulting in widespread ‘Blue Screens of Death’ (BSOD).

As a result, the UK’s NHS was hit by major disruption, leading to a ‘considerable’ backlog that will take ‘some time’ to clear. Meanwhile, thousands of flights were halted, leaving passengers stranded at airports worldwide, and businesses, banks, and retailers using Microsoft Windows operating systems were left in chaos.

A CrowdStrike outage update was released four days later to announce that a “significant” number of affected systems had been fixed, although recovery would take time.

Luckily for many, the update did not impact Mac or Linux systems.

But how could a single software update cause IT chaos worldwide, and what can we learn from this?

Identifying the root cause

The CrowdStrike blue screen of death is said to have been caused by a defective update for CrowdStrike’s Falcon product, which is designed to detect and respond to cyber attacks. The faulty update triggered a kernel panic, crashing Windows machines.

💡 A kernel is the core part of an operating system that manages everything in the computer. It controls hardware, runs programs, and ensures everything works together smoothly. If something goes wrong in the kernel, it will cause serious issues, like crashes or security vulnerabilities.

According to CrowdStrike, the update did not include any new kernel drivers; rather, it was a configuration update that triggered a logic error within CrowdStrike’s existing kernel driver. Unlike regular software, CrowdStrike, like other host-based security solutions, has kernel-level access so that it can detect and respond to intrusions. To keep software such as CrowdStrike up to date with the most recent threats, these configuration updates are pushed several times a day. Even at this frequency, adequately testing each update for errors and bugs remains critical, especially given the impact a faulty update can have, as this incident shows.
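
To make the failure mode more concrete, here is a minimal, purely hypothetical C sketch; it is not CrowdStrike’s code, and the channel_header structure, field names, and load_rules functions are invented for illustration. It shows how kernel-mode code that trusts values read from a configuration file can crash the entire machine when that file is malformed, and how a simple bounds check rejects the bad input instead.

/* Purely illustrative sketch, not CrowdStrike's actual driver code.
 * Hypothetical layout for a detection-rule ("channel") file and a
 * parser that trusts it blindly. In a real endpoint driver this code
 * runs with kernel privileges, so an invalid memory access crashes the
 * whole machine (kernel panic / BSOD) rather than a single process. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct channel_header {
    uint32_t rule_count;     /* number of rules the parser expects */
    uint32_t rule_offsets[]; /* offsets into the rest of the file  */
};

/* BUG: trusts rule_count from the file without checking that every
 * offset entry actually fits inside file_len, so a config update whose
 * header claims more rules than the file contains makes the loop read
 * past the end of the buffer. */
int load_rules(const uint8_t *file, size_t file_len)
{
    const struct channel_header *hdr = (const void *)file;
    (void)file_len;                              /* never validated */
    for (uint32_t i = 0; i < hdr->rule_count; i++)
        printf("loading rule at offset %u\n", (unsigned)hdr->rule_offsets[i]);
    return 0;
}

/* Safer variant: validate sizes before dereferencing anything, and
 * reject the configuration file instead of crashing on it. */
int load_rules_checked(const uint8_t *file, size_t file_len)
{
    if (file_len < sizeof(struct channel_header))
        return -1;
    const struct channel_header *hdr = (const void *)file;
    size_t needed = sizeof(*hdr) + (size_t)hdr->rule_count * sizeof(uint32_t);
    if (needed > file_len)
        return -1;                               /* malformed update */
    for (uint32_t i = 0; i < hdr->rule_count; i++)
        printf("loading rule at offset %u\n", (unsigned)hdr->rule_offsets[i]);
    return 0;
}

The point is not the exact bug CrowdStrike shipped, which they attribute to a logic error in the existing driver, but that kernel-level code has no safety net: the operating system cannot simply terminate it and carry on.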

Many took to various platforms to bash CrowdStrike after numerous QA staff were recently laid off, claiming the incident serves as a stark reminder that tech teams are indispensable and testing environments should be up to scratch. Meanwhile, others have hit out at CrowdStrike employees for inadequate testing and rollout control, with some comparing the rollout to skydiving ‘without testing your parachute’!

Users across social media compared the CrowdStrike case to McAfee’s 2010 incident, when CrowdStrike’s CEO, George Kurtz, was McAfee’s CTO. In a strikingly similar disaster, a McAfee update mistakenly identified a critical Windows system file, svchost.exe, as malware, triggering a near-worldwide shutdown of Windows XP PCs.

In a recent interview with Today, George Kurtz, CEO of CrowdStrike, apologised to all customers.

How cyber criminals took advantage of the CrowdStrike incident

While CrowdStrike and Microsoft have worked together to resolve and mitigate the immediate damage, cyber criminals are now capitalising on the chaotic update through phishing and malware campaigns.

Cyber criminals took the opportunity to exploit the CrowdStrike incident by distributing malicious ZIP archives named ‘crowdstrike-hotfix.zip’, which contain a HijackLoader payload that loads RemCos. It’s thought that this campaign was created to target Latin America-based CrowdStrike customers.

Customers are advised to be wary of cyber criminals posing as CrowdStrike support, creating phishing campaigns to target affected customers, and selling scripts to automate recovery.

For more information and guidance, check out CrowdStrike's official guide and Microsoft's official guide.

Key learnings from the incident

The CrowdStrike disaster highlights the necessity of having a robust incident response plan. This includes predefined protocols for different types of incidents, communication strategies to keep stakeholders informed, and recovery procedures to restore services as quickly as possible. It is incredibly important to keep your business continuity processes up to date and to ensure your teams regularly exercise their ability to execute them when needed.

This is a timely opportunity for IT departments to review their failover and redundancy plans: no matter how secure your infrastructure is, you can never be too careful where third parties are involved. Understanding the risk that comes with the levels of access granted to third parties is crucial to ensuring that your processes account for these worst-case scenarios.

The CrowdStrike outage should also be a reminder to introduce backup systems that can take over during an outage, regularly test failover processes, and routinely conduct tabletop exercises. Updates should be extensively tested in a staging environment that closely mirrors the production environment to catch issues before they affect users. Implementing a phased rollout strategy to gradually deploy updates is always a good idea, and never deploy on a Friday!
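
As a rough sketch of what a phased rollout can look like in practice (the hostnames, stages, and hashing scheme below are assumptions for illustration, not any vendor’s actual mechanism), the following C example assigns each host a stable bucket from 0 to 99 and only applies an update once the configured rollout percentage has reached that bucket, so a bad update hits a small canary ring before it reaches the whole fleet.

/* Illustrative phased-rollout gate, assuming each host has a stable name. */
#include <stdio.h>
#include <stdint.h>

/* FNV-1a hash: deterministic, so a given host always lands in the same
 * bucket across rollout stages. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* A host receives the update only once the rollout percentage has
 * grown past its bucket (a stable value in 0..99). */
static int update_enabled(const char *hostname, unsigned rollout_percent)
{
    return fnv1a(hostname) % 100 < rollout_percent;
}

int main(void)
{
    const char *fleet[]  = { "web-01", "web-02", "db-01", "ci-runner-7" };
    unsigned    stages[] = { 1, 10, 50, 100 };   /* canary -> full rollout */

    for (size_t s = 0; s < sizeof(stages) / sizeof(stages[0]); s++) {
        printf("rollout at %u%%:\n", stages[s]);
        for (size_t i = 0; i < sizeof(fleet) / sizeof(fleet[0]); i++)
            printf("  %-12s %s\n", fleet[i],
                   update_enabled(fleet[i], stages[s]) ? "update" : "hold");
    }
    return 0;
}

Pausing or rolling back at the first stage when telemetry shows boot failures is what turns a fleet-wide outage into a contained incident.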

And with the incident occurring following a round of CrowdStrike QA lay-offs, a final key learning is that tech teams are the bread and butter of any resilient organisation, along with a rigorous QA process!

The recent CrowdStrike outage underscores the importance of preparedness, communication, resilience, and continuous improvement in handling cyber security incidents and maintaining trust with stakeholders.

Ben Spring
Jul 24, 2024
