CrowdStrike blames faulty testing software for the buggy update behind worldwide outage
CrowdStrike has blamed faulty testing software for the buggy update that crashed 8.5 million Windows machines worldwide, the company said in a post incident review (PIR). It explained that a bug in its Content Validator allowed an update containing problematic data to pass validation. CrowdStrike has pledged a series of new measures to prevent a repeat of the problem.
The massive blue screen of death (BSOD) outage affected numerous businesses worldwide, including airlines, broadcasters, the London Stock Exchange and many others. The issue forced Windows machines into a boot loop, and technicians needed local access to each machine to recover it (Mac and Linux machines were not affected). Many companies, such as Delta Air Lines, are still recovering.
To prevent DDoS and other types of attacks, CrowdStrike has a tool called Falcon Sensor. It ships with content that runs at the kernel level (called Sensor Content), which uses a “Template Type” to define how it defends against threats. When something new appears, the company ships “Rapid Response Content” in the form of “Template Instances.”
A Template Type for a new sensor was released on March 5, 2024, and it performed as expected. However, on July 19, two new Template Instances were released, and one (just 40KB in size) passed validation despite having “problematic data,” CrowdStrike said. “When received by the sensor and loaded into the Content Interpreter, [this] resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
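To make the failure mode concrete, here is a minimal Python sketch of how a validator and a bounds-checked read could keep a malformed content update from crashing its interpreter. It is an illustration only, not CrowdStrike's code: the `ContentUpdate` class, the `validate`, `read_field` and `interpret` functions, and the idea that an update is a simple list of fields are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContentUpdate:
    """Hypothetical stand-in for one piece of Rapid Response Content."""
    name: str
    fields: List[str]


def validate(update: ContentUpdate, expected_fields: int) -> bool:
    """Reject content whose field count differs from what the interpreter reads."""
    return len(update.fields) == expected_fields


def read_field(update: ContentUpdate, index: int) -> Optional[str]:
    """Bounds-checked read: return None instead of reading past the end."""
    if 0 <= index < len(update.fields):
        return update.fields[index]
    return None


def interpret(update: ContentUpdate, expected_fields: int) -> str:
    """Load an update, skipping it (failing closed) rather than crashing."""
    if not validate(update, expected_fields):
        return "rejected"
    values = [read_field(update, i) for i in range(expected_fields)]
    # Second line of defense in case validation has a bug of its own.
    if any(value is None for value in values):
        return "rejected"
    return "loaded"


good = ContentUpdate("good", ["a", "b", "c"])
bad = ContentUpdate("bad", ["a", "b"])  # one field short of what is expected

print(interpret(good, 3))
print(interpret(bad, 3))
```

The point of the sketch is the layering: the interpreter does not trust the validator and checks every index itself, so a validator bug alone cannot produce an out-of-bounds read.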
To prevent the incident from happening again, CrowdStrike promised to take several measures. The first is more thorough testing of Rapid Response Content, including local developer testing, content update and rollback testing, stress testing, stability testing and more. It will also add validation checks and improve error handling.
In addition, the company will begin using a phased deployment strategy for Rapid Response Content to avoid a repeat of the global outage. It will also give customers more control over the delivery of such content and publish release notes for updates.
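A phased deployment can be sketched in a few lines of Python. This is a hypothetical illustration of the general technique, not CrowdStrike's rollout system: the ring sizes in `RINGS`, the `max_crash_rate` threshold, the hash-based `bucket` assignment and the `crashed` health callback are all assumptions chosen for the example.

```python
import hashlib

# Hypothetical cumulative fractions of the fleet reached at each stage.
RINGS = [0.01, 0.10, 0.50, 1.00]


def bucket(host_id: str) -> float:
    """Map a host ID to a stable value in [0, 1) so ring membership is deterministic."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def hosts_in_ring(hosts, ring_index):
    """Return every host covered once the rollout has reached the given ring."""
    limit = RINGS[ring_index]
    return [h for h in hosts if bucket(h) < limit]


def rollout(hosts, crashed, max_crash_rate=0.001):
    """Update ring by ring and halt as soon as a batch crashes too often.

    `crashed` is a callable (host -> bool) standing in for post-update telemetry.
    """
    updated = set()
    for i in range(len(RINGS)):
        batch = [h for h in hosts_in_ring(hosts, i) if h not in updated]
        updated.update(batch)
        failures = sum(1 for h in batch if crashed(h))
        if batch and failures / len(batch) > max_crash_rate:
            return {"halted_at_ring": i, "updated": len(updated)}
    return {"halted_at_ring": None, "updated": len(updated)}


hosts = [f"host-{n}" for n in range(1000)]
healthy = rollout(hosts, crashed=lambda h: False)
faulty = rollout(hosts, crashed=lambda h: True)
print(healthy)
print(faulty)
```

With a healthy update the rollout reaches every host; with an update that crashes every machine it stops after the first non-empty batch, so only a small slice of the fleet is affected instead of all of it.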
However, some analysts and engineers feel that the company should have implemented such measures from the start. “CrowdStrike must have been aware that these updates are interpreted by drivers and can lead to problems,” engineer Florian Roth posted on X. “They should have implemented a phased deployment strategy for Rapid Response content from the start.”