Preventing future massive tech outages: U-M experts available to comment

July 19, 2024

EXPERTS ADVISORY

The widespread tech outage affecting airports, hospitals, news stations and more is resolving, but the full consequences of the software bug have yet to be tallied. The problems stemmed from an error in a CrowdStrike software update running on Windows computers.

University of Michigan experts can discuss how to prevent future bugs from causing disruption on this scale. Ang Chen and Ryan Huang, associate professors of computer science and engineering, co-lead an infrastructure resilience project aiming to produce better management software to govern various types of critical infrastructure.

Comments from Chen:

Ang Chen

“Societal infrastructures are interdependent, often in ways that go unnoticed. In this case, the interdependency manifests at the software level, and problems can quickly propagate across infrastructures. An important challenge for the computing community is to systematically understand and manage infrastructure interdependencies in software for resilience. This should be done with better software redesigns that make interdependencies explicit, but also with more advanced analysis algorithms for managing various components in the complex infrastructure system.”

Contact: [email protected]

Comments from Huang:

Ryan Huang

“OS software is gigantic, with tens of millions of lines of code, which makes it susceptible to bugs. With extensive efforts from both software vendors and researchers, OS reliability has improved quite a bit in the past decade, so we are seeing fewer errors like the blue screen of death. But this incident is a reminder that there is still work to do to make the OS more resilient.

“This incident was caused by a kernel driver rather than by core Windows itself. An interesting fact about OS software is that the core kernel code is only a small portion of the large codebase, while the majority of the OS code consists of all sorts of drivers. Since the drivers run with the same privilege as the kernel, a bug in one driver can crash the entire OS. These drivers are developed by various third-party vendors other than the OS vendor (like Microsoft). As a result, they may not go through the same comprehensive testing as the kernel, and they can be difficult to test. Studies have shown that many OS errors are caused by buggy drivers and that driver code has an error rate several times higher than the rest of the kernel.

“The culprit for this incident is not a typical hardware device driver but a driver (csagent.sys) for a cybersecurity program, CrowdStrike, which ironically is designed to protect Windows systems. The CrowdStrike driver needs to run deep within Windows to detect threats, but doing so means that bugs in it will lead to system-wide failures. Vendors who develop driver code need to invest in more comprehensive testing and other reliability methods such as automated bug detection, safe programming languages, and system verification. Another contributing factor to this catastrophic incident is the fact that the buggy update was rolled out very quickly. For critical software, haste often leads to mistakes. We need to anticipate that buggy updates can happen despite the best efforts to prevent them, so vendors should use deployment policies for gradual roll-out as well as solutions to quickly detect and roll back buggy updates.

“There exist more advanced techniques to address this general problem, such as running drivers in a less privileged mode, using separate fault domains to isolate the impact of buggy drivers, and minimizing the OS kernel. However, these techniques have limited use in practice because they incur performance costs. But when the stakes are high, we probably should consider paying the costs for a higher level of resilience.”

Contact: [email protected]
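
Huang's point about gradual roll-out paired with quick detection and rollback can be illustrated with a short sketch. The code below is not CrowdStrike's or Microsoft's deployment system; it is a minimal simulation under stated assumptions. The names staged_rollout, RolloutResult, apply_update, is_healthy and roll_back are hypothetical, the stage fractions are arbitrary, and a real pipeline would also wait between stages and gather telemetry over time.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    """Outcome of a staged rollout (hypothetical type for this sketch)."""
    completed: bool
    updated_hosts: List[str] = field(default_factory=list)
    rolled_back_hosts: List[str] = field(default_factory=list)
    failed_stage: Optional[int] = None


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
    max_failure_rate: float = 0.0,
) -> RolloutResult:
    """Update hosts in growing batches; stop and roll back on unhealthy hosts.

    stage_fractions are cumulative shares of the fleet, e.g. (0.01, 0.1, 1.0)
    means 1% of hosts first, then up to 10%, then everyone.
    """
    updated: List[str] = []
    start = 0
    for stage_index, fraction in enumerate(stage_fractions):
        # Each stage covers at least one new host and never exceeds the fleet.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        batch = list(hosts[start:end])
        if not batch:
            continue
        for host in batch:
            apply_update(host)
            updated.append(host)
        # Detection step: check the health of the batch that was just updated.
        failures = [host for host in batch if not is_healthy(host)]
        if len(failures) / len(batch) > max_failure_rate:
            # Rollback step: revert every host touched so far, newest first.
            for host in reversed(updated):
                roll_back(host)
            return RolloutResult(False, [], list(updated), stage_index)
        start = end
    return RolloutResult(True, updated, [], None)


# Demo with a simulated fleet and a simulated buggy update.
fleet = [f"host-{i}" for i in range(100)]
versions = {host: "v1" for host in fleet}

demo_result = staged_rollout(
    fleet,
    stage_fractions=(0.01, 0.1, 0.5, 1.0),
    apply_update=lambda host: versions.__setitem__(host, "v2-buggy"),
    is_healthy=lambda host: versions[host] != "v2-buggy",
    roll_back=lambda host: versions.__setitem__(host, "v1"),
)
```

In this simulation the buggy update reaches only the first stage (one host out of 100) before the health check fails and that host is reverted, so the remaining 99 hosts never receive it. The trade-off is the one Huang describes: a staged policy is slower than pushing an update to every machine at once.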