Author: Aaron Rinehart
“You keep using that word. I do not think it means what you think it means.”
In today's complex and ever-evolving cybersecurity landscape, resilience is an essential goal, but its true meaning is often misunderstood and the concept underused. In the cybersecurity industry, you are more likely to see “Cyber Resilience” used as a marketing buzzword designed to sell security products than as a term rooted in an engineering discipline such as Resilience Engineering. Traditionally, resilience in the context of cybersecurity has been viewed as a sequence of technical and procedural controls designed to prevent or recover from security incidents. However, this approach often falls short of addressing the fundamental complexity and dynamic nature of modern systems, both digital and human. More often than not, cybersecurity resembles an arcane art: a sequence of rituals aimed at checking boxes to satisfy regulatory or compliance standards rather than fostering true resilience.
To break free from this performative approach, we must adopt a broader view that treats security as a subset of resilience, as is argued in the context of Security Chaos Engineering (SCE). This shift involves moving beyond compliance and prevention, focusing instead on the system's ability to adapt, recover, and thrive in the face of disruptions.
This article explores how the conventional definition of resilience in cybersecurity, rooted in control-based security measures, differs from the deeper, more dynamic concept of resilience from the field of resilience engineering, shaped by experts like David Woods, Richard Cook, and Erik Hollnagel.
Conventional Cybersecurity: The Performative Approach to Resilience
In cybersecurity, resilience has traditionally been defined in reactive terms. The focus is on resisting attacks and recovering from disruptions as quickly as possible. Systems are expected to “bounce back” from a failure or breach, with minimal downtime or loss of data. This mindset aligns closely with compliance requirements and standards, such as those imposed by frameworks like NIST, ISO, and PCI-DSS, where organizations must implement specific controls to meet regulatory expectations.
The conventional resilience model emphasizes:
- Resistance to Attacks: Security controls such as firewalls, encryption, and access control mechanisms are designed to block or resist unauthorized access or disruptions.
- Recovery from Disruptions: Disaster recovery plans, business continuity strategies, and incident response mechanisms are put in place to ensure that the system can return to normal operation after an attack or failure.
- Compliance and Performative Security: Many cybersecurity efforts are aimed at meeting regulatory or industry standards, ensuring that organizations can “check the box” to prove they are doing what is required. This compliance-driven approach often focuses more on satisfying external auditors than on creating resilient systems that adapt to evolving threats.
Two practical, technical examples illustrate the limitations of this approach:
Example 1: Network Firewalls and DDoS Attacks: Many organizations rely on network firewalls and intrusion prevention systems (IPS) to detect and block Distributed Denial of Service (DDoS) attacks. A traditional firewall-based solution may block certain IPs based on thresholds, but what happens if the attack shifts its strategy, using IP spoofing or Slowloris techniques to evade these rules? Slowloris attacks are denial-of-service attacks that target web servers by holding open many simultaneous connections without completing them. By sending small, partial requests, the attacker slowly consumes the server's resources and eventually prevents the server from handling legitimate connections. A system designed to react to specific attack patterns can and will prove brittle when faced with unexpected tactics.
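To make the limitation in Example 1 concrete, here is a minimal Python sketch of a per-IP rate rule facing two kinds of traffic. Everything in it is a simplified assumption for illustration: the `Client` model, the `RATE_THRESHOLD` and `MAX_CONNECTIONS` values, and the traffic numbers are hypothetical and do not describe any particular firewall or web server.

```python
from dataclasses import dataclass


@dataclass
class Client:
    ip: str
    requests_per_second: float  # request rate the firewall rule can observe
    open_connections: int       # connections held open, completed or not


RATE_THRESHOLD = 100.0  # hypothetical rule: block IPs above this request rate
MAX_CONNECTIONS = 500   # hypothetical size of the server's connection pool


def blocked_by_rate_rule(client: Client) -> bool:
    """The only signal this rule looks at is request rate."""
    return client.requests_per_second > RATE_THRESHOLD


def connections_in_use(clients: list[Client]) -> int:
    """Connections held by clients that the rate rule lets through."""
    return sum(c.open_connections for c in clients if not blocked_by_rate_rule(c))


def pool_exhausted(clients: list[Client]) -> bool:
    return connections_in_use(clients) >= MAX_CONNECTIONS


# A noisy flood from one address: high rate, so the rule catches it.
flood = [Client("198.51.100.1", requests_per_second=5000.0, open_connections=50)]

# Slowloris-style clients: almost no request rate, many half-open connections.
slow_loris = [
    Client(f"203.0.113.{i}", requests_per_second=0.1, open_connections=60)
    for i in range(1, 11)
]

print("flood exhausts pool:", pool_exhausted(flood))
print("slow clients exhaust pool:", pool_exhausted(slow_loris))
```

In this toy model the flood is blocked and never touches the pool, while the slow clients stay under the rate threshold and still hold more connections than the server has available. The rule works for the pattern it was written for and fails for the one it was not.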
Example 2: Patch Management: Patching known vulnerabilities is essential for preventing exploits. However, compliance-focused patching efforts often occur on fixed schedules, weekly or monthly. In fast-moving software environments, zero-day vulnerabilities or misconfigurations can exist in production long before a patch is released or applied. Rigid patching schedules may fail to address the underlying complexity of modern infrastructures, leaving gaps in protection.
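The cost of a fixed schedule in Example 2 can be put in rough numbers. The sketch below is an illustration only: the function names, the 30-day cycle, and the dates are assumptions, and it optimistically assumes a patch exists on the day the vulnerability is disclosed. A zero-day with no patch available would be exposed for longer.

```python
from datetime import date, timedelta


def next_patch_window(disclosed: date, first_window: date, cycle_days: int) -> date:
    """Return the first scheduled patch window on or after the disclosure date."""
    if disclosed <= first_window:
        return first_window
    elapsed = (disclosed - first_window).days
    cycles = -(-elapsed // cycle_days)  # ceiling division
    return first_window + timedelta(days=cycles * cycle_days)


def exposure_days(disclosed: date, first_window: date, cycle_days: int) -> int:
    """Days a known vulnerability waits in production for the next window."""
    return (next_patch_window(disclosed, first_window, cycle_days) - disclosed).days


# Hypothetical schedule: a patch window every 30 days, starting 1 January 2024.
first_window = date(2024, 1, 1)

# Disclosed the day after a window: waits almost the full cycle.
print(exposure_days(date(2024, 1, 2), first_window, cycle_days=30))

# Disclosed on the day of a window: patched the same day.
print(exposure_days(date(2024, 1, 31), first_window, cycle_days=30))
```

Under these assumptions, the exposure depends entirely on where the disclosure falls in the cycle, not on how severe the vulnerability is. That is the gap a purely schedule-driven process leaves open.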
The problem here is that the traditional approach assumes systems will behave predictably under stress. But in reality, complex systems fail in unexpected ways. Let’s now consider how resilience engineering can shift this perspective.
Resilience Engineering: A Shift Toward Adaptation and Learning
Resilience engineering, a field influenced by experts like David Woods, Dr. Nancy Leveson, Richard Cook, and Erik Hollnagel, takes a broader and more dynamic view of resilience. Instead of focusing solely on preventing or recovering from failure, resilience engineering emphasizes the system's ability to adapt, learn, and continuously improve in the face of both anticipated and unanticipated disruptions.
Several key concepts from resilience engineering stand in contrast to the traditional cybersecurity approach:
- Adaptive Capacity: Resilience engineering defines resilience as the system’s ability to adjust its functioning in response to both expected and unexpected disruptions. David Woods emphasizes “adaptive capacity,” which refers to how well a system can flex under stress, absorbing shocks and adapting in real time.
- Embracing Complexity and Brittleness: Resilience engineering recognizes that complex systems are inherently brittle, prone to failure in ways that are difficult to predict or mitigate in advance. Instead of trying to eliminate brittleness, experts like Erik Hollnagel and Richard Cook focus on managing it through better understanding and adaptation.
- Continuous Learning and Adaptation: In resilience engineering, learning from both successes and failures is critical. Rather than waiting for catastrophic failure to learn about system weaknesses, resilience engineers encourage continuous experimentation to identify areas of brittleness or vulnerability.
- Proactive Experimentation: Security Chaos Engineering (SCE) applies chaos engineering principles to security. By running experiments that introduce stress or failure conditions in controlled ways, organizations can discover weaknesses in their systems before adversaries can exploit them.
Moving from Performative to Outcome-Driven Security
The traditional, compliance-driven approach to security often leads organizations to focus on performative work: implementing measures that may look good on paper but fail to truly protect the system from the unpredictability of real-world disruptions. This performative approach consumes valuable resources without necessarily improving the system's resilience.
By shifting the focus from security to resilience, organizations gain a superpower: the ability to allocate time, energy, and resources toward outcome-driven activities. This view allows teams to proactively enhance system resilience by observing how well the system adapts to real-world stressors. Security is no longer about meeting arbitrary requirements; instead, it's about continuously adapting the system to ensure it thrives, even in the face of uncertainty.
Security Chaos Engineering: A New Approach to Resilience
Security Chaos Engineering (SCE) embodies the principles of resilience engineering in the context of cybersecurity. It seeks to answer a critical question: How resilient are our systems to the conditions we are likely to face? Through experimentation, SCE uncovers unknown failure modes, misconfigurations, and hidden vulnerabilities that may otherwise go unnoticed until they cause major issues.
- Example: Cloud Security Posture Management (CSPM) Experimentation: In cloud environments, misconfigurations such as open S3 buckets or overly permissive IAM roles are common sources of security risks. By running Security Chaos Engineering experiments, such as temporarily altering access controls or deleting certain permissions, security teams can learn how effective their CSPM tools are at detecting and preventing these issues. The same experiments let teams assess how quickly permissions are reverted and the security posture restored without human intervention.
- Example: Disabling System Security Mechanisms: Disable a security mechanism (e.g., SELinux, AppArmor, endpoint protection) in a controlled way. Do you detect when security mechanisms are disabled? How quickly do you restore them?
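The two examples above share a common shape: inject a condition, observe whether it is detected, and always restore the original state. A minimal Python sketch of that loop follows. It is an illustration under stated assumptions, not a real tool: `run_experiment`, `ExperimentResult`, the in-memory `host` dictionary, and the monitor that polls every third tick are all hypothetical stand-ins, and nothing here touches an actual SELinux, AppArmor, or cloud configuration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentResult:
    detected: bool
    ticks_to_detect: Optional[int]
    restored: bool


def run_experiment(
    inject: Callable[[], None],
    detect: Callable[[int], bool],
    rollback: Callable[[], None],
    verify_restored: Callable[[], bool],
    max_ticks: int = 10,
) -> ExperimentResult:
    """Inject a condition, wait for detection, and always roll back."""
    inject()
    ticks_to_detect = None
    try:
        for tick in range(1, max_ticks + 1):
            if detect(tick):
                ticks_to_detect = tick
                break
    finally:
        rollback()  # restore state even if detection raises an error
    return ExperimentResult(
        detected=ticks_to_detect is not None,
        ticks_to_detect=ticks_to_detect,
        restored=verify_restored(),
    )


# Simulated host state standing in for a real security mechanism.
host = {"selinux_enforcing": True}


def disable_selinux() -> None:
    host["selinux_enforcing"] = False


def enable_selinux() -> None:
    host["selinux_enforcing"] = True


def monitor_sees_disabled(tick: int) -> bool:
    # Hypothetical monitor that only polls the host every third tick.
    return tick % 3 == 0 and not host["selinux_enforcing"]


result = run_experiment(
    inject=disable_selinux,
    detect=monitor_sees_disabled,
    rollback=enable_selinux,
    verify_restored=lambda: host["selinux_enforcing"],
)
print(result)
```

The result answers the two questions the experiment asks: whether the change was detected and how long that took, and whether the mechanism was restored afterward. In a real experiment, the injection and rollback would act on a scoped, non-critical target, and the detection step would query the team's actual monitoring.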
The Future of Security Lies in Resilience
Resilience matters not only in cybersecurity but in all complex systems, especially those involving both human and machine interactions. Traditional cybersecurity practices, rooted in compliance and rigid control-based approaches, are no longer sufficient to protect organizations from the growing complexities and uncertainties of modern threats.
Resilience engineering offers a new perspective, one that treats failure as an opportunity to learn, adapt, and improve. By adopting principles from resilience engineering and applying them through Security Chaos Engineering, organizations can build systems that are not only more secure but more adaptable and capable of thriving in an uncertain world.
As cybersecurity evolves, adopting a resilience-based approach that emphasizes continuous learning, adaptation, and proactive experimentation will allow organizations to transition from a reactive posture to one that anticipates and mitigates potential threats before they lead to business-impacting failures. This shift from performative security to outcome-driven resilience will define the future of cybersecurity.