Engineering Resilient Systems Towards Antifragility


Resilience is generally understood as the capacity of systems, subsystems, or their individual elements to recover from disturbances and continue to function successfully. The term is also used for the capability not just to recover from disturbances but to benefit from them: to evolve and become better suited to performing the same tasks, or to adapt to handle additional ones. This phenomenon is analyzed in the thought-provoking book “Antifragile: Things That Gain from Disorder” (Nassim Nicholas Taleb, 2012). The book explains why the antonym of fragile is ‘antifragile’ (Taleb’s neologism, coined because no precise antonym existed) and not resilient, elastic, strong, hardened, rigid, robust, solid, unbreakable, or “something-proof”. This article uses the broader meaning of the term resilience, one that includes antifragility.
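
Taleb frames fragility and antifragility in terms of how a system's response curves with the size of a shock. A convex response gains more from favorable variation than it loses from unfavorable variation, so on average it benefits from volatility (Jensen's inequality). The minimal sketch below compares a hypothetical concave ("fragile") and a hypothetical convex ("antifragile") response under the same symmetric shocks. Both response functions are illustrative assumptions, not models of any real system.

```python
from statistics import mean

def fragile_response(shock: float) -> float:
    """Hypothetical concave response: larger shocks hurt disproportionately."""
    return -shock ** 2

def antifragile_response(shock: float) -> float:
    """Hypothetical convex response: larger shocks help disproportionately."""
    return shock ** 2

def gain_from_volatility(response, shocks) -> float:
    """Average response under variable shocks minus the response to the average shock.

    Positive: the system gains from disorder (convex response).
    Negative: the system is harmed by disorder (concave response).
    """
    return mean(response(s) for s in shocks) - response(mean(shocks))

shocks = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric disturbances averaging zero
print(gain_from_volatility(fragile_response, shocks))      # -2.0
print(gain_from_volatility(antifragile_response, shocks))  # 2.0
```

Both systems see the same average disturbance (zero); only the shape of their response decides whether variability helps or harms them.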

In the past, it was not understood that a system could become stronger or better when exposed to stress (i.e., be antifragile). However, developments and standardization over the past decades in microelectronics, materials science, and communication technologies, along with advances in software development tools and techniques, have made the engineering of antifragile systems possible.

Traditional design techniques use predetermined design criteria, whereas antifragile systems are built to also take unknowns into account. The ever-increasing rate of knowledge accumulation and technological advancement shows up as an increasing frequency of life-changing technological improvements. At the same time, how these technologies will be used, the risks and environments they will face, and how they will develop in the future are becoming increasingly unpredictable (e.g., electricity, DNA, the Internet).

In the face of certain yet unpredictable changes, two mutually exclusive engineering approaches are available:

Using traditional engineering methods to build hardened systems whose life cycles may nevertheless become progressively shorter because they cannot adjust to a changed environment or to the need for higher performance; or
Acknowledging that both the environment and the functional requirements will change over time, and accommodating the unknown by intentionally building innovative, flexible systems that can evolve when needed.

Technologies that allow flexibility and future-proofing (antifragility) criteria to be included in a system’s functional requirements (design criteria) have been around for at least two decades and are considered proven (e.g., microprocessor-based hardware, object-oriented software, standardized sensors, routable protocols, fiber-optic communication). Since it is therefore possible to build new systems, or upgrade existing ones, so that they can evolve more easily in the future with minimal downtime, the engineering of contemporary systems should include a full set of universally applicable resilience measures. The following engineering and organizational concepts, measures, and features should be specified and applied to systems that are expected to be resilient:

1. Thorough Understanding of a System’s Functional Requirements

A system’s functional requirements can often be satisfied far better through a thorough, open, and informed brainstorming process with all stakeholders. Experience shows that thinking outside the box and taking a diverse, multidisciplinary approach (fresh perspectives on goals and issues from different angles, problem-solving skills from outside the established line of engineering thought, etc.), without assuming that design methods and solutions are known in advance even for traditional and routine engineering tasks, fosters creativity. The result is simpler, lower-cost, yet superior, modern, and future-proof systems and solutions, developed faster and with less risk.

2. Defining Failure Criteria

Designing a system with “no single point of failure” and “fail-safe” criteria for its components is essential for its future resilience. In addition, the evaluation of the effects (and cost) of redundancy options should be based on the system’s actual functional requirements and should include all of the following:

the use of diverse sub-components to prevent common-mode failures;
possible swarming strategies (allowing a certain percentage of components to fail without affecting the operational outcome);
evaluation of hardening vs. strategic sacrifice of individual components or subsystems (partial reduction of service);
determination of an acceptable level of temporary system degradation and of the maximum recovery time, taking into account user and public safety and other relevant issues; and
the volume and distribution of spare parts, etc.
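
A swarming strategy can be quantified with the standard k-out-of-n redundancy formula. If the system keeps operating while at least k of its n identical components work, and each component is available with probability p, the system availability is A = sum over i from k to n of C(n, i) * p^i * (1 - p)^(n - i). The minimal sketch below assumes statistically independent failures, which is what diverse sub-components are meant to approximate; common-mode failures violate that assumption and make the real figure worse. The availability values used are illustrative, not measured data.

```python
from math import comb

def k_of_n_availability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n components are up.

    Assumes identical components with individual availability p and
    statistically independent failures (no common-mode failures).
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# A single unit vs. a 2-of-3 arrangement of the same units.
print(round(k_of_n_availability(1, 1, 0.99), 6))   # 0.99
print(round(k_of_n_availability(2, 3, 0.99), 6))   # 0.999702
# A swarm of less reliable units that tolerates 20% of them failing.
print(round(k_of_n_availability(8, 10, 0.90), 6))  # 0.929809
```

The same function helps with the hardening vs. sacrificing trade-off: it shows how much availability is bought by adding units versus making each unit more reliable.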

3. Modularity and Standardization

Another feature that should be incorporated into a resilient system is ensuring that performance issues or failures remain localized and manageable. To achieve this, subsystems and individual components must be functionally segregated, and their internal operation, external interfaces, and communication methods must be well defined, documented, and standardized.
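
In software, one way to express functionally segregated subsystems with well-defined, standardized interfaces is to let the rest of the system depend only on an interface and never on a particular vendor's implementation. The minimal sketch below uses Python's typing.Protocol; the interface FlowSensor, its read_flow method, and the two vendor classes are hypothetical names chosen for illustration.

```python
from typing import Protocol

class FlowSensor(Protocol):
    """Hypothetical standardized interface: any compliant sensor can be plugged in."""

    def read_flow(self) -> float:
        """Return the current flow in liters per second."""
        ...

class VendorASensor:
    """Hypothetical vendor A unit that natively reports liters per second."""

    def read_flow(self) -> float:
        return 12.5  # placeholder for a real hardware reading

class VendorBSensor:
    """Hypothetical vendor B unit that natively reports US gallons per minute."""

    def __init__(self, gallons_per_minute: float) -> None:
        self._gpm = gallons_per_minute

    def read_flow(self) -> float:
        # Convert to the unit the standardized interface promises (L/s).
        return self._gpm * 3.785411784 / 60.0

def total_flow(sensors: list[FlowSensor]) -> float:
    """Consumer code depends only on the interface, not on any vendor."""
    return sum(s.read_flow() for s in sensors)

print(total_flow([VendorASensor(), VendorBSensor(60.0)]))
```

Because both classes satisfy the same interface (checked structurally by a static type checker), either can be replaced by a different compliant product without touching total_flow or any other consumer.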

Functional, performance-based specifications that rule out proprietary methods and components are necessary to lay the very foundation of a system’s flexibility. Gradual modification of a system is easier if it is modular and consists of standardized, interoperable, commercial off-the-shelf (COTS) components. Such a system’s fault tolerance, and hence its resilience, also improves, and redundancy can be achieved with fewer additional components when those components are intelligently organized to take over the role of a failed one.
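
Achieving redundancy with fewer additional components corresponds to shared-standby (N+1) redundancy: because standardized units are interchangeable, one spare can cover any of N active positions instead of each position needing its own dedicated backup. The minimal sketch below illustrates the idea; the StandbyPool class, the slot names, and the unit IDs are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StandbyPool:
    """Hypothetical N+M arrangement: M interchangeable spares cover N active slots."""

    active: dict[str, str]  # slot name -> unit ID currently serving it
    spares: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)

    def report_failure(self, slot: str) -> str:
        """Replace the failed unit in `slot` with a spare and return the spare's ID.

        Raises RuntimeError when no spare is left, i.e. the system must fall
        back to its planned degraded mode of operation.
        """
        if not self.spares:
            raise RuntimeError(f"no spare available for {slot}; degrade service")
        self.failed.append(self.active[slot])
        replacement = self.spares.pop(0)
        self.active[slot] = replacement
        return replacement

pool = StandbyPool(
    active={"pump-1": "U1", "pump-2": "U2", "pump-3": "U3"},
    spares=["S1"],  # one shared spare instead of three dedicated backups
)
print(pool.report_failure("pump-2"))  # S1
print(pool.active)  # {'pump-1': 'U1', 'pump-2': 'S1', 'pump-3': 'U3'}
```

A second failure before the first unit is repaired raises an error, which is the point where the acceptable degradation level and maximum recovery time defined under the failure criteria take over.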

4. Active Continuous Monitoring

Resilience is also supported by continuous monitoring of a system’s state, including active self-diagnostics down to the lowest replaceable component level. This allows early identification, timely intervention, and accurate management of potential issues and emerging problems.
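
Self-diagnostics down to the lowest replaceable component level means a fault report should name the specific unit a technician would swap, not just the subsystem that misbehaves. The minimal sketch below models the system as a component tree whose leaves (the replaceable units) carry their own self-tests; the monitor walks the tree and returns the full path of each failing unit. The component names and the simulated self-test results are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Component:
    name: str
    children: list["Component"] = field(default_factory=list)
    # Leaves (lowest replaceable units) carry a self-test returning True when healthy.
    self_test: Optional[Callable[[], bool]] = None

def failing_units(node: Component, path: tuple[str, ...] = ()) -> list[str]:
    """Return the full path of every leaf whose self-test fails."""
    here = path + (node.name,)
    if not node.children:
        # Assumption: a leaf without a self-test is treated as healthy here;
        # a real monitor might instead flag it as unmonitored.
        healthy = node.self_test() if node.self_test else True
        return [] if healthy else ["/".join(here)]
    faults: list[str] = []
    for child in node.children:
        faults.extend(failing_units(child, here))
    return faults

station = Component("station", [
    Component("cabinet-A", [
        Component("psu-1", self_test=lambda: True),
        Component("io-card-3", self_test=lambda: False),  # simulated fault
    ]),
    Component("cabinet-B", [
        Component("psu-1", self_test=lambda: True),
    ]),
])
print(failing_units(station))  # ['station/cabinet-A/io-card-3']
```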

Accurate feedback about the system’s state, data mining, trending, asset management, and intelligent preventive maintenance, when combined with built-in modularity and standardization, bring one important additional benefit with each identified issue: an opportunity to re-evaluate the system and improve it accordingly. In other words, whatever causes one component to break, as long as it does not break the system, should also contribute to the system’s overall resilience (antifragility), provided the system is capable of evolving. Even if a disruptive event breaks a particular system, it can still help improve the resilience of similar systems (e.g., in aircraft manufacturing and maintenance).

Building state-of-the-art systems that achieve resilience through well-thought-out design provides direct and indirect benefits to designers and stakeholders alike, as well as to the general public. These benefits include greater system capability, reliability, safety, and quality, together with reduced development and operational risks and costs.

While new infrastructure projects are good candidates for implementing novel features such as resilience, repairing and upgrading older infrastructure often presents opportunities for dramatic qualitative improvement. And when a legacy system actually does “get broken” (e.g., by Hurricane Sandy), vast technical possibilities open up for introducing far superior systems (easier implementation, simplification and consolidation, advanced and flexible features, etc.).

With its great engineering legacy and knowledge of modern technologies in this time of accelerating technological change, WSP | Parsons Brinckerhoff is able to provide expert assistance to our global infrastructure clients and offer them resilient, state-of-the-art, economical systems and solutions that will serve them well for the foreseeable future.

References

Taleb, Nassim Nicholas. Antifragile: Things That Gain from Disorder. Random House, 2012. ISBN 9781400067824. http://www.amazon.com/Antifragile-Things-That-Disorder-Incerto/dp/0812979680
Jones, Kennie H. “Engineering Antifragile Systems: A Change in Design Philosophy.” International Workshop: From Dependable to Resilient, from Resilient to Antifragile Ambients and Systems (ANTIFRAGILE 2014), 1st, Hasselt, Belgium, 2–5 June 2014. NF1676L-18615. http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20140010075.pdf
De Florio, Vincenzo. “Antifragility = Elasticity + Resilience + Machine Learning, Models and Algorithms for Open System Fidelity.” arXiv:1401.4862, 2014. http://arxiv.org/pdf/1401.4862v1.pdf
