In the avionics domain, "ultra-reliability" refers to the practice of ensuring
quantifiably negligible residual failure rates in the presence of transient and
permanent hardware faults. If autonomous Cyber-Physical Systems (CPS) in other
domains, e.g., autonomous vehicles, drones, and industrial automation systems,
are to permeate our everyday life in the not-so-distant future, then they also
need to become ultra-reliable. However, the rigorous reliability engineering
and analysis practices used in the avionics domain are expensive and
time-consuming, and cannot be transferred to most other CPS domains. The increasing
adoption of faster and cheaper, but less reliable, Commercial Off-The-Shelf
(COTS) hardware is also an impediment in this regard.
Motivated by the goal of ultra-reliable CPS, this dissertation shows how to
soundly analyze the reliability of COTS-based implementations of actively
replicated Networked Control Systems (NCSs), which are key building blocks of
modern CPS, in the presence of transient hardware faults. When an NCS is
deployed over field buses such as the Controller Area Network (CAN), transient
faults are known to cause host crashes, network retransmissions, and incorrect
computations. In addition, when an NCS is deployed over point-to-point networks
such as Ethernet, even Byzantine errors (i.e., inconsistent broadcast
transmissions) are possible. The analyses proposed in this dissertation account
for NCS failures due to each of these error categories, and consider NCS
failures in both time and value domains. The analyses are also provably free of
reliability anomalies. Such anomalies are problematic because they can result
in unsound failure rate estimates, which might lead us to believe that a system
is safer than it actually is.
Specifically, this dissertation makes four main contributions. (1) To reduce
the failure rate of NCSs in the presence of Byzantine errors, we present a hard
real-time design of a Byzantine Fault Tolerance (BFT) protocol for
Ethernet-based systems. (2) We then propose a quantitative reliability analysis
of the presented design in the presence of transient faults. (3) Next, we
propose a similar analysis to upper-bound the failure probability of an
actively replicated CAN-based NCS. (4) Finally, to upper-bound the long-term
failure rate of the NCS more accurately, we propose analyses that take into
account the temporal robustness properties of an NCS expressed as weakly-hard
constraints.
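
To make the notion of a weakly-hard constraint concrete, the following minimal sketch checks one common form, assumed here for illustration: an (m, k) constraint requiring at least m successful control-loop iterations in every window of k consecutive iterations. The function name `satisfies_m_k` and the boolean trace encoding are hypothetical and are not taken from the dissertation.

```python
from typing import Sequence


def satisfies_m_k(outcomes: Sequence[bool], m: int, k: int) -> bool:
    """Check an assumed (m, k) weakly-hard constraint on a trace.

    outcomes[i] is True if iteration i succeeded. The constraint holds if
    every window of k consecutive iterations contains at least m successes.
    """
    if k <= 0 or not 0 <= m <= k:
        raise ValueError("require k > 0 and 0 <= m <= k")
    if len(outcomes) < k:
        return True  # no complete window exists, so nothing can be violated

    # Count successes in the first window, then slide it one step at a time.
    hits = sum(1 for ok in outcomes[:k] if ok)
    if hits < m:
        return False
    for i in range(k, len(outcomes)):
        hits += int(bool(outcomes[i])) - int(bool(outcomes[i - k]))
        if hits < m:
            return False
    return True
```

For example, a trace in which every third iteration fails satisfies a (2, 3) constraint, whereas two consecutive failures inside any three-iteration window violate it.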
By design, our analyses can be applied in the context of full-system analyses.
For instance, to certify a system consisting of multiple actively replicated
NCSs deployed over a BFT atomic broadcast layer, the upper bounds on the
failure rates of each NCS and the atomic broadcast layer can be composed using
the sum-of-failure-rates model.
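
The composition described above can be sketched as follows. This is an illustrative example only: it assumes that each component comes with an upper bound on its failure rate in a common unit (failures per hour here) and that any component failure counts as a system failure, so that the sum of the bounds is a conservative system-level bound. The function name, component names, and numeric values are hypothetical.

```python
import math
from typing import Mapping


def compose_failure_rate_bounds(bounds: Mapping[str, float]) -> float:
    """Sum per-component failure-rate upper bounds into a system-level bound.

    All values must use the same unit (assumed: failures per hour).
    """
    for name, rate in bounds.items():
        if rate < 0 or math.isnan(rate):
            raise ValueError(f"invalid failure-rate bound for {name!r}: {rate}")
    # fsum avoids accumulating rounding error when adding very small rates.
    return math.fsum(bounds.values())


# Hypothetical bounds for two replicated NCSs and the atomic broadcast layer.
example_bounds = {
    "ncs_a": 1e-9,
    "ncs_b": 2e-9,
    "atomic_broadcast": 5e-10,
}
system_bound = compose_failure_rate_bounds(example_bounds)
```

Because each input is itself an upper bound, the resulting sum remains an upper bound under the stated assumption; it may be pessimistic when component failures overlap.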