Dejan Kostic
EPFL
21 Nov 2008, 2:00 pm - 3:00 pm
Saarbrücken building E1 5, room rotunda 6th floor
SWS Colloquium
Distributed systems form the foundation of our society's infrastructure.
Complex distributed protocols and algorithms
are used in enterprise storage systems, distributed databases, large-scale
planetary systems, and sensor networks.
Errors in these protocols translate to denial of service to some clients,
potential loss of data, and even monetary
losses. Unfortunately, it is notoriously difficult to develop reliable
high-performance distributed systems that run
over asynchronous networks, such as the Internet. Even if a distributed system
is based on a well-understood
distributed algorithm, ...
Distributed systems form the foundation of our society's infrastructure.
Complex distributed protocols and algorithms
are used in enterprise storage systems, distributed databases, large-scale
planetary systems, and sensor networks.
Errors in these protocols translate to denial of service to some clients,
potential loss of data, and even monetary
losses. Unfortunately, it is notoriously difficult to develop reliable
high-performance distributed systems that run
over asynchronous networks, such as the Internet. Even if a distributed system
is based on a well-understood
distributed algorithm, its implementation can contain coding bugs and errors
arising from complexities of realistic
distributed environments.
This talk describes CrystalBall, a new approach for developing and deploying distributed systems. In CrystalBall, nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation. Using CrystalBall, we identified new bugs in mature Mace implementations of a random overlay tree, BulletPrime content distribution system, and the Chord distributed hash table. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at run-time, with low false negative rates.
Read more
This talk describes CrystalBall, a new approach for developing and deploying distributed systems. In CrystalBall, nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation. Using CrystalBall, we identified new bugs in mature Mace implementations of a random overlay tree, BulletPrime content distribution system, and the Chord distributed hash table. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at run-time, with low false negative rates.