Improving Network Failure Detection and Recovery with Programmable Data Planes

Costa Molero, Edgar

doi:10.3929/ethz-b-000690095

Download

Full text (PDF, 5.060Mb)

Open access

Author

Costa Molero, Edgar

Date

2024

Type

Doctoral Thesis

ETH Bibliography

yes

Rights / license

In Copyright - Non-Commercial Use Permitted

Abstract

Since its creation, the Internet has grown exponentially in size and use cases, becoming an integral part of our society. Its seamless operation is often taken for granted; we only recognize its importance when disruptions occur. The current Internet’s complexity and scale make it prone to all sorts of failures, with each minute of downtime costing companies millions of dollars and damaging their reputation. In this thesis, we address the critical need for rapid detection and recovery mechanisms for network failures. We expand beyond conventional hard failures to explore and address the issue of gray failures in ISP networks, a subtle and poorly understood issue for which operators lack effective solutions. By leveraging advances in programmable data planes, we develop two systems to detect, localize, and recover from network failures. First, we introduce FANcY, a novel system to detect and localize gray failures in ISP networks. FANcY utilizes programmable switches to implement a reliable synchronization and counting protocol, enabling precise packet loss detection. FANcY adapts to the limited memory capacity of modern switches with a hybrid approach: dedicated counters for high-priority traffic and a probabilistic data structure for best-effort traffic. This design ensures efficient monitoring under various conditions and future-proofs the system against constantly increasing traffic volumes. We demonstrate FANcY’s capability for sub-second gray failure detection and reaction through extensive simulations and a prototype running on Intel Tofino switches. Second, we present our work on hardware-accelerated network control planes. This research extends beyond detection, demonstrating that programmable data planes can run critical control plane functions traditionally implemented in software. 
Our working prototype efficiently runs a diverse set of such tasks in the data plane, including detecting hard, gray, and remote failures; notifying other devices; executing distributed path-vector computations that adhere to shortest-path and BGP-like policies; and rapidly updating forwarding states to restore connectivity after failures. Finally, our work identifies challenges in expressiveness and scalability for programmable data planes, emphasizing that the careful selection of tasks for offloading remains a critical area for future research.

Permanent link

https://doi.org/10.3929/ethz-b-000690095

Publication status

published

Contributors

Examiner: Vanbever, Laurent
Examiner: Vissicchio, Stefano
Examiner: Yu, Minlan

Publisher

ETH Zurich

Subject

Computer networks; Failure Detection and Recovery; Programmable data planes; Hardware acceleration; Hardware offloading

Organisational unit

09477 - Vanbever, Laurent / Vanbever, Laurent

