Improving Network Failure Detection and Recovery with Programmable Data Planes
dc.contributor.author
Costa Molero, Edgar
dc.contributor.supervisor
Vanbever, Laurent
dc.contributor.supervisor
Vissicchio, Stefano
dc.contributor.supervisor
Yu, Minlan
dc.date.accessioned
2024-08-22T06:26:07Z
dc.date.available
2024-08-21T15:22:52Z
dc.date.available
2024-08-22T06:26:07Z
dc.date.issued
2024
dc.identifier.uri
http://hdl.handle.net/20.500.11850/690095
dc.identifier.doi
10.3929/ethz-b-000690095
dc.description.abstract
Since its creation, the Internet has grown exponentially in size and use cases, becoming an integral part of our society. Its seamless operation is often taken for granted; we only recognize its importance when disruptions occur. The current Internet’s complexity and scale make it prone to all sorts of failures, with each minute of downtime costing companies millions of dollars and damaging their reputation.
In this thesis, we address the critical need for rapid detection and recovery mechanisms for network failures. We expand beyond conventional hard failures to explore and address gray failures in ISP networks, a subtle and poorly understood problem for which operators lack effective solutions. By leveraging advances in programmable data planes, we develop two systems to detect, localize, and recover from network failures.
First, we introduce FANcY, a novel system to detect and localize gray failures in ISP networks. FANcY utilizes programmable switches to implement a reliable synchronization and counting protocol, enabling precise packet loss detection. FANcY adapts to the limited memory capacity of modern switches with a hybrid approach: dedicated counters for high-priority traffic and a probabilistic data structure for best-effort traffic. This design ensures efficient monitoring under various conditions and future-proofs the system against constantly increasing traffic volumes. We demonstrate FANcY’s capability for sub-second gray failure detection and reaction through extensive simulations and a prototype running on Intel Tofino switches.
Second, we present our work on hardware-accelerated network control planes. This research extends beyond detection, demonstrating that programmable data planes can run critical control plane functions traditionally implemented in software. Our working prototype efficiently runs a diverse set of such tasks in the data plane, including detecting hard, gray, and remote failures; notifying other devices; executing distributed path-vector computations that adhere to shortest-path and BGP-like policies; and rapidly updating forwarding state to restore connectivity after failures. Finally, our work identifies challenges in expressiveness and scalability for programmable data planes, emphasizing that the careful selection of tasks for offloading remains a critical area for future research.
en_US
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.subject
Computer networks
en_US
dc.subject
Failure Detection and Recovery
en_US
dc.subject
Programmable data planes
en_US
dc.subject
Hardware acceleration
en_US
dc.subject
Hardware offloading
en_US
dc.title
Improving Network Failure Detection and Recovery with Programmable Data Planes
en_US
dc.type
Doctoral Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2024-08-22
ethz.size
168 p.
en_US
ethz.code.ddc
DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science
en_US
ethz.identifier.diss
30252
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02640 - Inst. f. Technische Informatik und Komm. / Computer Eng. and Networks Lab.::09477 - Vanbever, Laurent / Vanbever, Laurent
en_US
ethz.date.deposited
2024-08-21T15:22:53Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.identifier.internal
TIK-Schriftenreihe-Nr. 212
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2024-08-22T06:26:09Z
ethz.rosetta.lastUpdated
2024-08-22T06:26:09Z
ethz.rosetta.versionExported
true