Open access
Author
Date
2022
Type
Master Thesis
ETH Bibliography
yes
Abstract
Machine learning model training is costly and time-consuming. According
to recent research, the bottleneck of the training process often lies in the input
data processing stages. tf.data.service, as well as its extension Cachew,
addresses this problem by disaggregating the input pipeline from the
model and moving it onto the cloud. This successfully removes the input
pipeline bottleneck. However, such an approach introduces extra
cost for the cloud service and fails to fully utilize the compute resources
on the host machine. In this thesis, we discuss two approaches to
this problem, utilizing local workers and pipeline splitting, and propose
a final policy integrating both of them to minimize the extra cost while
keeping the input pipeline fast enough not to become the bottleneck again.
The policy is implemented on top of tf.data v2.8 and evaluated on different
input pipelines, achieving a 9% to 26% cost saving compared to Cachew's
autoscaling policy.
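For context, the sketch below (not taken from the thesis) illustrates the basic tf.data.service usage pattern the abstract refers to: a dispatcher and a worker run the input pipeline as a service, and the training client routes its dataset through them via distribute(). The in-process servers and the toy pipeline are illustrative placeholders; in the disaggregated setting the workers would run on separate cloud machines, or, for the local-worker idea, on the training host itself.

```python
import tensorflow as tf

# Start an in-process dispatcher and worker for illustration only.
# In the disaggregated setup described above, these would run on
# separate (cloud) machines rather than inside the training process.
dispatcher = tf.data.experimental.service.DispatchServer()
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

# A toy input pipeline; the map() stage is what the service executes
# on its workers instead of on the training host.
dataset = tf.data.Dataset.range(100).map(lambda x: x * 2)

# Route the pipeline through tf.data.service.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",
        service=dispatcher.target))

for element in dataset.take(5):
    print(element.numpy())
```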
Permanent link
https://doi.org/10.3929/ethz-b-000563201
Publication status
published
Contributors
Examiner: Klimovic, Ana
Publisher
ETH Zurich, Department of Computer Science, Systems Group
Organisational unit
09683 - Klimovic, Ana / Klimovic, Ana