Building Data-Centric Systems for Machine Learning Development and Operations
dc.contributor.author
Renggli, Cedric
dc.contributor.supervisor
Zhang, Ce
dc.contributor.supervisor
Alonso, Gustavo
dc.contributor.supervisor
Zou, James
dc.contributor.supervisor
Schelter, Sebastian
dc.date.accessioned
2022-06-01T12:54:47Z
dc.date.available
2022-06-01T12:13:55Z
dc.date.available
2022-06-01T12:54:47Z
dc.date.issued
2022
dc.identifier.uri
http://hdl.handle.net/20.500.11850/550162
dc.identifier.doi
10.3929/ethz-b-000550162
dc.description.abstract
Developing machine learning (ML) models can be seen as a process similar to the one established for traditional software development. In recent years, practitioners have started adopting well-established concepts from classical software engineering in ML projects. One prominent example is MLOps, where tools and techniques from development and operations (DevOps) are transferred to ML to shorten the system development life cycle and provide continuous delivery into production while ensuring high-quality artifacts.
A key difference between classical software development and ML development lies in the strong dependence of ML models on the data used to train them or to evaluate their performance. Many DevOps tools and best practices therefore cannot be applied directly to ML workloads, or could lower quality if adopted blindly. In this thesis, we provide three novel data-centric solutions that support data scientists and ML engineers in handling ML workloads in a principled and efficient way:

1. We present Ease.ML/Snoopy, designed to perform a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from data quality issues in the datasets used to train or evaluate ML model artifacts. We design a practical Bayes error estimator by aggregating a collection of 1NN-based estimators over publicly available pre-trained feature transformations (sketched below). Furthermore, by incorporating our systematic feasibility study, together with additional signals, into the iterative label cleaning process, we demonstrate in end-to-end experiments how users can save substantial labeling time and monetary cost.

2. We propose SHiFT, the first task-aware (i.e., taking the dataset into account), flexible, and efficient model search engine for transfer learning, which can be seen as a data- and compute-efficient alternative to training models from scratch, in analogy to code reuse in classical software engineering. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of tasks. As these repositories keep growing rapidly, efficiently selecting a good model for the task at hand becomes paramount. By carefully comparing various selection and search strategies, we find that no single method outperforms the others and that hybrid or mixed strategies can be beneficial (a task-aware strategy is sketched below). The flexibility and efficiency of SHiFT are enabled by a custom query language, SHiFT-QL, together with a cost-based decision maker. Motivated by the iterative nature of ML development, we further support efficient incremental execution of our queries, which requires a careful implementation when combined with our optimizations.

3. We introduce Ease.ML/CI, the first continuous integration system for machine learning with statistical guarantees. The challenge in building Ease.ML/CI is to provide these rigorous guarantees (e.g., a single-accuracy-point error tolerance with 0.999 reliability) at a practical labeling cost (e.g., 2000 labels per test; illustrated below). We design a declarative scripting language that allows users to specify integration conditions with reliability constraints, and we develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions commonly used in real production systems.
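To make the first contribution concrete, here is a minimal, hedged sketch of a Snoopy-style BER estimate, not the thesis' actual implementation: it converts cross-validated 1-NN error rates, computed on top of several pre-trained feature transformations, into BER estimates via the classical Cover-Hart inequality and aggregates them by taking the minimum. The embed_fns feature extractors are hypothetical placeholders.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def cover_hart_lower_bound(knn_error, num_classes):
        # BER lower bound implied by an (asymptotic) 1-NN error rate,
        # following Cover & Hart (1967); max() guards against negative roots.
        c = num_classes
        return ((c - 1.0) / c) * (
            1.0 - np.sqrt(max(0.0, 1.0 - c / (c - 1.0) * knn_error)))

    def estimate_ber(X_raw, y, embed_fns, num_classes):
        # Aggregate 1-NN estimates across feature spaces; better
        # representations yield smaller (tighter) estimates, so keep the minimum.
        estimates = []
        for embed in embed_fns:  # hypothetical pre-trained feature extractors
            X = embed(X_raw)
            err = 1.0 - cross_val_score(
                KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean()
            estimates.append(cover_hart_lower_bound(err, num_classes))
        return min(estimates)

If even this optimistic estimate exceeds the error budget of the target application, the task is likely infeasible on the current data, and effort is better spent on the data (e.g., label cleaning) than on model training.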
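For the second contribution, the following sketch illustrates one kind of task-aware strategy that a search engine like SHiFT can choose among; it is not the actual SHiFT implementation or SHiFT-QL. Candidate models are ranked by the accuracy of a cheap k-NN probe trained on their embeddings of a small labeled sample of the target dataset; the candidate_models mapping from names to feature extractors is a hypothetical interface.

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def rank_models_task_aware(candidate_models, X_sample, y_sample, k=5):
        # Score each candidate with a cheap k-NN probe on its embeddings of a
        # small labeled sample, avoiding a full fine-tuning run per model.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_sample, y_sample, test_size=0.3, random_state=0)
        scores = {}
        for name, embed in candidate_models.items():  # hypothetical extractors
            probe = KNeighborsClassifier(n_neighbors=k).fit(embed(X_tr), y_tr)
            scores[name] = probe.score(embed(X_val), y_val)  # transfer proxy
        return sorted(scores, key=scores.get, reverse=True)

A cost-based decision maker in the spirit of SHiFT would weigh such a proxy against cheaper task-agnostic signals and fine-tune only the few top-ranked models.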
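Finally, a small illustration of why the label budget in the third contribution is the hard part. Using only the standard Hoeffding bound, rather than Ease.ML/CI's optimizations, estimating an accuracy to within a single accuracy point (eps = 0.01) with 0.999 reliability (delta = 0.001) already requires tens of thousands of fresh test labels:

    import math

    def hoeffding_labels(eps, delta):
        # Smallest n such that 2 * exp(-2 * n * eps**2) <= delta, i.e. the
        # i.i.d. test-set size that bounds the accuracy-estimation error by
        # eps with probability at least 1 - delta.
        return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

    print(hoeffding_labels(eps=0.01, delta=0.001))  # 38005 labels for one test

Closing the gap between this naive requirement and the roughly 2000 labels per test quoted above is what the system's optimizations target.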
en_US
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.subject
machine learning
en_US
dc.subject
data management system
en_US
dc.subject
data-centric engineering
en_US
dc.title
Building Data-Centric Systems for Machine Learning Development and Operations
en_US
dc.type
Doctoral Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2022-06-01
ethz.size
206 p.
en_US
ethz.code.ddc
DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science
en_US
ethz.identifier.diss
28434
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02663 - Institut für Computing Platforms / Institute for Computing Platforms::09588 - Zhang, Ce (ehemalig) / Zhang, Ce (former)
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02663 - Institut für Computing Platforms / Institute for Computing Platforms::09588 - Zhang, Ce (ehemalig) / Zhang, Ce (former)
en_US
ethz.date.deposited
2022-06-01T12:14:02Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2022-06-01T12:54:55Z
ethz.rosetta.lastUpdated
2024-02-02T17:21:11Z
ethz.rosetta.versionExported
true