Building Data-Centric Systems for Machine Learning Development and Operations

Renggli, Cedric

doi:10.3929/ethz-b-000550162

dc.contributor.author

Renggli, Cedric

dc.contributor.supervisor

Zhang, Ce

dc.contributor.supervisor

Alonso, Gustavo

dc.contributor.supervisor

Zou, James

dc.contributor.supervisor

Schelter, Sebastian

dc.date.accessioned

2022-06-01T12:54:47Z

dc.date.available

2022-06-01T12:13:55Z

dc.date.available

2022-06-01T12:54:47Z

dc.date.issued

2022

dc.identifier.uri

http://hdl.handle.net/20.500.11850/550162

dc.identifier.doi

10.3929/ethz-b-000550162

dc.description.abstract

Developing machine learning (ML) models can be seen as a process similar to the one established for traditional software development. Over the last few years, practitioners have started applying well-established concepts from classical software engineering to ML projects. One prominent example is MLOps, where tools and techniques from development and operations (DevOps) are transferred to ML to shorten the system development life cycle and provide continuous delivery into production while ensuring high quality of the artifacts. A key difference between classical software and ML development lies in the strong dependence of ML models on the data used to train them or evaluate their performance. Therefore, many DevOps tools and best practices cannot be directly applied to ML workloads, or could lead to lower quality if adopted blindly. In this thesis, we provide three novel data-centric solutions that support data scientists and ML engineers in handling ML workloads in a principled and efficient way:

1. We present Ease.ML/Snoopy, designed to perform a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from data quality issues in the datasets used to train or evaluate ML model artifacts. We design a practical Bayes error estimator by aggregating over a collection of 1NN-based estimators built on publicly available pre-trained feature transformations. Furthermore, by including our systematic feasibility study with additional signals in the iterative label cleaning process, we demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary cost.

2. We propose SHiFT, the first task-aware (i.e., taking the dataset into account), flexible, and efficient model search engine for transfer learning, which can be seen as a data- and compute-efficient alternative to training models from scratch, in analogy to code reuse in classical software engineering. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. By carefully comparing various selection and search strategies, we find that no single method outperforms the others, and that hybrid or mixed strategies can be beneficial. The flexibility and efficiency of SHiFT are enabled by a custom query language, SHiFT-QL, together with a cost-based decision maker. Motivated by the iterative nature of machine learning development, we further support efficient incremental execution of our queries, which requires a careful implementation when used jointly with our optimizations.

3. We introduce Ease.ML/CI, the first continuous integration system for machine learning with statistical guarantees. The challenge of building Ease.ML/CI is to provide these rigorous guarantees (e.g., an error tolerance of a single accuracy point with 0.999 reliability) with a practical amount of labeling effort (e.g., 2000 labels per test). We design a declarative scripting language that allows users to specify integration conditions with reliability constraints, and develop simple, novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions commonly used in real production systems.

en_US
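
To give a rough sense of the labeling cost behind the reliability example in the abstract, the sketch below evaluates the textbook two-sided Hoeffding bound for one non-adaptive accuracy test. This is an illustrative baseline only, not the estimator or the optimizations developed in the thesis; the function name `hoeffding_labels` and its parameters are hypothetical.

```python
import math


def hoeffding_labels(epsilon: float, delta: float) -> int:
    """Number of i.i.d. labeled test examples so that the measured accuracy
    is within `epsilon` of the true accuracy with probability >= 1 - delta.

    Uses the standard two-sided Hoeffding inequality for a single,
    non-adaptive test: n >= ln(2 / delta) / (2 * epsilon^2).
    Assumption: this is a generic baseline, not the thesis's method.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # One accuracy point of tolerance (0.01) at 0.999 reliability (delta = 0.001).
    print(hoeffding_labels(epsilon=0.01, delta=0.001))
```

Under these assumptions, a single accuracy point of tolerance at 0.999 reliability requires 38005 labels for one test, which illustrates why reducing the label count matters for a practical continuous integration workflow.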

dc.format

application/pdf

en_US

dc.language.iso

en

en_US

dc.publisher

ETH Zurich

en_US

dc.rights.uri

http://rightsstatements.org/page/InC-NC/1.0/

dc.subject

machine learning

en_US

dc.subject

Data management system

en_US

dc.subject

Data-centric engineering

en_US

dc.title

Building Data-Centric Systems for Machine Learning Development and Operations

en_US

dc.type

Doctoral Thesis

dc.rights.license

In Copyright - Non-Commercial Use Permitted

dc.date.published

2022-06-01

ethz.size

206 p.

en_US

ethz.code.ddc

DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science

en_US

ethz.identifier.diss

28434

en_US

ethz.publication.place

Zurich

en_US

ethz.publication.status

published

en_US

ethz.leitzahl

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02663 - Institut für Computing Platforms / Institute for Computing Platforms::09588 - Zhang, Ce (ehemalig) / Zhang, Ce (former)

en_US

ethz.leitzahl.certified

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02663 - Institut für Computing Platforms / Institute for Computing Platforms::09588 - Zhang, Ce (ehemalig) / Zhang, Ce (former)

en_US

ethz.date.deposited

2022-06-01T12:14:02Z

ethz.source

FORM

ethz.eth

yes

en_US

ethz.availability

Open access

en_US

ethz.rosetta.installDate

2022-06-01T12:54:55Z

ethz.rosetta.lastUpdated

2024-02-02T17:21:11Z

ethz.rosetta.versionExported

true

ethz.COinS

ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Building%20Data-Centric%20Systems%20for%20Machine%20Learning%20Development%20and%20Operations&rft.date=2022&rft.au=Renggli,%20Cedric&rft.genre=unknown&rft.btitle=Building%20Data-Centric%20Systems%20for%20Machine%20Learning%20Development%20and%20Operations

Files in this item

Name:: Thesis_Cedric_Renggli.pdf
Size:: 9.063Mb
Format:: Adobe PDF
Label:: Full text
