Building Data-Centric Systems for Machine Learning Development and Operations

Renggli, Cedric

doi:10.3929/ethz-b-000550162

Download

Full text (PDF, 9.063Mb)

Open access

Author

Renggli, Cedric

Date

2022

Type

Doctoral Thesis

ETH Bibliography

yes

Rights / license

In Copyright - Non-Commercial Use Permitted

Abstract

Developing machine learning (ML) models can be seen as a process similar to the one established for traditional software development. In recent years, practitioners have started applying well-established concepts from classical software engineering to ML projects. One prominent example is MLOps, where tools and techniques from development and operations (DevOps) are transferred to ML to shorten the system development life cycle and provide continuous delivery into production while ensuring high quality of the artifacts. A key difference between classical software and ML development lies in the strong dependence of ML models on the data used to train them or evaluate their performance. Therefore, many DevOps tools and best practices cannot be applied directly to ML workloads, or can lower quality if adopted blindly. In this thesis, we provide three novel data-centric solutions that help data scientists and ML engineers handle ML workloads in a principled and efficient way:

1. We present Ease.ML/Snoopy, designed to perform a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from data quality issues in the datasets used to train or evaluate ML model artifacts. We design a practical Bayes error estimator by aggregating over a collection of 1NN-based estimators computed on publicly available pre-trained feature transformations. Furthermore, by combining our systematic feasibility study with additional signals in the iterative label cleaning process, we demonstrate in end-to-end experiments how users can save substantial labeling time and cost.

2. We propose SHiFT, the first task-aware (i.e., taking the dataset into account), flexible, and efficient model search engine for transfer learning, which can be seen as a data- and compute-efficient alternative to training models from scratch, in analogy to code reuse in classical software engineering. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. By carefully comparing various selection and search strategies, we find that no single method outperforms the others, and that hybrid or mixed strategies can be beneficial. The flexibility and efficiency of SHiFT are enabled by a custom query language, SHiFT-QL, together with a cost-based decision maker. Motivated by the iterative nature of machine learning development, we further support efficient incremental execution of our queries, which requires a careful implementation when used jointly with our optimizations.

3. We introduce Ease.ML/CI, the first continuous integration system for machine learning with statistical guarantees. The challenge of building Ease.ML/CI is to provide these rigorous guarantees (e.g., single accuracy point error tolerance with 0.999 reliability) with a practical amount of labeling effort (e.g., 2000 labels per test). We design a declarative scripting language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions commonly used in real production systems.
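To give a sense of why the labeling effort mentioned for Ease.ML/CI is a challenge, the following is a minimal sketch of a textbook baseline, not the thesis's own estimator. It assumes that "single accuracy point error tolerance" means an absolute tolerance of 0.01 and that "0.999 reliability" means a failure probability of 0.001, and it applies the standard two-sided Hoeffding inequality to a single accuracy estimate on i.i.d. labeled test samples. The function name `hoeffding_labels` is hypothetical.

```python
import math


def hoeffding_labels(epsilon: float, delta: float) -> int:
    """Number of i.i.d. labeled test samples n such that the empirical
    accuracy is within +/- epsilon of the true accuracy with probability
    at least 1 - delta, using the two-sided Hoeffding bound:

        P(|acc_hat - acc| > epsilon) <= 2 * exp(-2 * n * epsilon**2)

    Solving 2 * exp(-2 * n * epsilon**2) <= delta for n gives
        n >= ln(2 / delta) / (2 * epsilon**2).
    """
    if not (0.0 < epsilon < 1.0):
        raise ValueError("epsilon must be in (0, 1)")
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must be in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# Assumed reading of the abstract's example: tolerance 0.01, reliability 0.999.
baseline = hoeffding_labels(epsilon=0.01, delta=0.001)
print(baseline)  # 38005
```

Under these assumptions the naive bound asks for roughly 38,000 labels for one test, well above the 2000 labels per test that the abstract gives as an example of a practical amount, which illustrates why optimizations that reduce the label requirement matter.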

Permanent link

https://doi.org/10.3929/ethz-b-000550162

Publication status

published

Contributors

Examiner: Zhang, Ce
Examiner: Alonso, Gustavo
Examiner: Zou, James
Examiner: Schelter, Sebastian

Publisher

ETH Zurich

Subject

machine learning; Data management system; Data-centric engineering

Organisational unit

09588 - Zhang, Ce (ehemalig) / Zhang, Ce (former)
