Advancing Algorithms and Applications for Data Valuation in Machine Learning
Open access
Author
Date
2023
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
"How much is my data worth?" is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits among multiple data contributors or determining prospective compensation when data breaches happen. This dissertation takes a first step toward data valuation by presenting a principled framework utilizing the Shapley value, a popular notion of value that originated in cooperative game theory.
First, we show that the Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value. However, the Shapley value often requires exponential time to compute. To meet this challenge, we propose efficient algorithms for approximating the Shapley value with provable error bounds for general machine learning (ML) utilities. In addition to its theoretical robustness, the Shapley value aligns with people’s intuitive understanding of data value, as our empirical findings indicate.
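The approximation idea can be illustrated with a minimal sketch of permutation-sampling Monte Carlo estimation, one standard way to approximate Shapley values. It is not necessarily the algorithm developed in the thesis, and it carries no error bound by itself. The function name `monte_carlo_shapley`, the `utility` callback, and all parameters are hypothetical names chosen for illustration.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Estimate the Shapley value of each data point by sampling permutations.

    `utility` is an assumed callback that maps a collection of data point
    indices to a score, for example the validation accuracy of a model
    trained on exactly those points. It must not modify its argument.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_points
    indices = list(range(n_points))

    for _ in range(n_permutations):
        rng.shuffle(indices)
        coalition: List[int] = []
        previous = utility(coalition)
        # Add points one at a time and credit each with its marginal gain.
        for i in indices:
            coalition.append(i)
            current = utility(coalition)
            totals[i] += current - previous
            previous = current

    # Average the marginal contributions over all sampled permutations.
    return [total / n_permutations for total in totals]
```

Each sampled permutation costs `n_points` utility evaluations, so the total cost grows linearly in the number of permutations rather than exponentially in the number of data points. The price is sampling error, which shrinks as more permutations are drawn.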
Second, we present a family of efficient algorithms for computing the exact Shapley values for KNN classification and regression. We demonstrate that both the exact algorithm and the approximate algorithm for KNN Shapley can scale to millions of data points, making them suitable for valuing data in common ML datasets.
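For intuition on why KNN admits exact computation, the sketch below implements a closed-form recursion known from the literature on unweighted KNN classification with a single test point. It assumes the utility of a training subset is the number of its (at most K) nearest neighbors whose label matches the test label, divided by K. That utility definition, the restriction 1 <= k <= n, and all names are assumptions for illustration and are not taken from this abstract.

```python
from typing import Hashable, List, Sequence


def knn_shapley_single_test(
    distances: Sequence[float],
    train_labels: Sequence[Hashable],
    test_label: Hashable,
    k: int,
) -> List[float]:
    """Exact Shapley values of training points for one test point.

    Assumes an unweighted KNN classifier and the subset utility described
    in the lead-in. `distances[i]` is the distance from training point i
    to the test point.
    """
    n = len(distances)
    if len(train_labels) != n:
        raise ValueError("distances and train_labels must have equal length")
    if not 1 <= k <= n:
        raise ValueError("this sketch assumes 1 <= k <= n")

    # Sort training indices from nearest to farthest (ties broken by index).
    order = sorted(range(n), key=lambda i: distances[i])
    match = [1.0 if train_labels[i] == test_label else 0.0 for i in order]

    sorted_values = [0.0] * n
    # Base case: the farthest point.
    sorted_values[n - 1] = match[n - 1] / n
    # Walk from the second-farthest point toward the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the point at this position
        sorted_values[pos] = (
            sorted_values[pos + 1]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )

    # Map values back to the original training point order.
    values = [0.0] * n
    for pos, i in enumerate(order):
        values[i] = sorted_values[pos]
    return values
```

The cost is dominated by the sort, so a single test point takes O(n log n) time instead of enumerating all subsets. Under the same assumptions, values for a test set could be obtained by averaging the per-test-point results.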
Lastly, we explore the practical challenges that data marketplaces face, focusing on two main concerns: training machine learning models on private data and curating specialized and complex datasets. To study and address these challenges, we demonstrate a decentralized design of a marketplace for private data and incentivize the creation of a real-world ecological dataset benchmark.
Permanent link
https://doi.org/10.3929/ethz-b-000625405
Publication status
published
Contributors
Examiner: Zhang, Ce
Examiner: Alonso, Gustavo
Examiner: Rekatsinas, Theodoros
Examiner: Interlandi, Matteo
Publisher
ETH Zurich
Subject
Data valuation; Machine Learning; Shapley Value; Game Theory
Organisational unit
09588 - Zhang, Ce (former)