Advancing Algorithms and Applications for Data Valuation in Machine Learning
Open access
Author
Date
2023
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
"How much is my data worth?" is an increasingly common question posed by organizations and individuals alike. An answer to this question could, for instance, allow profits to be distributed fairly among multiple data contributors or help determine prospective compensation when data breaches happen. This dissertation takes a first step toward data valuation by presenting a principled framework utilizing the Shapley value, a popular notion of value that originated in cooperative game theory.
First, we show that the Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value. However, the Shapley value often requires exponential time to compute. To meet this challenge, we propose efficient algorithms for approximating the Shapley value with provable error bounds for general machine learning (ML) utilities. Alongside its theoretical robustness, our empirical findings indicate that the Shapley value aligns with people’s intuitive understanding of data value.
Second, we present a family of efficient algorithms for computing the exact Shapley values for KNN classification and regression. We demonstrate that both the exact algorithm and the approximate algorithm for KNN Shapley can scale to millions of data points, making them suitable for valuing data in common ML datasets.
Lastly, we explore the practical challenges that data marketplaces face, focusing on two main concerns: training machine learning models on private data and curating specialized and complex datasets. To study and address these challenges, we demonstrate a decentralized design of a marketplace for private data and incentivize the creation of a real-world ecological dataset benchmark.
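As an illustration of why approximation is needed, the Shapley value of a data point is its marginal contribution to the utility, averaged over all orderings of the data set, and the number of orderings grows factorially. The sketch below shows generic Monte Carlo permutation sampling, a standard textbook estimator; it is not necessarily the algorithm developed in the thesis. The names `monte_carlo_shapley`, `toy_weights`, and `toy_utility` are hypothetical, and the additive toy utility stands in for a real ML utility such as validation accuracy.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled permutations of the data indices."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    indices = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        coalition: List[int] = []
        prev = utility(coalition)  # utility of the empty set
        for i in indices:
            coalition.append(i)
            curr = utility(coalition)
            values[i] += curr - prev  # marginal contribution of point i
            prev = curr
    return [v / n_permutations for v in values]


# Hypothetical toy utility: each point adds a fixed amount, so the
# Shapley value of a point equals its own weight.
toy_weights = [1.0, 2.0, 3.0]


def toy_utility(coalition: Sequence[int]) -> float:
    return sum(toy_weights[i] for i in coalition)


estimates = monte_carlo_shapley(len(toy_weights), toy_utility, n_permutations=200)
print(estimates)
```

With a real ML utility, every call to `utility` would retrain and evaluate a model on the chosen subset, which is why the number of sampled permutations dominates the cost.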
Permanent link
https://doi.org/10.3929/ethz-b-000625405
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Zhang, Ce
Examiner: Alonso, Gustavo
Examiner: Rekatsinas, Theodoros
Examiner: Interlandi, Matteo
Publisher
ETH Zurich
Subject
Data valuation; Machine Learning; Shapley Value; Game Theory
Organisational unit
09588 - Zhang, Ce (ehemalig) / Zhang, Ce (former)