Learning with biased data: invariant representations and target labels

Kehrenberg, Thomas Maximilian

thesis

posted on 2023-06-10, 00:53 authored by Thomas Maximilian Kehrenberg

Biased data poses a significant challenge to the proper functioning of machine learning models and undermines the trustworthiness of deployed models. These biases are usually introduced by the data generation process: the data is collected from non-representative samples or is the result of biased processes. However, these data deficiencies can be very expensive or even impossible to fix, which makes it desirable to solve the problem on the algorithmic end. In this work, I consider two different forms of data bias, labelling bias and sampling bias, both investigated under the framework of algorithmic fairness and evaluated using common fairness metrics. Labelling bias here refers to a systematic bias, correlated with a sensitive attribute, which causes the labels in the dataset to differ from the “true” labels, whereas sampling bias means that samples are missing from the training set in a systematic way but are still present in the setting where the model is intended to be deployed. Both biases will make a naively trained model fail to generalize. I present three approaches to tackling this problem, each relying on some form of additional knowledge about the data. The first approach, dealing with labelling bias, is based on implicit, probabilistic target labels which satisfy certain given statistics. These target labels can be used to train any likelihood-based model. The second approach deals with strong spurious correlations in the training data, which can be seen as a specific form of sampling bias. A bias-free, partially labelled context set is used to learn an interpretable representation of the data which is invariant to the spurious correlation and can be assessed qualitatively. The third approach deals with less extreme cases of sampling bias, but relaxes the assumption of having labels in the context set by learning an invariant representation via distribution matching.

History

File Version

Published version

Pages

160

Department affiliated with

Engineering and Design Theses

Qualification level

doctoral

Qualification name

PhD

Language

eng

Institution

University of Sussex

Full text available

Yes

Legacy Posted Date

2021-09-08
