University of Sussex

Learning with biased data: invariant representations and target labels

Posted on 2023-06-10, 00:53, authored by Thomas Maximilian Kehrenberg
Biased data represents a significant challenge for the proper functioning of machine learning models and undermines the trustworthiness of deployed models. These biases are usually introduced by the data generation process, i.e., data is collected from non-representative samples or is the result of biased processes. However, these data deficiencies can be very expensive or even impossible to fix, which makes it desirable to solve the problem on the algorithmic end.

In this work, I consider two different forms of data bias: labelling bias and sampling bias, both investigated under the framework of algorithmic fairness and evaluated using common fairness metrics. Labelling bias here refers to a systematic bias, correlated with a sensitive attribute, which causes the labels in the dataset to differ from the “true” labels; sampling bias indicates that samples are missing from the training set in a systematic way, but are still present in the setting where the model is intended to be deployed. Both biases will make a naively trained model fail to generalize.

I present three approaches to tackling this problem, each relying on some form of additional knowledge about the data. The first approach, dealing with labelling bias, is based on implicit, probabilistic target labels which satisfy certain given statistics. These target labels can be used to train any likelihood-based model. The second approach deals with strong spurious correlations in the training data, which can be seen as a specific form of sampling bias. A bias-free, partially-labelled context set is used to learn an interpretable representation of the data which is invariant to the spurious correlation and can be assessed qualitatively. The third approach deals with less extreme cases of sampling bias, but relaxes the assumption of having labels in the context set, by learning an invariant representation via distribution matching.
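To illustrate the kind of fairness metric the abstract refers to, the sketch below computes the demographic parity difference — the gap in positive-prediction rates between the two groups defined by a binary sensitive attribute. This is a generic example of a common fairness metric, not code from the thesis itself; the function name and toy data are illustrative.

```python
import numpy as np

def demographic_parity_difference(y_pred, s):
    """Absolute gap in positive-prediction rates between the two groups
    defined by the binary sensitive attribute `s` (0 or 1).

    A value of 0 means the classifier satisfies demographic parity.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    s = np.asarray(s)
    rate_group0 = y_pred[s == 0].mean()  # positive rate in group s=0
    rate_group1 = y_pred[s == 1].mean()  # positive rate in group s=1
    return abs(rate_group0 - rate_group1)

# Toy example: the classifier flags 25% of group 0 but 75% of group 1.
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_difference(y_pred, s))  # → 0.5
```

A model trained on biased labels or a biased sample can score well on held-out accuracy while showing a large gap on a metric like this, which is why such metrics are used to evaluate the approaches described above.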


File Version

  • Published version



Department affiliated with

  • Engineering and Design Theses

Qualification level

  • doctoral

Qualification name

  • phd

Language

  • eng



Full text available

  • Yes
