Tuning Fairness by Balancing Target Labels

The issue of fairness in machine learning models has recently attracted a lot of attention, as ensuring fairness is necessary for the continued confidence of the general public in the deployment of machine learning systems. We focus on mitigating the harm incurred by a biased machine learning system that offers better outputs (e.g., loans, job interviews) for certain groups than for others. We show that bias in the output can naturally be controlled in probabilistic models by introducing a latent target output. This formulation has several advantages: first, it is a unified framework for several notions of group fairness such as Demographic Parity and Equality of Opportunity; second, it is expressed as a marginalization instead of a constrained problem; and third, it allows the encoding of our knowledge of what unbiased outputs should be. Practically, the second advantage allows us to avoid unstable constrained optimization procedures and to reuse off-the-shelf toolboxes. The third translates to the ability to control the level of fairness by directly varying fairness target rates. In contrast, existing approaches rely on intermediate, arguably unintuitive, control parameters such as covariance thresholds.


INTRODUCTION
Algorithmic assessment methods are used for predicting human outcomes in areas such as financial services, recruitment, crime and justice, and local government. This contributes, in theory, to a world with decreasing human biases. To achieve this, however, we need fair machine learning models that take biased datasets, but output non-discriminatory decisions to people with differing protected attributes such as gender and marital status. Datasets can be biased because of, for example, sampling bias, subjective bias of individuals, and institutionalized biases (Olteanu et al., 2019;Tolan, 2019). Uncontrolled bias in the data can translate into bias in machine learning models.
There is no single accepted definition of algorithmic fairness for automated decision-making but several have been proposed. One definition is referred to as statistical or demographic parity. Given a binary protected attribute (e.g., married/unmarried) and a binary decision (e.g., yes/no to getting a loan), demographic parity requires equal positive rates (PR) across the two sensitive groups (married and unmarried individuals should be equally likely to receive a loan). Another fairness criterion, equalized odds (Hardt et al., 2016), takes into account the binary decision, and instead of equal PR requires equal true positive rates (TPR) and false positive rates (FPR). This criterion is intended to be more compatible with the goal of building accurate predictors or achieving high utility (Hardt et al., 2016). We discuss the suitability of the different fairness criteria in the discussion section at the end of the paper.
However, existing approaches to balancing accuracy and fairness rely on intermediate, unintuitive control parameters such as an allowable constraint violation ε (e.g., 0.01) in Agarwal et al. (2018), or a covariance threshold c (e.g., 0, which is in turn controlled by two further parameters τ and µ, e.g., 0.005 and 1.2, to trade off this threshold and accuracy) in Zafar et al. (2017a). This is related to the fact that many of these approaches embed fairness criteria as constraints in the optimization procedure (Quadrianto and Sharmanska, 2017; Zafar et al., 2017a,b; Donini et al., 2018).
In contrast, we provide a probabilistic classification framework with bias-controlling mechanisms that can be tuned based on positive rates (PR), an intuitive parameter, thus giving humans the control to set the rate of positive predictions (e.g., a PR of 0.6). Our framework is based on the concept of a balanced dataset and introduces latent target labels, which, instead of the provided labels, are now the training labels of our classifier. We prove bounds on how far the target labels diverge from the dataset labels. We instantiate our approach with a parametric logistic regression classifier and a Bayesian nonparametric Gaussian process classifier (GPC). As our formulation is not expressed as a constrained problem, we can draw upon advancements in automated variational inference (Krauth et al., 2016; Gardner et al., 2018) for learning the fair model, and for handling large amounts of data.
The method presented in this paper is closely related to a number of previous works, e.g., Calders and Verwer, 2010;Kamiran and Calders, 2012. Proper comparison with them requires knowledge of our approach. We will thus explain our approach in the subsequent sections, and defer detailed comparisons to section 4.

TARGET LABELS FOR TUNING GROUP FAIRNESS
We will start by describing several notions of group fairness. For each individual, we have a vector of non-sensitive attributes x ∈ X, a class label y ∈ Y, and a sensitive attribute s ∈ S (e.g., racial origin or gender). We focus on the case where s and y are binary. We assume that a positive label y = 1 corresponds to a positive outcome for an individual, for example, being accepted for a loan. Group fairness balances a certain condition between groups of individuals with different sensitive attributes, s vs. s′. The term ŷ below is the prediction of a machine learning model that, in most works, uses only non-sensitive attributes x. Several group fairness criteria have been proposed (e.g., Hardt et al., 2016; Chouldechova, 2017; Zafar et al., 2017a):
Equality of positive rate (Demographic Parity):
$$\Pr(\hat{y} = 1 \mid s) = \Pr(\hat{y} = 1 \mid s'), \tag{1}$$
Equality of accuracy:
$$\Pr(\hat{y} = y \mid s) = \Pr(\hat{y} = y \mid s'), \tag{2}$$
Equality of true positive rate (Equality of Opportunity):
$$\Pr(\hat{y} = 1 \mid s, y = 1) = \Pr(\hat{y} = 1 \mid s', y = 1). \tag{3}$$
The equalized odds criterion corresponds to Equality of Opportunity (3) plus equality of false positive rate.
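The quantities compared by these criteria can be estimated directly from predictions. The following is a minimal sketch, assuming binary NumPy arrays for labels, predictions, and the sensitive attribute; the function name `group_rates` and the toy data are illustrative and not part of the paper.

```python
import numpy as np

def group_rates(y_true, y_pred, s):
    """Per-group positive rate, accuracy and true positive rate."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    rates = {}
    for group in (0, 1):
        mask = s == group
        positives = mask & (y_true == 1)
        rates[group] = {
            "PR": y_pred[mask].mean(),                     # Pr(y_hat = 1 | s)
            "ACC": (y_pred[mask] == y_true[mask]).mean(),  # Pr(y_hat = y | s)
            "TPR": y_pred[positives].mean(),               # Pr(y_hat = 1 | s, y = 1)
        }
    return rates

# Toy data (hypothetical): four individuals per group.
y_true = np.array([1, 0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
rates = group_rates(y_true, y_pred, s)
```

In this toy example the two groups have the same accuracy, yet they differ in positive rate and true positive rate, which illustrates that the criteria are not interchangeable.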
The Bayes-optimal classifier only satisfies these criteria if the training data itself satisfies them. That is, in order for the Bayes-optimal classifier to satisfy demographic parity, the following must hold: P(y = 1|s) = P(y = 1|s′), where y is the training label. We call a dataset for which P(y, s) = P(y)P(s) holds a balanced dataset. Given a balanced dataset, a Bayes-optimal classifier learns to satisfy demographic parity and an approximately Bayes-optimal classifier should learn to satisfy it at least approximately. Here, we motivated the importance of balanced datasets via the demographic parity criterion, but it is also important for equality of opportunity, which we discuss in section 2.1.
In general, however, our given dataset is likely to be imbalanced. There are two common solutions to this problem: either pre-process or massage the dataset to make it balanced, or constrain the classifier to give fair predictions despite it having been trained on an unbalanced dataset. Our approach takes parts from both solutions.
An imbalanced dataset can be turned into a balanced dataset by either changing the class labels y or the sensitive attributes s. In the use cases that we are interested in, s is considered an integral part of the input, representing trustworthy information and thus should not be changed. y, conversely, is often not completely trustworthy; it is not an integral part of the sample but merely an observed outcome. In a hiring dataset, for instance, y might represent the hiring decision, which can be biased, and not the relevant question of whether someone makes a good employee.
Thus, we introduce new target labels ȳ such that the dataset is balanced: P(ȳ, s) = P(ȳ)P(s). The idea is that these target labels still contain as much information as possible about the task, while also forming a balanced dataset. This introduces the concept of the accuracy-fairness trade-off: in order to be completely accurate with respect to the original (not completely trustworthy) class labels y, we would require ȳ = y, but then the fairness constraints would not be satisfied.
Let η_s(x) = P(y = 1|x, s) denote the distribution of y in the data. The target distribution η̄_s(x) = P(ȳ = 1|x, s) is then given by
$$\bar{\eta}_s(x) = P(\bar{y} = 1 \mid y = 1, s) \cdot \eta_s(x) + P(\bar{y} = 1 \mid y = 0, s) \cdot (1 - \eta_s(x)) \tag{4}$$
due to the marginalization rules of probabilities. The conditional probability P(ȳ|y, s) indicates with which probability we want to keep the class label. This probability could in principle depend on x, which would enable the realization of individual fairness. The dependence on x has to be prior knowledge as it cannot be learned from the data. This prior knowledge can encode the semantics that "similar individuals should be treated similarly" (Dwork et al., 2012), or that "less qualified individuals should not be preferentially favored over more qualified individuals" (Joseph et al., 2016). Existing proposals for guaranteeing individual fairness require strong assumptions, such as the availability of an agreed-upon similarity metric, or knowledge of the underlying data generating process. In contrast, in group fairness, we partition individuals into protected groups based on some sensitive attribute s and ask that some statistics of a classifier be approximately equalized across those groups (see Equations 1-3). In this case, P(ȳ|y, s) does not depend on x.
Returning to Equation (4), we can simplify it with
$$m_s := P(\bar{y} = 1 \mid y = 1, s) + P(\bar{y} = 0 \mid y = 0, s) - 1 \tag{5}$$
$$b_s := P(\bar{y} = 1 \mid y = 0, s) \tag{6}$$
arriving at
$$\bar{\eta}_s(x) = m_s \cdot \eta_s(x) + b_s. \tag{7}$$
m_s and b_s are chosen such that P(ȳ, s) = P(ȳ)P(s). This can be interpreted as shifting the decision boundary depending on s so that the new distribution is balanced.
As there is some freedom in choosing m_s and b_s, it is important to consider what the effect of different values is. The following theorem provides this (the proof can be found in the Supplementary Material):
Theorem 1. The probability that y and ȳ disagree (y ≠ ȳ) for any input x in the dataset is given by:
$$P(y \neq \bar{y} \mid s) = P\left(\left|\eta_s(x) - \tfrac{1}{2}\right| < t_s\right), \quad \text{where } t_s = \left|\frac{m_s + 2 b_s - 1}{2 m_s}\right|. \tag{8}$$
Thus, if the threshold t_s is small, then only if there are inputs very close to the decision boundary (η_s(x) close to 1/2) would we have ȳ ≠ y. t_s determines the accuracy penalty that we have to accept in order to gain fairness. The value of t_s can be taken into account when choosing m_s and b_s (see section 3). If η_s satisfies the Tsybakov condition (Tsybakov et al., 2004), then we can give an upper bound for the probability.

Definition 1.
A distribution η satisfies the Tsybakov condition if there exist C > 0, λ > 0 and t_0 ∈ (0, 1/2] such that for all t ≤ t_0,
$$P\left(\left|\eta(x) - \tfrac{1}{2}\right| < t\right) \leq C t^{\lambda}. \tag{9}$$
This condition bounds the region close to the decision boundary. It is a property of the dataset.
Corollary 1.1. If η(x, s) = P(y = 1|x, s) satisfies the Tsybakov condition in x, with constants C and λ, then the probability that y and ȳ disagree (y ≠ ȳ) for any input x in the dataset is bounded by:
$$P(y \neq \bar{y} \mid s) \leq C t_s^{\lambda}. \tag{10}$$
Section 3 discusses how to choose the parameters for η̄ in order to make it balanced.
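The role of the threshold in Theorem 1 can be checked numerically. The sketch below assumes that labels are obtained by thresholding η and η̄ at 1/2, takes t_s as the absolute value of (m_s + 2b_s − 1)/(2m_s), and uses hypothetical values P(ȳ = 1|y = 1, s) = 1.0 and P(ȳ = 1|y = 0, s) = 0.2; the function names are illustrative.

```python
import numpy as np

def affine_params(p_pos_given_pos, p_pos_given_neg):
    """m_s and b_s from P(ybar=1|y=1,s) and P(ybar=1|y=0,s)."""
    # P(ybar=1|y=1,s) + P(ybar=0|y=0,s) - 1 simplifies to the difference below.
    m = p_pos_given_pos - p_pos_given_neg
    b = p_pos_given_neg
    return m, b

def threshold(m, b):
    """Threshold t_s, taken here as an absolute value."""
    return abs((m + 2 * b - 1) / (2 * m))

eta = np.linspace(0.0, 1.0, 101)           # grid of values for eta_s(x)
m, b = affine_params(1.0, 0.2)             # hypothetical transition values
eta_bar = m * eta + b                      # target distribution
t = threshold(m, b)

# Labels obtained by thresholding each distribution at 1/2.
disagree = (eta > 0.5) != (eta_bar > 0.5)
```

On this grid, every point where the two labels disagree lies within distance t of the decision boundary at 1/2, so a small t confines the label changes to inputs that were uncertain to begin with.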

Equality of Opportunity
In contrast to demographic parity, equality of opportunity (just as equality of accuracy) is satisfied by a perfect classifier. Imperfect classifiers, however, do not by default satisfy it: the true positive rate (TPR) is different for different subgroups. The reason for this is that while the classifier is optimized to have a high TPR overall, it is not optimized to have the same TPR in the subgroups.
The overall TPR is a weighted sum of the TPRs in the subgroups:
$$\mathrm{TPR} = P(s = 0 \mid y = 1) \cdot \mathrm{TPR}_{s=0} + P(s = 1 \mid y = 1) \cdot \mathrm{TPR}_{s=1}. \tag{11}$$
In datasets where the positive label y = 1 is heavily skewed toward one of the groups (say, group s = 1; meaning that P(s = 1|y = 1) is high and P(s = 0|y = 1) is low), overall TPR might be maximized by setting the decision boundary such that nearly all samples in s = 0 are classified as y = 0, while for s = 1 a high TPR is achieved. The low TPR for s = 0 is in this case weighted down and only weakly impacts the overall TPR. For s = 0, the resulting classifier uses s as a shorthand for y, mostly ignoring the other features. This problem usually persists even when s is removed from the input features because s is implicit in the other features.
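A small numerical illustration of this weighting effect follows. All numbers are hypothetical and chosen only to show how a skewed P(s|y = 1) hides a poor TPR in the smaller group.

```python
def overall_tpr(p_s0_given_pos, tpr_s0, tpr_s1):
    """Overall TPR as the weighted sum of the per-group TPRs."""
    return p_s0_given_pos * tpr_s0 + (1.0 - p_s0_given_pos) * tpr_s1

# Hypothetical classifier that nearly ignores the positives in group s = 0.
tpr_s0, tpr_s1 = 0.05, 0.95

skewed = overall_tpr(0.1, tpr_s0, tpr_s1)    # positives mostly in group s = 1
balanced = overall_tpr(0.5, tpr_s0, tpr_s1)  # positives split evenly
```

With the skewed weights the overall TPR stays high (0.86) despite the poor TPR for s = 0, whereas with equal weights the same classifier only reaches 0.5.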
A balanced dataset helps with this issue because in such datasets, s is not a useful proxy for the balanced labelȳ (because we have P(ȳ, s) = P(ȳ)P(s)) and s cannot be used as a shorthand. Assuming the dataset is balanced in s (P(s = 0) = P(s = 1)), for such datasets P(s = 0|y = 1) = P(s = 1|y = 1) holds and the two terms in Equation (11) have equal weight.
Here as well there is an accuracy-fairness trade-off: assuming the unconstrained model is as accurate as its model complexity allows, adding additional constraints like equality of opportunity can only make the accuracy worse.

Concrete Algorithm
For training, we are only given the unbalanced distribution η_s(x) and not the target distribution η̄_s(x). However, η̄_s(x) is needed in order to train a fair classifier. One approach is to explicitly change the labels y in the dataset, in order to construct η̄_s(x). We discuss this approach and its drawback in the related work section (section 4).
We present a novel approach which only implicitly constructs the balanced dataset. This framework can be used with any likelihood-based model, such as Logistic Regression and Gaussian Process models. The relation presented in Equation (4) allows us to formulate a likelihood that targets η̄_s(x) while only having access to the imbalanced labels y. As we only have access to y, P(y|x, s, θ) is the likelihood to optimize. It represents the probability that y is the imbalanced label, given the input x, the sensitive attribute s that is available in the training set, and the model parameters θ for a model that is targeting ȳ. Thus, we get
$$P(y = 1 \mid x, s, \theta) = \sum_{\bar{y} \in \{0, 1\}} P(y = 1 \mid \bar{y}, x, s, \theta) \, P(\bar{y} \mid x, s, \theta). \tag{12}$$
As we are only considering group fairness, we have P(y = 1|ȳ, x, s, θ) = P(y = 1|ȳ, s). Let f_θ(x, y′) be the likelihood function of a given model, where f gives the likelihood of the label y′ given the input x and the model parameters θ. As we do not want to make use of s at test time, f does not explicitly depend on s. The likelihood with respect to ȳ is then given by f: P(ȳ|x, s, θ) = f_θ(x, ȳ); and thus, does not depend on s. The latter is important in order to avoid direct discrimination (Barocas and Selbst, 2016). With these simplifications, the expression for the likelihood becomes
$$P(y = 1 \mid x, s, \theta) = \sum_{\bar{y} \in \{0, 1\}} P(y = 1 \mid \bar{y}, s) \, f_\theta(x, \bar{y}). \tag{13}$$
The conditional probabilities, P(y|ȳ, s), are closely related to the conditional probabilities in Equation (4) and play a similar role of "transition probabilities." Section 3 explains how to choose these transition probabilities in order to arrive at a balanced dataset. For a binary sensitive attribute s (and binary label y), there are 4 transition probabilities (see Algorithm 1, where d^{s=j}_{ȳ=i} := P(y = 1|ȳ = i, s = j)):
$$P(y = 1 \mid \bar{y} = i, s = j), \quad i, j \in \{0, 1\}.$$
A perhaps useful interpretation of Equation (13) is that, even though we don't have access to ȳ directly, we can still compute the expectation value over the possible values of ȳ. The above derivation applies to binary classification but can easily be extended to the multi-class case.

Algorithm 1: Fair learning with target labels ȳ
Output: Fair model parameters θ
Initialize θ (randomly)
for all x_i, y_i, s_i do
    update θ to maximize likelihood ℓ
end for
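The marginalized likelihood used in the update step can be sketched for a logistic regression model as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the array `d` holds the transition probabilities with `d[i, j] = P(y = 1 | ȳ = i, s = j)`, its numerical values are placeholders, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def observed_prob(theta, x, s, d):
    """P(y = 1 | x, s, theta), marginalized over the target label ybar."""
    p_target = sigmoid(x @ theta)  # f_theta(x, ybar = 1); does not use s
    return d[1, s] * p_target + d[0, s] * (1.0 - p_target)

def fair_log_likelihood(theta, x, y, s, d):
    """Log-likelihood of the observed labels y for a model targeting ybar."""
    p_obs = np.clip(observed_prob(theta, x, s, d), 1e-12, 1.0 - 1e-12)
    return float(np.sum(y * np.log(p_obs) + (1 - y) * np.log(1.0 - p_obs)))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))        # toy features
theta = rng.normal(size=3)         # toy parameters
y = np.array([1, 0, 0, 1, 1, 0])
s = np.array([0, 0, 0, 1, 1, 1])
# d[i, j] = P(y = 1 | ybar = i, s = j); placeholder values for illustration.
d = np.array([[0.0, 0.125],
              [0.5, 1.0]])
ll = fair_log_likelihood(theta, x, y, s, d)
```

With `d[1, :] = 1` and `d[0, :] = 0` the expression reduces to the ordinary logistic regression log-likelihood, so the approach only changes the model through the transition probabilities. Any gradient-based optimizer could then be used to maximize `fair_log_likelihood` with respect to `theta`.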

TRANSITION PROBABILITIES FOR A BALANCED DATASET
This section focuses on how to set values of the transition probabilities in order to arrive at balanced datasets.

Meaning of the Parameters
Before we consider concrete values, we give some intuition for the transition probabilities. Let s = 0 refer to the protected group. For this group, we want to make more positive predictions than the training labels indicate. The variable ȳ is supposed to be our target proxy label. Thus, in order to make more positive predictions, some of the y = 0 labels should be associated with ȳ = 1. However, we do not know which. So, if our model predicts ȳ = 1 (high P(ȳ = 1|x, θ)) while the training label is y = 0, then we allow for the possibility that this is actually correct. That is, P(y = 0|ȳ = 1, s = 0) is not 0. If we choose, for example, P(y = 0|ȳ = 1, s = 0) = 0.3, then that means that 30% of positive target labels ȳ = 1 may correspond to negative training labels y = 0. This way we can have more ȳ = 1 than y = 1, overall. On the other hand, predicting ȳ = 0 when y = 1 holds will always be deemed incorrect: P(y = 1|ȳ = 0, s = 0) = 0; this is because we do not want any additional negative labels.
For the non-protected group s = 1, we have the exact opposite situation. If anything, we have too many positive labels. So, if our model predicts ȳ = 0 (high P(ȳ = 0|x, θ)) while the training label is y = 1, then we should again allow for the possibility that this is actually correct. That is, P(y = 1|ȳ = 0, s = 1) should not be 0. On the other hand, P(y = 0|ȳ = 1, s = 1) should be 0 because we do not want additional positive labels for s = 1. It could also be that the number of positive labels is exactly as it should be, in which case we can just set y = ȳ for all data points with s = 1.

Choice of Parameters
A balanced dataset is characterized by an independence of the label ȳ and the sensitive attribute s. Given that we have complete control over the transition probabilities, we can ensure this independence by requiring P(ȳ = 1|s = 0) = P(ȳ = 1|s = 1). Our constraint is then that both of these probabilities are equal to the same value, which we will call the target rate PR_t ("PR" as positive rate):
$$P(\bar{y} = 1 \mid s = 0) = P(\bar{y} = 1 \mid s = 1) = PR_t. \tag{16}$$
This leads us to the following constraints for s′ ∈ {0, 1}:
$$PR_t = P(\bar{y} = 1 \mid s = s') = \sum_{y \in \{0, 1\}} P(\bar{y} = 1 \mid y, s = s') \, P(y \mid s = s'). \tag{17}$$
We call P(y = 1|s = j) the base rate PR_b^j, which we estimate from the training set:
$$PR_b^j = \frac{\text{number of points with } y = 1 \text{ in group } j}{\text{number of points in group } j}.$$
Expanding the sum, we get
$$PR_t = P(\bar{y} = 1 \mid y = 0, s = j) \cdot (1 - PR_b^j) + P(\bar{y} = 1 \mid y = 1, s = j) \cdot PR_b^j. \tag{18}$$
This is a system of linear equations consisting of two equations (one for each value of s′) and four free variables: P(ȳ = 1|y, s) with y, s ∈ {0, 1}. The two unconstrained degrees of freedom determine how strongly the accuracy will be affected by the fairness constraint. If we set P(ȳ = 1|y = 1, s) to 0.5, then this expresses the fact that a train label y of 1 only implies a target label ȳ of 1 in 50% of the cases. In order to minimize the effect on accuracy, we make P(ȳ = 1|y = 1, s) as high as possible and P(ȳ = 1|y = 0, s), conversely, as low as possible. However, the lowest and highest possible values are not always 0 and 1 respectively. To see this, we solve for P(ȳ = 1|y = 0, s = j) in Equation (18):
$$P(\bar{y} = 1 \mid y = 0, s = j) = \frac{PR_b^j}{1 - PR_b^j} \left( \frac{PR_t}{PR_b^j} - P(\bar{y} = 1 \mid y = 1, s = j) \right). \tag{19}$$
If PR_t/PR_b^j were greater than 1, then setting P(ȳ = 1|y = 0, s = j) to 0 would imply a P(ȳ = 1|y = 1, s = j) value greater than 1. A visualization that shows why this happens can be found in the Supplementary Material. We thus arrive at the following definitions:
$$P(\bar{y} = 1 \mid y = 1, s = j) = \begin{cases} 1 & \text{if } PR_t > PR_b^j \\ \dfrac{PR_t}{PR_b^j} & \text{otherwise} \end{cases} \tag{20}$$
$$P(\bar{y} = 1 \mid y = 0, s = j) = \begin{cases} \dfrac{PR_t - PR_b^j}{1 - PR_b^j} & \text{if } PR_t > PR_b^j \\ 0 & \text{otherwise.} \end{cases} \tag{21}$$
Algorithm 2 shows pseudocode of the procedure, including the computation of the allowed minimal and maximal value. Once all these probabilities have been found, the transition probabilities needed for Equation (13) are fully determined by applying Bayes' rule:
$$P(y = 1 \mid \bar{y}, s) = \frac{P(\bar{y} \mid y = 1, s) \, P(y = 1 \mid s)}{P(\bar{y} \mid s)}. \tag{22}$$
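The procedure described in this section can be sketched in a few lines. This is an illustrative reading of the text rather than the authors' Algorithm 2: it assumes that P(ȳ = 1|y = 1, s) is made as large and P(ȳ = 1|y = 0, s) as small as the balancing constraint allows, that base and target rates lie strictly between 0 and 1, and the function names are hypothetical.

```python
def target_given_label(pr_target, pr_base):
    """Return P(ybar=1|y=1,s) and P(ybar=1|y=0,s) for one group."""
    if pr_target > pr_base:
        # More positives are needed: keep all positives, flip some negatives.
        p_pos = 1.0
        p_neg = (pr_target - pr_base) / (1.0 - pr_base)
    else:
        # Fewer positives are needed: keep negatives, flip some positives.
        p_pos = pr_target / pr_base
        p_neg = 0.0
    return p_pos, p_neg

def transition_probs(pr_target, pr_base):
    """Bayes' rule: return P(y=1|ybar=1,s) and P(y=1|ybar=0,s)."""
    p_pos, _ = target_given_label(pr_target, pr_base)
    d1 = p_pos * pr_base / pr_target                  # P(y=1|ybar=1,s)
    d0 = (1.0 - p_pos) * pr_base / (1.0 - pr_target)  # P(y=1|ybar=0,s)
    return d1, d0

# Hypothetical base rates 0.1 (s = 0) and 0.3 (s = 1), target rate 0.2.
d1_s0, d0_s0 = transition_probs(0.2, 0.1)
d1_s1, d0_s1 = transition_probs(0.2, 0.3)
```

For the group with the lower base rate this gives P(y = 1|ȳ = 0, s = 0) = 0 and P(y = 1|ȳ = 1, s = 0) = 0.5, so half of the positive target labels may correspond to negative training labels. For the other group it gives P(y = 1|ȳ = 1, s = 1) = 1 and P(y = 1|ȳ = 0, s = 1) = 0.125.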

Choosing a Target Rate
As shown, there is a remaining degree of freedom when targeting a balanced dataset: the target rate PR_t := P(ȳ = 1). This is true for both fairness criteria that we are targeting. The choice of target rate affects how much η and η̄ differ, as implied by Theorem 1 (PR_t affects m_s and b_s). η̄ should remain close to η, as η̄ only represents an auxiliary distribution that does not have meaning on its own. The threshold t_s in Theorem 1 (Equation 8) gives an indication of how close the distributions are. With the definitions in Equations (20) and (21), we can express t_s in terms of the target rate and the base rate:
$$t_s = \begin{cases} \dfrac{PR_t - PR_b^s}{2 (1 - PR_t)} & \text{if } PR_t > PR_b^s \\[1ex] \dfrac{PR_b^s - PR_t}{2 PR_t} & \text{otherwise.} \end{cases} \tag{23}$$
This shows that t_s is smallest when PR_b^s and PR_t are closest. However, as PR_b^s has different values for different s, we cannot set PR_b^s = PR_t for all s. In order to keep both t_{s=0} and t_{s=1} small, it follows from Equation (23) that PR_t should at least be between PR_b^0 and PR_b^1. A more precise statement can be made when we explicitly want to minimize the sum t_{s=0} + t_{s=1}: assuming PR_b^0 < PR_t < PR_b^1 and PR_b^1 < 1/2, the optimal choice for PR_t is PR_b^1 (see Supplementary Material for details). We call this choice PR_t^max. For PR_b^0 > 1/2, analogous statements can be made, but this is of less interest as this case does not appear in our experiments.
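The effect of the target rate on the thresholds can be explored numerically. The sketch below assumes t_s = |(m_s + 2b_s − 1)/(2m_s)| together with the choice of making P(ȳ = 1|y = 1, s) as large and P(ȳ = 1|y = 0, s) as small as possible; substituting these gives the closed form used in `t_threshold`. The base rates 0.156 and 0.267 are the values reported in the experiments for the Adult dataset with race as the sensitive attribute; the function name is illustrative.

```python
import numpy as np

def t_threshold(pr_target, pr_base):
    """Threshold t_s as a function of the target rate and the base rate."""
    if pr_target > pr_base:
        return (pr_target - pr_base) / (2.0 * (1.0 - pr_target))
    return (pr_base - pr_target) / (2.0 * pr_target)

pr_base_0, pr_base_1 = 0.156, 0.267

# Sweep the target rate between the two base rates.
grid = np.linspace(pr_base_0, pr_base_1, 50)
total = np.array([t_threshold(t, pr_base_0) + t_threshold(t, pr_base_1)
                  for t in grid])
best = grid[np.argmin(total)]
```

On this grid the sum of the two thresholds is minimized at the upper end of the interval, which agrees with the statement that the larger base rate is the optimal target when both base rates are below 1/2.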
The previous statements about t_s do not directly translate into observable quantities like accuracy if the Tsybakov condition is not satisfied, and even if it is satisfied, the usefulness depends on the constants C and λ. Conversely, the following theorem makes a generally applicable statement about the accuracy that can be achieved. Before we get to the theorem, we introduce some notation. We are given a dataset D = {(x_i, y_i)}_i, where the x_i are vectors of features and the y_i the corresponding labels. We refer to the tuples (x, y) as the samples of the dataset. The number of samples is N = |D|.
We assume binary labels (y ∈ {0, 1}) and thus can form the (disjoint) subsets Y_0 and Y_1 with
$$Y_j = \{(x, y) \in D \mid y = j\}, \quad j \in \{0, 1\}.$$
Furthermore, we associate each sample with a classification ŷ ∈ {0, 1}. The task of making the classification ŷ = 0 or ŷ = 1 can be understood as sorting each sample from D into one of two sets: C_0 and C_1, such that C_0 ∪ C_1 = D and C_0 ∩ C_1 = ∅. We refer to the set A = (C_0 ∩ Y_0) ∪ (C_1 ∩ Y_1) as the set of correct (or accurate) predictions. The accuracy is given by acc = N^{−1} · |A|.

Definition 2.
$$r_a = \frac{|Y_1|}{|D|}$$
is called the base acceptance rate of the dataset D.

Definition 3.
$$\hat{r}_a = \frac{|C_1|}{|D|}$$
is called the predictive acceptance rate of the predictions.
Theorem 2. For a dataset with the base rate r_a and corresponding predictions with a predictive acceptance rate of r̂_a, the accuracy is limited by
$$acc \leq 1 - |\hat{r}_a - r_a|.$$
Corollary 2.1. Given a dataset that consists of two subsets S_0 and S_1 (D = S_0 ∪ S_1) where p is the ratio of |S_0| to |D| and given corresponding acceptance rates r_a^0 and r_a^1 and predictions with target rates r̂_a^0 and r̂_a^1, the accuracy is limited by
$$acc \leq 1 - p \cdot |\hat{r}_a^0 - r_a^0| - (1 - p) \cdot |\hat{r}_a^1 - r_a^1|.$$
The proofs are fairly straightforward and can be found in the Supplementary Material. Corollary 2.1 implies that in the common case where group s = 0 is disadvantaged (r_a^0 < r_a^1) and also underrepresented (p < 1/2), the highest accuracy under demographic parity can be achieved at PR_t = r_a^1 with
$$acc \leq 1 - p \cdot (r_a^1 - r_a^0).$$

However, this means willingly accepting a lower accuracy in the (smaller) subset S_0 that is compensated by a very good accuracy in the (larger) subset S_1. A decidedly "fairer" approach is to aim for the same accuracy in both subsets. This is achieved by using the average of the base acceptance rates for the target rate. As we balance the test set in our experiments, this kind of sacrificing of one demographic group does not work there. We compare the two choices (PR_t^max and PR_t^avg) in section 5.
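The accuracy limits can be made concrete with a short sketch. It assumes the bound takes the form acc ≤ 1 − |r̂_a − r_a|, which follows from counting how many predictions must be wrong when the number of predicted positives differs from the number of actual positives. The group proportion p = 0.3 is hypothetical, the two base rates are those reported for the Adult dataset with race as the sensitive attribute, and the function names are illustrative.

```python
def accuracy(y, y_hat):
    """Fraction of predictions that match the labels."""
    return sum(a == b for a, b in zip(y, y_hat)) / len(y)

def accuracy_bound(r_base, r_pred):
    """Upper limit on accuracy given base and predictive acceptance rates."""
    return 1.0 - abs(r_pred - r_base)

def group_accuracy_bound(p, r0, r1, target):
    """Limit for two groups that are both predicted at the same target rate."""
    return p * accuracy_bound(r0, target) + (1.0 - p) * accuracy_bound(r1, target)

# Tiny check of the single-group bound on hand-made labels.
y = [1, 1, 0, 0, 0]
y_hat = [1, 1, 1, 1, 0]
acc = accuracy(y, y_hat)                                      # 0.6
bound = accuracy_bound(sum(y) / len(y), sum(y_hat) / len(y))  # 1 - |0.8 - 0.4|

# Two choices of target rate under demographic parity.
p, r0, r1 = 0.3, 0.156, 0.267
bound_max = group_accuracy_bound(p, r0, r1, r1)
bound_avg = group_accuracy_bound(p, r0, r1, (r0 + r1) / 2)
```

Targeting the larger base rate leaves the larger group untouched and yields the higher overall limit, while the average target spreads the accuracy loss evenly over both groups.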

Conditionally Balanced Dataset
There is a fairness definition related to demographic parity which allows conditioning on "legitimate" risk factors ℓ when considering how equally the demographic groups are treated (Corbett-Davies et al., 2017). This cleanly translates into balanced datasets which are balanced conditioned on ℓ:
$$P(\bar{y}, s \mid \ell) = P(\bar{y} \mid \ell) \, P(s \mid \ell).$$
We can interpret this as splitting the data into partitions based on the value of ℓ, where the goal is to have all these partitions be balanced. This can easily be achieved by our method by setting a PR_t(ℓ) for each value of ℓ and computing the transition probabilities for each sample depending on ℓ.

RELATED WORK
There are several ways to enforce fairness in machine learning models: as a pre-processing step (Kamiran and Calders, 2012; Zemel et al., 2013; Louizos et al., 2016; Lum and Johndrow, 2016; Chiappa, 2019; Quadrianto et al., 2019), as a post-processing step (Feldman et al., 2015; Hardt et al., 2016), or as a constraint during the learning phase (Calders et al., 2009; Zafar et al., 2017a,b; Donini et al., 2018; Dimitrakakis et al., 2019). Our method enforces fairness during the learning phase (an in-processing approach) but, unlike other approaches, we do not cast fair learning as a constrained optimization problem. Constrained optimization requires a customized procedure. In Goh et al. (2016), Zafar et al. (2017a), and Zafar et al. (2017b), suitable majorization-minimization/convex-concave procedures (Lanckriet and Sriperumbudur, 2009) were derived. Furthermore, such constrained optimization approaches may lead to more unstable training, and often yield classifiers that are both less accurate and more unfair (Cotter et al., 2018). The approaches most closely related to ours were given by Kamiran and Calders (2012), who present four pre-processing methods: Suppression, Massaging the dataset, Reweighing, and Sampling. In our comparison we focus on methods 2, 3, and 4, because the first one simply removes sensitive attributes and those features that are highly correlated with them. All the methods given by Kamiran and Calders (2012) aim only at enforcing demographic parity.
The massaging approach uses a classifier to first rank all samples according to their probability of having a positive label (y = 1) and then flips the labels that are closest to the decision boundary such that the data then satisfies demographic parity. This pre-processing approach is similar in spirit to our in-processing method but differs in the execution. In our method (section 3.2), "ranking" and classification happen in one step and labels are not explicitly flipped but assigned probabilities of being flipped.
The reweighing method reweighs samples based on whether they belong to an over-represented or under-represented demographic group. The sampling approach is based on the same idea but works by resampling instead of reweighing. Both reweighing and sampling aim to effectively construct a balanced dataset, without affecting the labels. This is in contrast to our method, which treats the class labels as potentially untrustworthy and allows defying them.
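For comparison, the reweighing idea can be sketched as follows. The weight formula used here, the ratio of the expected to the observed probability of each (s, y) cell, is the commonly stated form of the method of Kamiran and Calders (2012); the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def reweighing_weights(y, s):
    """Weight each sample by P(s) * P(y) / P(s, y) for its (s, y) cell."""
    y, s = np.asarray(y), np.asarray(s)
    w = np.empty(len(y), dtype=float)
    for sv in np.unique(s):
        for yv in np.unique(y):
            cell = (s == sv) & (y == yv)
            if cell.any():
                w[cell] = (s == sv).mean() * (y == yv).mean() / cell.mean()
    return w

# Toy data (hypothetical): group 0 has one positive, group 1 has three.
y = np.array([1, 0, 0, 0, 1, 1, 1, 0])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = reweighing_weights(y, s)

# Weighted positive rate per group after reweighing.
pr0 = np.sum(w[(s == 0) & (y == 1)]) / np.sum(w[s == 0])
pr1 = np.sum(w[(s == 1) & (y == 1)]) / np.sum(w[s == 1])
```

After reweighing, both groups have the same weighted positive rate, so the weighted dataset is balanced while every label is left unchanged.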
One approach in Calders and Verwer (2010) is also worth mentioning. It is based on a generative Naïve Bayes model in which a latent variable L is introduced which is reminiscent of our target label ȳ. We provide a discriminative version of this approach. In discriminative models, parameters capture the conditional relationship of an output given an input, while in generative models, the joint distribution of input-output is parameterized. With this conditional relationship formulation (P(y|ȳ, s) = P(ȳ|y, s)P(y|s)/P(ȳ|s)), we can have detailed control in setting the target rate. Calders and Verwer (2010) focus only on the demographic parity fairness metric.

EXPERIMENTS
We compare the performance of our target-label model with other existing models based on two real-world datasets. These datasets have been previously considered in the fairness-aware machine learning literature.

Implementation
The proposed method is compatible with any likelihood-based algorithm. We consider both a non-parametric and a parametric model. The non-parametric model is a Gaussian process model, and logistic regression is the parametric counterpart. Since our fairness approach is not framed as a constrained optimization problem, we can reuse off-the-shelf toolboxes including the GPyTorch library by Gardner et al. (2018) for Gaussian process models. This library incorporates recent advances in scalable variational inference including variational inducing inputs and likelihood ratio/REINFORCE estimators. The variational posterior can be derived from the likelihood and the prior. We just need to modify the likelihood to take into account the target labels (Algorithm 1).

Data
We run experiments on two real-world datasets. The first dataset is the Adult Income dataset (Dua and Graff, 2019). It contains 32,561 data points with census information from US citizens. The labels indicate whether the individual earns more (y = 1) or less (y = 0) than $50,000 per year. We use the dataset with either race or gender as the sensitive attribute. The input dimension, excluding the sensitive attributes, is 12 in the raw data; the categorical features are then one-hot encoded. For the experiments, we removed 2,399 instances with missing data and used only the training data, which we split randomly for each trial run. The second dataset is the ProPublica recidivism dataset. It contains data from 6,167 individuals that were arrested. The data was collected when investigating the COMPAS risk assessment tool (Angwin et al., 2016). The task is to predict whether the person was rearrested within two years (y = 1 if they were rearrested, y = 0 otherwise). We again use the dataset with either race or gender as the sensitive attribute.

Balancing the Test Set
Any fairness method that targets demographic parity treats the training set as defective in one way: the acceptance rates are not equal in the training set and this needs to be corrected. As such, it does not make sense to evaluate these methods on a dataset that is equally defective. Predicting at equal acceptance rates is the correct result and the test set should reflect this.
In order to generate a test set which has the property of equal acceptance rates, we subsample the given, imbalanced, test set. For evaluating demographic parity, we discard datapoints from the imbalanced test set such that the resulting subset satisfies P(s = j|y = i) = 1/2 for all i and j. This balances the set in terms of s and ensures P(y, s) = P(y)P(s), but does not force the acceptance rate to be 1/2, which in the case of the Adult dataset would be a severe change as the acceptance rate is naturally quite low there. Using the described method ensures that the minimal amount of data is discarded for the Adult dataset. We have empirically observed that all fairness algorithms benefit from this balancing of the test set.
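One way to realize this subsampling is sketched below. It is an assumption about the mechanics rather than the authors' code: for each label value, both groups are randomly reduced to the size of the smaller one, which yields P(s = j|y = i) = 1/2. The function name, the seed, and the toy data are illustrative.

```python
import numpy as np

def balance_test_set(y, s, seed=0):
    """Return sorted indices of a subset with P(s = j | y = i) = 1/2."""
    rng = np.random.default_rng(seed)
    y, s = np.asarray(y), np.asarray(s)
    keep = []
    for label in (0, 1):
        # Indices of the samples with this label, separately per group.
        cells = [np.flatnonzero((y == label) & (s == group)) for group in (0, 1)]
        n = min(len(c) for c in cells)
        # Keep n randomly chosen samples from each group.
        keep.extend(rng.choice(c, size=n, replace=False) for c in cells)
    return np.sort(np.concatenate(keep))

# Toy test set (hypothetical): group 0 has 2 of 8 positives, group 1 has 4 of 8.
y = np.array([1] * 2 + [0] * 6 + [1] * 4 + [0] * 4)
s = np.array([0] * 8 + [1] * 8)
idx = balance_test_set(y, s)
y_bal, s_bal = y[idx], s[idx]
```

In the toy example the balanced subset keeps 12 of the 16 samples. Both groups end up with the same acceptance rate of 1/3, which shows that the overall acceptance rate is not forced to 1/2.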
The situation is different for equality of opportunity. A perfect classifier automatically satisfies equality of opportunity on any dataset. Thus, an algorithm aiming for this fairness constraint should not treat the dataset as defective. Consequently, for evaluating equality of opportunity we perform no balancing of the test set.

Method
We evaluate two versions of our target label model: FairGP, which is based on Gaussian Process models, and FairLR, which is based on logistic regression. We also train baseline models that do not take fairness into account.
In both FairGP and FairLR, our approach is implemented by modifying the likelihood function. First, the unmodified likelihood is computed (corresponding to P(ȳ = 1|x, θ )) and then a linear transformation (dependent on s) is applied as given by Equation (13). No additional ranking of the samples is needed, because the unmodified likelihood already supplies ranking information.
The fair GP models and the baseline GP model are all based on variational inference and use the same settings. During training, each batch is equivalent to the whole dataset. The number of inducing inputs is 500 on the ProPublica dataset and 2500 on the Adult dataset, which corresponds to approximately 1/8 of the number of training points for each dataset. We use a squared-exponential (SE) kernel with automatic relevance determination (ARD) and the probit function as the likelihood function. We optimize the hyper-parameters and the variational parameters using the Adam method (Kingma and Ba, 2015) with the default parameters. We use the full covariance matrix for the Gaussian variational distribution.
The logistic regression is trained with RAdam and uses L2 regularization. For the regularization coefficient, we conducted a hyper-parameter search over 10 folds of the data. For each fold, we picked the hyper-parameter which achieved the best fairness among those 5 with the best accuracy scores. We then averaged over the 10 hyper-parameter values chosen in this way and then used this average for all runs to obtain our final results.
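The selection rule for the regularization coefficient can be sketched as follows. The sketch assumes a fairness score for which larger values are better; the candidate values, scores, and function name are hypothetical and only illustrate the rule for a single fold.

```python
import numpy as np

def select_for_fold(candidates, accuracy, fairness, top_k=5):
    """Pick the fairest candidate among the top_k most accurate ones."""
    candidates, accuracy, fairness = map(np.asarray, (candidates, accuracy, fairness))
    top = np.argsort(accuracy)[-top_k:]       # indices of the top_k accuracies
    best = top[np.argmax(fairness[top])]      # fairest among those
    return candidates[best]

# Hypothetical results for one fold.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
accuracy = [0.80, 0.85, 0.86, 0.84, 0.83, 0.70]
fairness = [0.90, 0.60, 0.70, 0.95, 0.80, 0.99]
chosen = select_for_fold(candidates, accuracy, fairness)
```

Here the fairest candidate overall (1000.0) is excluded because it is not among the five most accurate, so the rule returns 10.0. Repeating this for each fold and averaging the chosen values gives the coefficient used for the final runs.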
In addition to the GP and LR baselines, we compare our proposed model with the following methods: a Support Vector Machine (SVM), the "reweighing" method of Kamiran and Calders (2012), the method of Agarwal et al. (2018) (using logistic regression as the classifier), and several methods given by Zafar et al. (2017a,b), which include maximizing accuracy under demographic parity fairness constraints (ZafarFairness), maximizing demographic parity fairness under accuracy constraints (ZafarAccuracy), and removing disparate mistreatment by constraining the false negative rate (ZafarEqOpp). Every method is evaluated over 10 repeats that each have different splits of the training and test set.

Results for Demographic Parity on Adult Dataset
Following Zafar et al. (2017b), we evaluate demographic parity on the Adult dataset. Table 1 shows the accuracy and fairness for several algorithms. In the table, and in the following, we use PR_{s=i} to denote the observed rate of positive predictions per demographic group, P(ŷ = 1|s = i). Thus, PR_{s=0}/PR_{s=1} is a measure of demographic parity, where a completely fair model would attain a value of 1.0. This measure is also called "disparate impact" (see e.g., Feldman et al., 2015; Zafar et al., 2017a). As the results in Table 1 show, FairGP and FairLR are clearly fairer than the baseline GP and LR. We use the mean (PR_t^{avg}) for the target acceptance rate. The difference between fair models and unconstrained models is not as large with race as the sensitive attribute, as the unconstrained models are already quite fair there. The results of FairGP are characterized by high fairness and high accuracy. FairLR achieves results similar to FairGP, generally with slightly lower accuracy but better fairness. We used the two-step procedure of Donini et al. (2018) to verify that we cannot achieve the same fairness result with just parameter search on LR.

Table 1 (surviving rows)
ZafarAccuracy (Zafar et al., 2017b): 0.67 ± 0.17, 0.808 ± 0.016, 0.77 ± 0.08, 0.853 ± 0.017
ZafarFairness (Zafar et al., 2017b): 0.81 ± 0.06, 0.879 ± 0.009, 0.74 ± 0.11, 0.897 ± 0.004
Kamiran and Calders (2012): 0.87 ± 0.07, 0.882 ± 0.007, 0.96 ± 0.03, 0.900 ± 0.004
Agarwal et al. (2018): 0.86 ± 0.08, 0.883 ± 0.008, 0.65 ± 0.04, 0.900 ± 0.004
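The demographic parity measure defined above, the ratio of group-wise positive rates, can be computed as in the following minimal sketch. The function names and the toy predictions are hypothetical and are not results from the paper.

```python
def positive_rate(y_pred, s, group):
    """Observed rate of positive predictions within one group: P(yhat = 1 | s = group)."""
    preds = [p for p, g in zip(y_pred, s) if g == group]
    return sum(preds) / len(preds)


def pr_ratio(y_pred, s):
    """Ratio of the positive rates of group 0 and group 1; 1.0 means demographic parity."""
    return positive_rate(y_pred, s, 0) / positive_rate(y_pred, s, 1)


# Toy example (illustration only): group 0 is accepted half as often as group 1.
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
s = [0, 0, 0, 0, 1, 1, 1, 1]
print(pr_ratio(y_pred, s))
```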
In Figure 1, we investigate which choice of target (PR_t^{avg}, PR_t^{min}, or PR_t^{max}) gives the best result. We use PR_t^{avg} for all following experiments as this is the fairest choice (cf. section 3.2). Figure 1A shows results from the Adult dataset with race as the sensitive attribute, where we have PR_t^{min} = 0.156, PR_t^{max} = 0.267, and PR_t^{avg} = 0.211. PR_t^{avg} performs best in terms of the trade-off. Figures 2A,B show runs of FairLR where we explicitly set a target acceptance rate, PR_t := P(ȳ = 1), instead of taking the mean PR_t^{avg}. A perfect targeting mechanism would produce a diagonal. The plots show that setting the target rate has the expected effect on the observed acceptance rate. This tuning of the target rate is the unique aspect of the approach. It would be very difficult to achieve with existing fairness methods; a new constraint would have to be added. The achieved positive rate is, however, usually a bit lower than the targeted rate (e.g., around 0.15 for the target 0.2). This is due to using imperfect classifiers; if TPR and TNR differ from 1, the overall positive rate is affected (see e.g., Forman, 2005 for a discussion of this). Figures 3A,B show the same data as Figure 2 but with different axes. It can be seen from Figures 3A,B that the fairness-accuracy trade-off is usually best when the target rate is close to the average of the positive rates in the dataset (which is around 0.2 for both sensitive attributes).
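The gap between targeted and achieved positive rate can be illustrated with the standard identity relating the observed positive rate to the underlying rate through TPR and TNR. The sketch below assumes this identity is the effect referred to above; the TPR and TNR values are hypothetical and chosen only to produce a shortfall of the same order as in the text.

```python
def observed_positive_rate(target_rate, tpr, tnr):
    """Positive rate produced by an imperfect classifier.

    True positives contribute tpr * target_rate, and false positives contribute
    (1 - tnr) * (1 - target_rate).
    """
    return tpr * target_rate + (1.0 - tnr) * (1.0 - target_rate)


# Hypothetical classifier quality: with a target of 0.2, the achieved rate falls short.
print(observed_positive_rate(0.2, tpr=0.6, tnr=0.97))
```

With a perfect classifier (TPR = TNR = 1) the achieved rate equals the target, which corresponds to the diagonal mentioned above.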

Results for Equality of Opportunity on ProPublica Dataset
For equality of opportunity, we again follow Zafar et al. (2017a) and evaluate the algorithm on the ProPublica dataset. As we did for demographic parity, we define a measure of equality of opportunity via the ratio of the true positive rates (TPRs) within the demographic groups. We use TPR_{s=i} to denote the observed TPR in group i, P(ŷ = 1|y = 1, s = i), and TNR_{s=i} for the observed true negative rate (TNR) in the same manner. The measure is then given by TPR_{s=0}/TPR_{s=1}. A perfectly fair algorithm would achieve 1.0 on the measure. Figures 5A,B show the achieved TPRs. In the accuracy-fairness plot, varying PR_t is shown to produce an inverted U-shape: higher PR_t still leads to improved fairness, but at a high cost in terms of accuracy.
The latter two plots make clear that the TPR ratio does not tell the whole story: the realization of the fairness constraint can differ substantially. By setting different target PRs for our method, we can affect TPRs as well, where higher PR_t leads to higher TPR, stemming from the fact that making more positive predictions increases the chance of making correct positive predictions. Figure 5 shows that our method can span a wide range of possible TPR values. Tuning these hidden aspects of fairness is the strength of our method.
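The TPR ratio, and the observation that the same ratio can hide very different TPR levels, can be illustrated with a small sketch. The function names and the toy labels are hypothetical and are not taken from the ProPublica experiments.

```python
def true_positive_rate(y_true, y_pred, s, group):
    """Observed TPR within one group: P(yhat = 1 | y = 1, s = group)."""
    hits = [p for t, p, g in zip(y_true, y_pred, s) if g == group and t == 1]
    return sum(hits) / len(hits)


def tpr_ratio(y_true, y_pred, s):
    """Ratio of the TPRs of group 0 and group 1; 1.0 means equality of opportunity."""
    return true_positive_rate(y_true, y_pred, s, 0) / true_positive_rate(y_true, y_pred, s, 1)


# Toy data (illustration only): every individual has y = 1, two per group.
y_true = [1, 1, 1, 1]
s = [0, 0, 1, 1]
cautious = [1, 0, 1, 0]  # TPR of 0.5 in both groups
generous = [1, 1, 1, 1]  # TPR of 1.0 in both groups

# Both predictors reach a ratio of 1.0 although their TPR levels differ.
print(tpr_ratio(y_true, cautious, s), tpr_ratio(y_true, generous, s))
```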

DISCUSSION AND CONCLUSION
Fairness is fundamentally not a challenge of algorithms alone, but very much a sociological challenge. Many proposals have emerged recently for defining and obtaining fairness in machine learning-based decision-making systems. The vast majority of academic work has focused on two categories of definitions: statistical (group) notions of fairness and individual notions of fairness (see Verma and Rubin, 2018 for at least twenty different notions of fairness). Statistical notions are easy to verify but do not provide protections to individuals. Individual notions do give individual protections but need strong assumptions, such as the availability of an agreed-upon similarity metric, which can be difficult to obtain in practice. We acknowledge that a proper solution to algorithmic fairness cannot rely on statistics alone. Nevertheless, these statistical fairness definitions can be helpful in understanding the problem and working toward solutions. To facilitate this, the trade-offs that are present should be made very clear at every step, and long-term effects have to be considered as well (Kallus and Zhou, 2018; Liu et al., 2018).
Here, we have developed a machine learning framework which allows us to learn from an implicit balanced dataset, thus satisfying the two most popular notions of fairness (Verma and Rubin, 2018), demographic parity (also known as avoiding disparate impact) and equality of opportunity (or avoiding disparate mistreatment). Additionally, we indicate how to extend the framework to cover conditional demographic parity as well. The framework allows us to set a target rate to control how the fairness constraint is realized. For example, we can set the target positive rate for demographic parity to be 0.6 for different groups. Depending on the application, it can be important to specify whether non-discrimination ought to be achieved by more positive predictions or more negative predictions. This capability is unique to our approach and can be used as an intuitive mechanism to control the realization of fairness. Our framework is general and is applicable to sensitive variables with binary and multi-level values. The current work focuses on a single binary sensitive variable. Future work could extend our tuning approach to other fairness concepts like the closely related predictive parity group fairness (Chouldechova, 2017) or individual fairness (Dwork et al., 2012).

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

FUNDING
This work was supported by the UK EPSRC project EP/P03442X/1 EthicalML: Injecting Ethical and Legal