
De-biasing predictors: a tutorial

v0.1 (21.05.2021)

(c) Ali Zaidi

What this tutorial is

A heuristic introduction to removing "unsavoury" features from datasets to avoid using them for making predictions.

What this tutorial is not

A rigorous meta-analysis of the biases that arise in datasets, the sociology of data collection, or a statement on fairness, how to measure it, how to ensure it, etc.

DISCLAIMER: I did not discover this bias in the dataset, nor did I invent this technique of cleaning data. A brief search on the internet will reveal the vast amount of work related to this. I'm using this dataset merely for illustrative purposes. This work is original as far as the code and text are concerned. That is to say, I wrote this piece myself and haven't copied code, text or metrics from anywhere else. However, this post is inspired by Vincent Warmerdam's work.

Background

One of the founding fathers of modern neural networks and AI, Prof. Yann LeCun, set off a Twitter storm a little over a year ago with a tweet on bias in algorithms. In essence, the statement was: algorithms themselves aren't biased, data is biased.

A major debate raged over who is to blame and who is responsible for ensuring "fairness" in AI-based systems, however fairness is defined or understood.

It is a fact that biases are commonplace in our society, whether deliberate or accidental. In case you're interested, there's an extensive list on Wikipedia.

Since data collection might be subject to some of these biases, they naturally become part of the data. As responsible data scientists, it is important to understand the variables we are working with, where they come from, what they mean, and how they affect our predictions.

As an example, we'll be checking out a very common dataset that most people come across very early on in their data science journeys: the Boston House price dataset.

Almost everyone who's played with regression would have come across this dataset, where the objective is to predict the price of a house based on various attributes regarding the house and neighbourhood.

We'll start by having a quick look at the input features...

First things first, inspect the keys in the data object:
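
Since the original code cells aren't shown here, a minimal sketch of this step might look as follows (assuming a scikit-learn version older than 1.2, where `load_boston` was still available):

```python
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

# Load the dataset as a Bunch object and inspect its keys
data = load_boston()
print(data.keys())
```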

"feauture_names" seems promising. Let's have a quick look at the features.

Ok. Well that's an abbreviated list. Not very informative. Let's try something else...

"DESCR" seems like a description. Cool.

Now that is informative!

I'd like to draw your attention to the last two features, though:

B: The proportion of blacks by town.

LSTAT: % Lower status of the population

Erm. That's a little awkward and uncomfortable. Should we really be using these as features for making predictions? Ideally we don't want to.

Maybe we can start by seeing how they affect our target variable.

But before that, a tiny bit of preprocessing and cleaning of the data, to make life easier. :)
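
A plausible version of that preprocessing step (the column name 'PRICE' for the target is my own choice):

```python
import pandas as pd

# Collect the features in a DataFrame and append the target as 'PRICE'
df = pd.DataFrame(data.data, columns=data.feature_names)
df['PRICE'] = data.target
df.head()
```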

Great!

Moving on to the issue in question: the questionable columns...
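
A quick scatter plot, roughly like the one described next, could be produced with:

```python
import matplotlib.pyplot as plt

# LSTAT against price: the trend is visibly downward
plt.scatter(df['LSTAT'], df['PRICE'], alpha=0.5)
plt.xlabel('LSTAT (% lower status of the population)')
plt.ylabel('Median house price ($1000s)')
plt.show()
```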

The overall trend suggests that the price drops with increasing proportion of 'low-income' families.

Let's have a look at the relationship of 'B' with price
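
Again, a sketch of the plot being discussed:

```python
# B against price: the trend looks mildly positive
plt.scatter(df['B'], df['PRICE'], alpha=0.5)
plt.xlabel('B = 1000(Bk - 0.63)^2')
plt.ylabel('Median house price ($1000s)')
plt.show()
```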

The devil is in the details

A positive trend? Not bad! The price seems to increase with 'B'. So does that mean that the higher the proportion of 'black' residents, the higher the price?

Nope! This variable is a little more cryptic. Let's revisit the description:

...

- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

...

So, they're taking the difference between a value and 0.63 and then squaring it.

Which means the further Bk deviates from 0.63 (i.e. 63%), the higher the score, irrespective of whether the actual proportion is above or below that value.

Meaning higher values of B could very well indicate a much lower proportion of black residents. This is what we sometimes call a proxy variable.
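
A tiny numerical check makes the point: two towns with very different proportions can land on the same B value.

```python
# Bk = 0.33 and Bk = 0.93 are equidistant from 0.63, so B is identical
for bk in (0.33, 0.93):
    print(f"Bk = {bk:.2f} -> B = {1000 * (bk - 0.63) ** 2:.1f}")
# Bk = 0.33 -> B = 90.0
# Bk = 0.93 -> B = 90.0
```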

These kinds of scenarios pop up often in real world data, and we need to be very careful while interpreting input variables.

Linear Regression on the raw data

We're now going to fit a simple linear model to our dataset, and use the mean squared error as a measure of 'loss'.

Good statistical practice would demand separate training and test sets to ensure proper validation. However, the purpose of this exercise is to study and evaluate bias, so we will be a little lax about validation here and use the entire dataset. Bad practice in general, but since we're interested in another metric, it will be good enough for now.
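
A minimal version of that fit, using the DataFrame from earlier:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = df.drop(columns='PRICE'), df['PRICE']

# Fit on the full dataset (deliberately skipping a train/test split, as noted above)
model = LinearRegression().fit(X, y)
preds = model.predict(X)
print(f"MSE: {mean_squared_error(y, preds):.2f}")
```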

Measuring bias

So far so good. Any tutorial on linear regression would be wrapping up right about now, but this is where our journey actually begins.

While thinking about bias (or fairness), a difficult task is to identify a proper measure to quantify it. The right metric is a contentious subject in itself, and definitely beyond the scope of this tutorial.

Since we're short on time, and mildly irresponsible, we need to care about a measure of bias, instead of the measure of bias.

For example, one way to measure bias would be to look at the differences between the model predictions for houses with low prices versus high prices. If the model performance differs, we will consider our model biased.

As a proxy for the amount of bias, let's look at the difference between the predictions for low vs high priced houses. However, as we are going to be playing around with the features (by modifying or removing them), we need a way to measure the bias relative to the modified features. A simple way to accomplish that is by normalizing the difference of low vs high by the overall prediction.

In short, we're looking at whether higher-priced houses have a predictive advantage over lower-priced houses, and we're normalizing that by the overall prediction to enable comparisons across various methods of "tweaking" our dataset.

The ideal case is to ensure "fair" predictions by having the model perform equally well for both types of houses.

Why this metric? Ideally, you'd want to measure the effect of the "biased" features on your dataset. But how do we do that if we intend to remove them? We therefore need to find an aspect of our target variable from which to calculate the bias. The one feature we have is the price of the house, and if we think about it, price is a proxy for economic disparity, which we are using as a proxy for bias here. Again, this is only for illustrative purposes.
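
Here is one plausible reading of that metric as code (not necessarily the exact formula from the original notebook): compare the mean absolute error on houses below the median price with the error on houses above it, and normalize by the mean prediction.

```python
import numpy as np

def bias_score(y_true, y_pred):
    # Difference in mean absolute error between low- and high-priced houses,
    # normalized by the mean prediction. Zero means the model performs
    # equally well on both groups.
    low = y_true < np.median(y_true)                      # low-priced houses
    err_low = np.abs(y_true[low] - y_pred[low]).mean()
    err_high = np.abs(y_true[~low] - y_pred[~low]).mean()
    return (err_low - err_high) / y_pred.mean()

print(f"Bias score on the raw model: {bias_score(y.values, preds):.3f}")
```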

Bootstrapping for statistics

Now that we have a measure for bias, we'd like to get some statistics in order to compare different methods of de-biasing.

To do that, we can do a simple bootstrap (sampling with replacement), and get a distribution of bias scores.

We'll be doing this exercise A LOT: modifying the dataset, fitting a regression line, bootstrapping the bias and plotting the results.

Writing these methods in a class is going to make life super easy.

The class below has all the things we've done above: namely, fitting a line, calculating MSE, bootstrapping and plotting. It inherits from the sklearn LinearRegression class, and adds bootstrapping and plotting to the functionality.
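
A minimal sketch of what such a class might look like (the method and attribute names are my own, and the bootstrap here refits a fresh model on each resample):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

class BiasedRegression(LinearRegression):
    """LinearRegression plus bootstrapped bias scores and a quick histogram."""

    def bootstrap_bias(self, X, y, n_boot=500, seed=0):
        rng = np.random.default_rng(seed)
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        scores = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y), len(y))          # sample with replacement
            preds = LinearRegression().fit(X[idx], y[idx]).predict(X[idx])
            scores.append(bias_score(y[idx], preds))
        self.bias_scores_ = np.array(scores)
        return self.bias_scores_

    def plot_bias(self, label=''):
        plt.hist(self.bias_scores_, bins=30, alpha=0.6, label=label)
        plt.xlabel('bias score')
        plt.ylabel('count')
        if label:
            plt.legend()
        plt.show()
```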

Awesome! Now we're equipped to evaluate the effects of our interventions on the training data.

Idea 1: Being politically correct (bias avoidance)

Our first idea is not to use the questionable features for constructing our model. So we'll just drop them. No bias in, no bias out, right?

The class we wrote above will help us test and quantify our hypothesis.
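
Something like the following, reusing the BiasedRegression sketch from above:

```python
# Idea 1: drop the questionable columns and refit
X_pc = X.drop(columns=['B', 'LSTAT'])

model_pc = BiasedRegression().fit(X_pc, y)
preds_pc = model_pc.predict(X_pc)
print(f"MSE:  {mean_squared_error(y, preds_pc):.2f}")
print(f"Bias: {bias_score(y.values, preds_pc):.3f}")

model_pc.bootstrap_bias(X_pc, y)
model_pc.plot_bias(label='dropped columns')
```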

We can see we've gained some error but lost some bias. That's good. But our bias is still kind of high.

The question is, when we removed the columns, did we also remove the information in those columns?

If the other columns are correlated with the ones we dropped, we're in trouble, and simply dropping them won't do us much good.

How do we check for that? A simple correlation analysis would do. A more involved way is to use linear regression again! If we can express our dropped columns as a linear combination of our politically correct columns, that means the information is still there!

Have a look at the $R^2$ values. There's definitely an information leak:
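
A sketch of that check: regress each dropped column on the remaining ones and report $R^2$.

```python
# If R^2 is high, the 'politically correct' columns still encode the dropped ones
for col in ['B', 'LSTAT']:
    r2 = LinearRegression().fit(X_pc, X[col]).score(X_pc, X[col])
    print(f"R^2 for predicting {col} from the remaining columns: {r2:.2f}")
```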

SVD to the rescue?

Someone familiar with linear algebra might say, "Use an SVD. That orthogonalizes the columns"

It decomposes the data matrix D into three matrices:

$$ D = U \Sigma V^{T}$$

Where the columns of $U$ form an orthonormal set of vectors.

This should orthogonalize out the unwanted data, shouldn't it?

Here's a test
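
For instance, one might fit on the left singular vectors $U$ of the remaining columns:

```python
# Orthonormal basis for the remaining columns via SVD
U, S, Vt = np.linalg.svd(X_pc.values, full_matrices=False)

model_svd = BiasedRegression().fit(U, y)
preds_svd = model_svd.predict(U)
print(f"MSE:  {mean_squared_error(y, preds_svd):.2f}")
print(f"Bias: {bias_score(y.values, preds_svd):.3f}")
```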

Absolutely no change in our results!

Bummer!

Removing columns doesn't remove information when the columns are codependent. Even if we do an SVD, the information still remains with high fidelity.

Along with removing the features, we need to remove the information of those features from the dataset. Only then can we truly call it 'clean'!

How do you remove the information of codependent features? Linear algebra, of course. A simple process called orthogonalization.

In essence, what you're doing is making the clean data orthogonal to the features you'd like to remove.

The Gram-Schmidt process

For a detailed explanation, check out the Wikipedia article here.

But for our purposes this explanation should suffice.

The process is pretty simple: Subtract the projections of the unwanted columns from the rest to make clean ones.

The formula for orthogonalizing a vector b against a vector a is as follows:

$$ v = b - a \cdot \frac{<a,b>}{||a||^{2}} $$

where $<a,b>$ is the inner product of a and b.

Let's break it down.

Project b onto a:

$$ Proj_{a}(b)= a \cdot \frac{<a,b>}{||a||^{2}} $$

We can then subtract this from b. This is effectively equivalent to walking along b, and then walking along $-Proj_{a}(b)$.

Here's an illustration to show the process:

GramSchmidt1.png

Let's write a small script to orthogonalize our features in two simple steps.

For each column, remove the projection of 'B' and then remove the projection of 'LSTAT'.
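
A sketch of that sequential version, together with a refit to see whether it helped (the helper name remove_projection is my own):

```python
def remove_projection(col, unwanted):
    """Return `col` minus its projection onto `unwanted` (both 1-D arrays)."""
    unwanted = unwanted.astype(float)
    return col - unwanted * (col @ unwanted) / (unwanted @ unwanted)

# Sequentially project the raw 'B' and then the raw 'LSTAT' column out of
# every remaining column
X_seq = X_pc.copy()
for col in X_seq.columns:
    v = X_seq[col].values.astype(float)
    v = remove_projection(v, X['B'].values)
    v = remove_projection(v, X['LSTAT'].values)
    X_seq[col] = v

model_seq = BiasedRegression().fit(X_seq, y)
preds_seq = model_seq.predict(X_seq)
print(f"MSE:  {mean_squared_error(y, preds_seq):.2f}")
print(f"Bias: {bias_score(y.values, preds_seq):.3f}")
```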

Wait, what?! Our code snippet above clearly removes the projections of the two unwanted columns, one after the other, from the remaining columns.

Yet that's not enough: there's something wrong with the orthogonalization.

Again, the devil is in the details.

We did the process sequentially. However, the orthogonal vectors are supposed to be generated recursively: because 'B' and 'LSTAT' are not orthogonal to each other, subtracting the 'LSTAT' projection reintroduces a component along 'B', unless 'LSTAT' is itself first orthogonalized against 'B'. This is an extremely, and often painfully, common error when applying Gram-Schmidt, and we have to be very careful about it!

An illustration of the right and wrong methods is below:

In brief: Sequentially: wrong; Recursively: right

GramSchmidt2.png

Correcting our code to do the orthogonalization the proper way, let's take another hack at it.
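
A corrected sketch: orthogonalize the unwanted directions against each other first, then subtract both from every remaining column.

```python
# Step 1: make 'LSTAT' orthogonal to 'B' (the recursive part of Gram-Schmidt)
b = X['B'].values.astype(float)
lstat_orth = remove_projection(X['LSTAT'].values.astype(float), b)

# Step 2: project both (now mutually orthogonal) directions out of each column
X_clean = X_pc.copy()
for col in X_clean.columns:
    v = X_clean[col].values.astype(float)
    v = remove_projection(v, b)
    v = remove_projection(v, lstat_orth)
    X_clean[col] = v

# Sanity check: each cleaned column should be numerically orthogonal to both
# 'B' and 'LSTAT'
for col in X_clean.columns:
    v = X_clean[col].values
    print(f"{col:8s} <v,B> = {v @ b: .1e}  <v,LSTAT> = {v @ X['LSTAT'].values: .1e}")
```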

Yippie! We've removed the unwanted information from our features!

Time to see how the regression model performs
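
Refitting on the orthogonalized features, roughly:

```python
model_clean = BiasedRegression().fit(X_clean, y)
preds_clean = model_clean.predict(X_clean)
print(f"MSE:  {mean_squared_error(y, preds_clean):.2f}")
print(f"Bias: {bias_score(y.values, preds_clean):.3f}")

model_clean.bootstrap_bias(X_clean, y)
model_clean.plot_bias(label='orthogonalized')
```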

Awesome! We've effectively removed the difference in prediction accuracy between low- and high-priced houses! This goes to show that our input features did affect the bias of our model, and that it is possible to identify and remove those features with a little bit of effort. Ok, maybe a slightly involved amount of effort. But it is possible.

However! We have sacrificed accuracy (MSE) to gain fairness. This is an important lesson: fairness has a price.

The obvious question is, can we find a balance between model performance and fairness?

Of course! And there are many, many ways to do that. A super simple idea is to predict using both the biased and fair models, and take a weighted average of their predictions.

Let's have a look at how that might work:
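
One way to sketch that trade-off, blending the raw ('biased') model with the orthogonalized ('fair') one via a weight alpha of my own choosing:

```python
def blended_predictions(alpha):
    """alpha = 1 -> fully biased model, alpha = 0 -> fully fair model."""
    return alpha * model.predict(X) + (1 - alpha) * model_clean.predict(X_clean)

# Sweep the weight and watch accuracy trade off against the bias score
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    p = blended_predictions(alpha)
    print(f"alpha = {alpha:.2f}   MSE = {mean_squared_error(y, p):6.2f}   "
          f"bias = {bias_score(y.values, p): .3f}")
```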

We can now find an acceptable balance between model performance and fairness.