COMPAS dataset

This page contains a quick overview of a dataset that we will use often in this course. We also use it to do a walkthrough of the ML pipeline.

Model training

Once we have figured out the training dataset and our target model class, the next step is to figure out the model from this class that best "explains" the training dataset.

This step of the ML pipeline is probably the most mathematically involved when compared to the other steps (there is some beautiful math in there though!), so we will mostly avoid talking about it in detail. We will talk about this step in a bit more detail later on in the course, but for now we will treat it as a blackbox.

Model training

Assume that your group has access to a blackbox that figures out the best possible model that separates the positive and negative data points in your training set. For example, here is what the learned linear model on the above training set looks like:

[Figure: the training data with the learned linear model separating positive and negative points.]
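To make this step concrete, here is a minimal sketch of what such a blackbox could look like in Python, using scikit-learn's LogisticRegression as one possible stand-in (the data points below are hypothetical placeholders, not the points in the plot above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 2D training points and their +/- labels (placeholders,
# not the actual dataset from the plot above).
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
                    [4.0, 1.0], [5.0, 2.0], [6.0, 1.5]])
y_train = np.array([1, 1, 1, -1, -1, -1])

# One possible "blackbox": logistic regression learns a linear separator
# w . x + b = 0 between the positive and negative points.
blackbox = LogisticRegression()
blackbox.fit(X_train, y_train)

w, b = blackbox.coef_[0], blackbox.intercept_[0]
print(f"learned line: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```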

Predict on test data

Now that you have your learned model, you have to figure out a test dataset to evaluate how good (or bad) the model is. We will leave the evaluation to the next step, but given a model it is easy to figure out what the label of a datapoint in the test dataset will be (for a linear model it is just a matter of figuring out which side of the line the test datapoints lie on). There are various possibilities here, but here is the obvious one:

Predict on test data

Your group decides to pick the $50\%$ of the original dataset that was not included in the training set, i.e., the remaining half (below we keep the learned linear model in the plot):

[Figure: the test data (the held-out $50\%$) plotted along with the learned linear model.]

Note that everything below the line is considered to be labeled positive and everything above the line is considered to be labeled negative (so in particular, the red point below the line and the green point above the line are not classified correctly).
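As a sketch (with made-up numbers for both the learned line and the test points), the side-of-the-line prediction rule can be written as follows:

```python
import numpy as np

# Prediction rule for a linear model w . x + b = 0: the label of a test
# point depends only on which side of the line it lies on. (Which side
# counts as positive depends on the signs of w and b.)
def predict(w, b, x):
    return 1 if np.dot(w, x) + b > 0 else -1  # ties go to negative

# Placeholder values for the learned line and the held-out test points;
# the actual numbers from the plots above are not reproduced here.
w, b = np.array([-1.0, 1.0]), -0.5
X_test = np.array([[1.5, 3.0], [5.5, 1.0], [3.0, 1.0], [2.5, 2.5]])

y_pred = [predict(w, b, x) for x in X_test]
print(y_pred)  # [1, -1, -1, -1]
```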

Evaluate error

Now that we have used our learned model to predict the labels in the test dataset, we need to evaluate the error. Again, there are multiple mathematically precise measures of error. For this example, we will again go with the obvious one:

Evaluate error

Your group decides to calculate the percentage of points in the test dataset that were mis-classified. So in the example above:

[Figure: the test data with the mis-classified points colored in black.]

The mis-classified points are colored in black, and so we have $2$ mis-classified points, which is an error rate of $\frac{2}{6} \approx 33\%$.
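In code, this error measure is a one-liner. Here is a sketch with hypothetical labels chosen so that, as in the example above, $2$ out of $6$ points are mis-classified:

```python
import numpy as np

# Misclassification rate: fraction of test points whose predicted label
# disagrees with the true label. Labels below are hypothetical.
y_test = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, -1, 1, -1, -1, 1])

error_rate = np.mean(y_pred != y_test)  # 2 of the 6 points disagree
print(f"error rate: {error_rate:.0%}")  # -> 33%
```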

Deploy!

Now your group is ready to deploy the learned model. Or is it?

Independence of ML pipeline steps from the problem

You might have noticed that once we fix the data representation, all of the remaining steps have procedures that are independent of the original problem you started off with. This is what makes ML powerful (since the same abstraction works in multiple scenarios), but it also leads to pitfalls/blindspots (as we will see later in the course).

Next Up

In the next lecture, we will consider steps in the ML pipeline related to data collection.