Problem-specific parts of the ML pipeline

This page talks in more detail about the first six steps in the ML pipeline. These are the steps that are most closely tied to the problem being solved.

Data representation: Example 1

Your group has zeroed in on queries, ads, and clicks. For the last of these, perhaps the most natural representation is to encode whether or not a user clicked on the ad (so either $+$ for clicked and $-$ for not clicked, or $1$ for clicked and $0$ for not clicked). The representation for the query and the ad is not as straightforward. We could store the exact text of the query and the ad, but that raises issues (e.g., what if two ad texts are distinct strings but are essentially the "same" for human consumption, or what if someone runs a query that has the same keywords as another query but in a different order?). To get around these issues with using the text as is, your group decides to use a representation that is standard in natural language processing: the bag of words model.
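To make the idea concrete, here is a minimal sketch of a bag of words representation in plain Python (the queries below are made-up examples, not data from the project): word order is thrown away, so two queries with the same keywords in a different order get the same representation.

```python
from collections import Counter

def bag_of_words(text):
    """Represent text as a multiset (bag) of its words,
    ignoring word order and capitalization."""
    return Counter(text.lower().split())

# Two queries with the same keywords in a different order...
q1 = bag_of_words("cheap flights to buffalo")
q2 = bag_of_words("buffalo cheap flights to")

# ...map to the same bag of words, so they compare as equal.
print(q1 == q2)  # True
```

In practice one would also normalize punctuation and use a proper tokenizer, but the sketch captures the key point: the reordered queries above are no longer "different" under this representation.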

Data representation: Example 2

In this case, since your group is using electronic health records, the data representation is pretty much already fixed. Perhaps one exception would be to represent the doctor's notes with the bag of words model, as above.

Data representation: In-class discussion

To be done in class.

Data representation: General thoughts

More generally, when creating datasets for learning systems you convert "raw" data into specific representations. The primary reason is that there can be a lot of variation in raw data, and a slightly coarser representation (one that technically "loses" some information) makes it easier to compare entries in your dataset. As we saw in the first example above, even though one could store the raw text of the query, doing so makes things harder down the line, so we decided to represent the query with the bag of words model.

Sometimes the representation is mandated by your data collection mechanism. E.g., in electronic health records, diseases, symptoms and so on are represented by diagnosis codes. Sometimes the representation depends on the instrument used to measure something (e.g., temperatures in medical records are not recorded to arbitrary precision but only to the precision of the thermometers used). If you are using online surveys, the data representation depends on how the responses are collected: e.g., are they free-form text or are they input via a checklist? In the latter case there is only a fixed number of possible values (in statistics this is referred to as categorical data, or a categorical variable).
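As an illustration, here is one common way a checklist-style (categorical) response might be turned into numbers so that a learning algorithm can use it, via one-hot encoding in pandas. The column name and values below are hypothetical, not from any dataset in this course.

```python
import pandas as pd

# Hypothetical survey responses collected via a checklist, so each
# answer comes from a fixed set of values: a categorical variable.
responses = pd.DataFrame({"smoker": ["yes", "no", "former", "no"]})

# One-hot encoding: one 0/1 column per possible value.
encoded = pd.get_dummies(responses["smoker"])
print(encoded)
```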

A digression: A Jupyter notebook exercise

Before we move on, let's use a Jupyter notebook to get a sense of how the data you collect can affect your accuracy at the end:

Load the notebook

Log on to Google Colab. Then download the Choosing Input Variables notebook from the notebooks page (here is a direct link). Load the notebook into Google Colab, which should look like this:

[Screenshot: the notebook loaded into Google Colab]

Ignore most of the notebook

The notebook trains a linear model on the COMPAS dataset, but you can safely ignore most of it for now. Pay attention to two things: the input variables/column names and the accuracy reported at the bottom.

Play around with removing/adding various input variables/column names and see how the accuracy changes at the bottom of the notebook.
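If you want to see the general shape of this experiment, here is a hedged sketch: the file name, column names, and choice of model below are assumptions for illustration and may not match the actual notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: the notebook's actual file name, column names,
# and model may differ -- this only sketches the experiment's shape.
df = pd.read_csv("compas.csv")

# Changing this list is the "removing/adding input variables" step.
input_variables = ["age", "priors_count", "juv_fel_count"]

X_train, X_test, y_train, y_test = train_test_split(
    df[input_variables], df["two_year_recid"], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy: {model.score(X_test, y_test):.3f}")
```

Each time you change `input_variables` and rerun, the reported accuracy changes, which is exactly the effect the exercise asks you to explore.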

Increase the accuracy

Can you pick a set of input variables so that the accuracy gets as close to $100\%$ as possible?

BONUS: Decrease the accuracy

Pick a subset of eight variables so that the accuracy is as close to $0\%$ as possible. The submission with the lowest accuracy will get 5 bonus points. More submission details are on the Bonus page.

Do not forget to run each cell in sequence

A common way to get an error message is to run cells out of order. When in doubt, start with the first cell and run all of the cells in sequence.

Next Up

Next, we will look at some well-known (and widely used) machine learning models (i.e., dive deeper into the seventh step of the ML pipeline).