# Overview

This notebook allows you to walk through various stages of the ML pipeline for the [COMPAS dataset](http://www-student.cse.buffalo.edu/~atri/ml-and-soc/support/notebooks/datasets/cox-violent-parsed_filt.csv).

# Acknowledgement

The code in this notebook was originally created by [Sanchit Batra](https://www.linkedin.com/in/sanchitbatra/). 

The [COMPAS dataset](http://www-student.cse.buffalo.edu/~atri/ml-and-soc/support/notebooks/datasets/cox-violent-parsed_filt.csv) was generated by [ProPublica](https://github.com/propublica/compas-analysis). This specific version is taken from [Kaggle](https://www.kaggle.com/danofer/compass#cox-violent-parsed_filt.csv), which in turn got the original data from ProPublica.

## Loading the dataset

The first step is to load the `csv` file for the [COMPAS dataset](http://www-student.cse.buffalo.edu/~atri/ml-and-soc/support/notebooks/datasets/cox-violent-parsed_filt.csv).

Just click the Run button.

In [2]:
#@title Load the dataset { display-mode: "form" }
import pandas as pd
import ipywidgets as widgets

#Load the CSV file
dataset = pd.read_csv('http://www-student.cse.buffalo.edu/~atri/algo-and-society/support/notebooks/datasets/cox-violent-parsed_filt.csv')

#display the CSV file
display(dataset)

Unnamed: 0,id,name,first,last,sex,dob,age,age_cat,race,juv_fel_count,decile_score,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_jail_in,c_jail_out,c_days_from_compas,c_charge_degree,c_charge_desc,is_recid,r_charge_degree,r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,violent_recid,is_violent_recid,vr_charge_degree,vr_offense_date,vr_charge_desc,type_of_assessment,decile_score.1,score_text,screening_date,v_type_of_assessment,v_decile_score,v_score_text,priors_count.1,event
0,1.0,miguel hernandez,miguel,hernandez,Male,18/04/1947,69,Greater than 45,Other,0,1,0,0,0,-1.0,13/08/2013 6:03,14/08/2013 5:41,1.0,(F3),Aggravated Assault w/Firearm,0,,,,,,,0,,,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,0,0
1,2.0,miguel hernandez,miguel,hernandez,Male,18/04/1947,69,Greater than 45,Other,0,1,0,0,0,-1.0,13/08/2013 6:03,14/08/2013 5:41,1.0,(F3),Aggravated Assault w/Firearm,0,,,,,,,0,,,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,0,0
2,3.0,michael ryan,michael,ryan,Male,6/2/85,31,25 - 45,Caucasian,0,5,0,0,0,,,,,,,-1,,,,,,,0,,,,Risk of Recidivism,5,Medium,31/12/2014,Risk of Violence,2,Low,0,0
3,4.0,kevon dixon,kevon,dixon,Male,22/01/1982,34,25 - 45,African-American,0,3,0,0,0,-1.0,26/01/2013 3:45,5/2/13 5:36,1.0,(F3),Felony Battery w/Prior Convict,1,(F3),,5/7/13,Felony Battery (Dom Strang),,,1,(F3),5/7/13,Felony Battery (Dom Strang),Risk of Recidivism,3,Low,27/01/2013,Risk of Violence,1,Low,0,1
4,5.0,ed philo,ed,philo,Male,14/05/1991,24,Less than 25,African-American,0,4,0,1,4,-1.0,13/04/2013 4:58,14/04/2013 7:02,1.0,(F3),Possession of Cocaine,1,(M1),0.0,16/06/2013,Driving Under The Influence,16/06/2013,,0,,,,Risk of Recidivism,4,Low,14/04/2013,Risk of Violence,3,Low,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18311,,alexsandra beauchamps,alexsandra,beauchamps,Female,21/12/1984,31,25 - 45,African-American,0,6,0,0,5,-1.0,28/12/2014 10:14,7/1/15 11:42,1.0,(M1),Battery,0,,,,,,,0,,,,Risk of Recidivism,6,Medium,29/12/2014,Risk of Violence,4,Low,5,0
18312,,winston gregory,winston,gregory,Male,1/10/58,57,Greater than 45,Other,0,1,0,0,0,-1.0,13/01/2014 5:48,14/01/2014 7:49,1.0,(F2),Aggravated Battery / Pregnant,0,,,,,,,0,,,,Risk of Recidivism,1,Low,14/01/2014,Risk of Violence,1,Low,0,0
18313,,farrah jean,farrah,jean,Female,17/11/1982,33,25 - 45,African-American,0,2,0,0,3,-1.0,8/3/14 8:06,9/3/14 12:18,1.0,(M1),Battery on Law Enforc Officer,0,,,,,,,0,,,,Risk of Recidivism,2,Low,9/3/14,Risk of Violence,2,Low,3,0
18314,,florencia sanmartin,florencia,sanmartin,Female,18/12/1992,23,Less than 25,Hispanic,0,4,0,0,2,-2.0,28/06/2014 12:16,30/06/2014 11:19,2.0,(F3),Possession of Ethylone,1,(M2),0.0,15/03/2015,Operating W/O Valid License,15/03/2015,,0,,,,Risk of Recidivism,4,Low,30/06/2014,Risk of Violence,4,Low,2,0


## Cleaning the dataset

Datasets are rarely in a form that can be used directly to generate ML models. Real-life data contains inaccuracies, data entry mistakes and so on.

For the COMPAS dataset the only real change we have to make is to convert certain [categorical variables](https://en.wikipedia.org/wiki/Categorical_variable) into numerical ones. Click the Run button to do this for the loaded dataset.

In [3]:
#@title Clean the dataset { display-mode: "form" }
from sklearn.preprocessing import LabelEncoder
import numpy as np


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
  """
  Clean up the dataset.
  Currently ONLY works for COMPAS
  """

  # Filter out rows where -1 flags a value outside of the valid range
  df = df[df['is_recid'] != -1]
  df = df[df['decile_score.1'] != -1]
  df = df[df['decile_score'] != -1]

  # Work on a copy so the column assignments below modify this frame
  # rather than a view of the original (avoids SettingWithCopyWarning)
  df = df.copy()

  # Copy the recidivism label into a column with a more descriptive name
  df['recidivism_within_2_years'] = df['is_recid']

  # Encode categorical variables as integer codes
  # (LabelEncoder assigns codes in sorted label order)
  le = LabelEncoder()
  for column in ['race', 'age_cat', 'v_score_text']:
    df[column] = le.fit_transform(df[column])

  # Binary encodings: 'Low' -> 0, otherwise 1; 'Male' -> 0, otherwise 1
  df['score_text'] = np.where(df['score_text'] == 'Low', 0, 1)
  df['sex'] = np.where(df['sex'] == 'Male', 0, 1)

  return df

#Clean the dataset

dataset = clean_dataset(dataset)

#display(dataset)
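
To make the categorical-to-numerical conversion concrete, here is a minimal sketch of what `LabelEncoder` does to a single column. The `toy` frame and the `age_cat_code` column are hypothetical and exist only for illustration; the label strings are `age_cat` values that appear in the COMPAS data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column reusing age_cat labels seen in the COMPAS data
toy = pd.DataFrame(
    {"age_cat": ["Greater than 45", "25 - 45", "Less than 25", "25 - 45"]}
)

encoder = LabelEncoder()
toy["age_cat_code"] = encoder.fit_transform(toy["age_cat"])

# classes_ holds the original labels in sorted order; each label's code
# is its position in that array
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)
print(toy)
```

Note that the codes follow the sorted order of the label strings, not the natural order of the age groups, so a model may read an ordering into them that is not meaningful.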


## Choosing the test/train split

Now that you have cleaned the COMPAS dataset you need to decide what fraction of the COMPAS dataset you will use for the testing dataset. 

First click the Run button and _then_ pick the fraction of the dataset that will be used as the test dataset. (The testing dataset is picked randomly and the rest of the COMPAS dataset forms the training data set.)

In [4]:
#@title Choose the test/train split
test_split_widget=widgets.FloatSlider(
    value=0.2,
    min=0,
    max=1.0,
    step=0.1,
    description='Test split:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.1f',
)

display(test_split_widget)

FloatSlider(value=0.2, continuous_update=False, description='Test split:', max=1.0, readout_format='.1f')

## Check your input

Click the Run button to verify that your test vs. train split was input correctly.

In [5]:
#@title Check your input { display-mode: "form" }
test_split_frac = float(test_split_widget.value)

print("You have chosen the test set to be "+str(test_split_frac*100)+"% of the dataset")

You have chosen the test set to be 10.0% of the dataset


## Split dataset into testing and training datasets

Click the Run button to split your dataset into testing and training sets.

In [6]:
#@title Split the dataset into training and testing sets { display-mode: "form" }
from sklearn.model_selection import train_test_split

#split the dataset into training/testing (the test set is the fraction chosen with the slider above)
training_set, test_set = train_test_split(dataset, test_size=test_split_frac)

#display(training_set)
#display(test_set)
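
As a rough illustration of how the chosen fraction translates into row counts, the sketch below splits a hypothetical ten-row frame (`toy`, not the COMPAS data) with `test_size=0.2`. The `random_state` argument is an addition that the notebook cell does not use; it only makes the random split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame standing in for the cleaned dataset
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# test_size is the fraction of rows placed in the test set;
# random_state fixes the shuffle so the split is reproducible
train_part, test_part = train_test_split(toy, test_size=0.2, random_state=0)

print(len(train_part), len(test_part))
# Every row lands in exactly one of the two sets
print(sorted(train_part.index.tolist() + test_part.index.tolist()))
```

As far as I know, scikit-learn rejects a fractional `test_size` of exactly 0 or 1.0 with an error, so the two end points of the slider are best avoided.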

## Choose target variable

First click the Run button and _then_ choose the target variable that you would like your ML model to predict.

In [7]:
#@title Choose target variable { display-mode: "form" }

columns = [
    "sex", 
    "race",
    "age",
    "age_cat",
    "score_text",
    "priors_count.1", 
    "priors_count", 
    "juv_fel_count", 
    "juv_misd_count", 
    "juv_other_count", 
    "is_violent_recid",
    "decile_score.1",
    "v_decile_score",
    "v_score_text",
    "recidivism_within_2_years"
]

"""
Choose the column being predicted (i.e. target variable)
"""

target_choice = widgets.RadioButtons(
    options=columns,
    value='recidivism_within_2_years',
    description='Target variable choice:',
    disabled=False
)

display(target_choice)


RadioButtons(description='Target variable choice:', index=14, options=('sex', 'race', 'age', 'age_cat', 'score…

## Check your input

Click the Run button to verify that your target variable was chosen correctly.

In [8]:
#@title Check your choice { display-mode: "form" }
target_var = target_choice.value

print("You chose the following target variable:  "+target_var)

You chose the following target variable:  recidivism_within_2_years


## Choose input variables for prediction

Choose among the following input variables (which are a _subset_ of the columns in the COMPAS dataset) that will be used in your ML model to predict the target variable.

**Note**: You should _not_ choose the target variable as one of your input variables. The notebook will print a warning if you do this but will not prevent you from running the rest of the steps in the notebook.

In [9]:
#@title Choose your input variables { display-mode: "form" }

input_choice=widgets.SelectMultiple(
    options=columns,
    value=['race','age'],
    #rows=10,
    description='Input Variable(s)',
    disabled=False
)

display(input_choice)

SelectMultiple(description='Input Variable(s)', index=(1, 2), options=('sex', 'race', 'age', 'age_cat', 'score…

## Check your input

Click the Run button to verify that your input variables were chosen correctly.

In [10]:
#@title Check your choice { display-mode: "form" }
input_columns= list(input_choice.value)

if target_var in input_columns:
  print("The target variable is part of the input variables. DO NOT PROCEED!!!")
else:
  print("You chose the following input variables:  "+str(input_columns))

You chose the following input variables:  ['sex', 'race', 'age', 'age_cat', 'score_text', 'priors_count.1', 'priors_count', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'is_violent_recid', 'decile_score.1', 'v_decile_score', 'v_score_text']
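
The check above warns when the target variable is among the inputs. The sketch below suggests why that matters, using purely synthetic data (the `noise` feature and random `labels` are assumptions for illustration, not COMPAS columns): the only informative input is a copy of the label itself, and the model still scores perfectly on the test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 1))       # feature unrelated to the label
labels = rng.integers(0, 2, size=200)   # random binary target

# "Leaky" inputs: the target itself is appended as an input column
leaky_inputs = np.hstack([noise, labels.reshape(-1, 1)])

X_train, X_test, y_train, y_test = train_test_split(
    leaky_inputs, labels, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
leaky_accuracy = accuracy_score(y_test, model.predict(X_test))
print(leaky_accuracy)
```

The tree simply reads the answer from the leaked column, so the score says nothing about how well the remaining inputs predict the target.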


## Update the training and testing sets

Drop the columns from the training and testing sets that are not among those you chose above.

In [11]:
#@title Create the final training and testing sets { display-mode: "form" }
#Get the training input values and labels
train_inputs = training_set[input_columns].to_numpy()
train_labels = training_set[target_var].to_numpy()

#display(train_inputs)
#display(train_labels)

#Get the testing input values and labels
test_inputs = test_set[input_columns].to_numpy()
test_labels = test_set[target_var].to_numpy()


## Choose the model class

Choose one of the following three model classes.

Remember to click Run first and _then_ choose your model class.

In [12]:
#@title Choose the model class { display-mode: "form" }
model_choice = widgets.RadioButtons(
    options=['Linear Classifier','Decision Tree','Neural Networks'],
    value='Linear Classifier',
    description='Model class choice:',
    disabled=False
)

display(model_choice)

RadioButtons(description='Model class choice:', options=('Linear Classifier', 'Decision Tree', 'Neural Network…
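
The three options correspond to scikit-learn estimators that share the same `fit`/`predict`/`score` interface, which is what lets the notebook swap one for another. Here is a minimal sketch of that idea on synthetic data from `make_classification`; the `candidates` dictionary and the fixed `random_state` values are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the COMPAS dataset)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical mapping from the widget labels to estimator instances
candidates = {
    "Linear Classifier": SGDClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Neural Networks": MLPClassifier(random_state=1, max_iter=1000),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)                   # train on the training set
    scores[name] = estimator.score(X_test, y_test)    # accuracy on the test set

print(scores)
```

The scores on this synthetic data are not indicative of how the model classes would rank on COMPAS.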

## Check your input

Click the Run button to verify that your model class was chosen correctly.

In [17]:
#@title Check your choice { display-mode: "form" }
model_class = model_choice.value

print("You have chosen the following model class: "+model_class)

You have chosen the following model class: Neural Networks


## Check the accuracy

Click the Run button to see how accurately your model, which is trained on the training set using the model class you chose, predicts the target variable on the testing set.

In [18]:
#@title Finish it off! { display-mode: "form" }
# Import stuff for linear classifier
from sklearn.linear_model import SGDClassifier
LinearClassifier = SGDClassifier

# Import stuff for decision trees
from sklearn import tree

# Import stuff for NNs
from sklearn.neural_network import MLPClassifier

# Import stuff for accuracy
from sklearn.metrics import accuracy_score

###################################################################################################################
# NOTE TO STUDENTS: Start reading from here
###################################################################################################################
try:
    if model_class == 'Linear Classifier':
      model = LinearClassifier()
    elif model_class == 'Decision Tree':
      model = tree.DecisionTreeClassifier()
    elif model_class == 'Neural Networks':
      #model = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
      model = MLPClassifier(random_state=1, max_iter=1000)

    # Train the model on the training inputs and their labels (the selected target column)
    model.fit(train_inputs, train_labels)
    
    # Predict labels using the testing dataset without the label column
    predictions = model.predict(test_inputs)
    
    # Measure accuracy as the fraction of predicted labels that match the true labels
    score = accuracy_score(test_labels, predictions)
    
    print("Accuracy score: {}%".format(str(round(score*100, 2))))
    
except ValueError as e:
    print(e)

Accuracy score: 68.94%
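
To help interpret an accuracy figure like the one above, the sketch below computes accuracy by hand on a small made-up example and compares it with a majority-class baseline, that is, always predicting the most common label. The arrays `y_true` and `y_pred` are hypothetical; the baseline comparison is an addition, not something the notebook computes.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# accuracy_score is the fraction of positions where the labels agree
accuracy = accuracy_score(y_true, y_pred)
manual = float(np.mean(y_true == y_pred))

# Baseline: predict the most frequent true label for every row
majority = np.bincount(y_true).argmax()
baseline = accuracy_score(y_true, np.full_like(y_true, majority))

print(accuracy, manual, baseline)
```

A model is only informative to the extent that its accuracy exceeds such a baseline, so it can be worth checking the class balance of the target variable before reading too much into a single score.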
