Playing with Pipelines and Grid Search
In my last blog I wrote about how to create a simple logistic regression model. Since then, I’ve learned a few nifty tricks to streamline the process and help prevent mistakes from creeping into my models. Even better, this process doesn’t just work on logistic regression, but on other model types as well, such as random forest, XGBoost, etc.
Say I want to create a simple model. I’d have to clean my data, scale and transform it, potentially perform feature selection, and train the model, all in separate steps. Furthermore, say I want to run multiple models or try different features; with all of these steps, the chances of data leakage occurring increase. Thankfully, there’s a way to quickly and conveniently perform all of these steps: pipelines.
To provide an example of this process, I downloaded a simple dataset on heart disease from Kaggle. The dataset looks at 13 features that contribute to heart disease, which is indicated under the ‘target’ column. Additional information about the features can be found below.
- age
- sex
- cp: chest pain type (4 values)
- trestbps: resting blood pressure
- chol: serum cholestoral in mg/dl
- fbs: fasting blood sugar > 120 mg/dl
- restecg: resting electrocardiographic results (values 0,1,2)
- thalach: maximum heart rate achieved
- exang: exercise induced angina
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0–3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
The dataset contains 14 columns and 303 rows. There were no missing values and the datatypes were floats or integers.
First I imported the libraries I needed and read in my dataset.
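The setup might look something like this. Since I can’t bundle the Kaggle file with the post, the snippet builds a small stand-in frame with the same 14 columns so it runs on its own; with the real download you would use the commented `read_csv` line instead (`heart.csv` is an assumed filename).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# df = pd.read_csv('heart.csv')   # with the real Kaggle download
# Stand-in frame with the same 14 columns so the snippet runs without the file:
rng = np.random.default_rng(0)
cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
        'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
df = pd.DataFrame({c: rng.integers(0, 4, 303) for c in cols})
```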
Then I ran a dummy classifier to get a baseline prediction. This step does not provide any insight into the data; rather, it shows us the class distribution. In other words, the percentage of those with heart disease.
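A baseline like this can be sketched with scikit-learn’s DummyClassifier (the two-column stand-in data here is mine, just to make the snippet self-contained):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease data.
rng = np.random.default_rng(0)
df = pd.DataFrame({'age': rng.integers(29, 78, 303),
                   'chol': rng.integers(126, 564, 303),
                   'target': rng.integers(0, 2, 303)})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], random_state=42)

# 'most_frequent' always predicts the majority class, so its accuracy
# equals that class's share of the test set.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
baseline = dummy.score(X_test, y_test)
```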
I created two iterations of my model, the second of which incorporates grid search in order to select the best parameters and perform cross validation.
First I created my features variable (X) and my target variable (y). After this, I split the data into training and testing sets. This is where pipelines come in handy, as I only have to do this once. As I mentioned in my previous blog, it’s extremely important to split the data before scaling in order to avoid data leakage.
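That step might look like the following (again with stand-in data in place of the Kaggle file; the `test_size` and `random_state` values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in; with the real data, df = pd.read_csv('heart.csv').
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 4, size=(303, 3)),
                  columns=['age', 'chol', 'thalach'])
df['target'] = rng.integers(0, 2, 303)

X = df.drop('target', axis=1)   # features
y = df['target']                # target

# Split BEFORE any scaling or fitting to avoid data leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```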
At this point I did some preprocessing to account for the categorical columns in the dataset. While I’m not 100% certain this is correct, I assigned sex, cp (chest pain), restecg (resting electrocardiographic), exang (exercise induced angina), slope and thal to be my categorical columns and assigned the remaining features as my numerical columns.
First I defined the columns.
Then I created two separate pipelines for the categorical and numerical columns. For the categorical column pipeline, I used SimpleImputer to replace any missing values with 0. (This was not necessary in this case because there were no missing values, but in general it’s a good idea to do this.) I also used OneHotEncoder to create dummy columns for my categorical features. For the numerical column pipeline, I again used SimpleImputer, as well as StandardScaler to scale my data.
Next I fit and transformed the training data in the numerical columns.
And finally I ran ColumnTransformer on a list of tuples, each containing a name, the corresponding pipeline and the columns that pipeline should transform, and assigned the result to a new variable called ‘preprocess’. ColumnTransformer is particularly useful because it allows different columns to be transformed separately, with the features generated by each transformer concatenated at the end.
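Putting the two pipelines and the ColumnTransformer together might look like this (a sketch; the imputation strategies and step names are my own choices, and the stand-in frame at the end only exists to check that the transformer runs):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols = ['sex', 'cp', 'restecg', 'exang', 'slope', 'thal']
num_cols = ['age', 'trestbps', 'chol', 'fbs', 'thalach', 'oldpeak', 'ca']

# Categorical pipeline: fill missing values with 0, then one-hot encode.
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# Numerical pipeline: impute, then standardize.
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
# ColumnTransformer applies each pipeline to its own columns and
# concatenates the resulting features.
preprocess = ColumnTransformer([
    ('cat', cat_pipe, cat_cols),
    ('num', num_pipe, num_cols),
])

# Quick check on stand-in data with the same column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({c: rng.integers(0, 4, 50) for c in cat_cols + num_cols})
Xt = preprocess.fit_transform(df)
```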
Pipeline Without Grid Search
Now it’s time to create my pipeline using scikit-learn’s Pipeline class. Notice I used the ‘preprocess’ transformer as the first step, so there is no need to scale the data again.
After this I fit the pipeline to the training data.
And finally, I ran a prediction on my test set and generated a recall score. I chose recall over accuracy or precision because recall is concerned with false negatives — that is, how many people were not diagnosed with heart disease who actually had heart disease. While a false positive would be inconvenient and cause emotional distress, a false negative would lead an individual to believe they are healthy and therefore not seek proper medical treatment.
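The whole pipeline-without-grid-search section can be sketched end to end like this. Since the real CSV isn’t included here, the frame is a synthetic stand-in with the same column names; `pipe` and `preprocess` are my own variable names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in; in practice df = pd.read_csv('heart.csv').
rng = np.random.default_rng(0)
cat_cols = ['sex', 'cp', 'restecg', 'exang', 'slope', 'thal']
num_cols = ['age', 'trestbps', 'chol', 'fbs', 'thalach', 'oldpeak', 'ca']
df = pd.DataFrame({c: rng.integers(0, 4, 303) for c in cat_cols})
for c in num_cols:
    df[c] = rng.normal(size=303)
df['target'] = rng.integers(0, 2, 303)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], random_state=42)

preprocess = ColumnTransformer([
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value=0)),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), num_cols),
])

# Full modeling pipeline: preprocessing, then the classifier.
pipe = Pipeline([('preprocess', preprocess),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
recall = recall_score(y_test, y_pred)
```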
Pipeline With Grid Search
And now for the second pipeline integrating grid search. The first step is the same as before — create the pipeline.
Next I defined my grid search parameters. In this case I selected max_iter, which sets the maximum number of iterations the solver may take to converge. I also selected solver, which tells the model which optimization algorithm to use.
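A possible grid (the candidate values are illustrative, and `clf` is the assumed name of the LogisticRegression step inside the pipeline):

```python
# The double underscore routes each parameter to the pipeline step
# named before it, i.e. 'clf__solver' sets solver on the 'clf' step.
param_grid = {
    'clf__max_iter': [100, 500, 1000],
    'clf__solver': ['lbfgs', 'liblinear'],
}
```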
After this I combined my logistic regression pipeline with my grid search parameters. In addition to this I specified how many cross-validation folds (cv) I want my model to use. Cross validation trains and scores the model on different splits of the training data, rotating which portion is held out, so the performance estimate depends less on any single split. In this case, I told my model to use 10 folds.
Now I can fit my new pipeline with the grid search parameters.
Next I checked which parameters from my grid search worked best on my model. In this case it was a max_iter of 100 and the lbfgs (limited-memory Broyden-Fletcher-Goldfarb-Shanno) algorithm.
Again, I ran a prediction on my test set and generated a recall score. In this case, the recall score was the same as without grid search, so in the end, there was no need to use grid search here.
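Put together, the grid search section might look like the sketch below. The data is again a synthetic stand-in, and the parameter values in the grid are illustrative rather than the ones from my actual run.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in; in practice df = pd.read_csv('heart.csv').
rng = np.random.default_rng(0)
cat_cols = ['sex', 'cp', 'restecg', 'exang', 'slope', 'thal']
num_cols = ['age', 'trestbps', 'chol', 'fbs', 'thalach', 'oldpeak', 'ca']
df = pd.DataFrame({c: rng.integers(0, 4, 303) for c in cat_cols})
for c in num_cols:
    df[c] = rng.normal(size=303)
df['target'] = rng.integers(0, 2, 303)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], random_state=42)

preprocess = ColumnTransformer([
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value=0)),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), num_cols),
])
pipe = Pipeline([('preprocess', preprocess),
                 ('clf', LogisticRegression())])

# Grid search over the pipeline with 10-fold CV, scored on recall.
param_grid = {'clf__max_iter': [100, 500, 1000],
              'clf__solver': ['lbfgs', 'liblinear']}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='recall')
grid.fit(X_train, y_train)
best = grid.best_params_                       # winning combination
recall = recall_score(y_test, grid.predict(X_test))
```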
Optional Step
An additional step one could take to further streamline the process would be to create a function to evaluate the model, display a confusion matrix and plot the ROC curve.
First I passed the training data through the function.
Then I passed my test data through the function. Note that the recall score for the test set is only marginally lower compared to the training set and the AUC is the same.
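One way to sketch such a helper is below. `evaluate` is a hypothetical name of mine, and this version returns the recall, confusion matrix and AUC rather than plotting them; a plotting variant could call ConfusionMatrixDisplay and RocCurveDisplay from sklearn.metrics instead. The stand-in data only exists so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def evaluate(model, X, y):
    """Return recall, the confusion matrix and ROC AUC for one data split."""
    y_pred = model.predict(X)
    scores = model.predict_proba(X)[:, 1]  # probabilities for the ROC/AUC
    return (recall_score(y, y_pred),
            confusion_matrix(y, y_pred),
            roc_auc_score(y, scores))

# Stand-in data with a learnable signal, plus a small pipeline to evaluate.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(303, 3)), columns=['age', 'chol', 'thalach'])
y = (X['age'] + rng.normal(scale=0.5, size=303) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)

# Pass the training set through, then the test set.
train_recall, train_cm, train_auc = evaluate(pipe, X_train, y_train)
test_recall, test_cm, test_auc = evaluate(pipe, X_test, y_test)
```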
And that’s how you create a model pipeline and integrate it with grid search!