Playing with Pipelines and Grid Search

Adam Cumurcu
6 min read · Jun 30, 2021

In my last blog I wrote about how to create a simple logistic regression model. Since then, I’ve learned a few nifty tricks to streamline the process and help prevent mistakes from creeping into my models. Even better, this process doesn’t just work on logistic regression, but on other model types as well, such as random forest, XGBoost, etc.

Say I want to create a simple model. I'd have to clean my data, scale it, transform it, train a model and potentially use feature selection, all in separate steps. Furthermore, say I want to run multiple models or try using different features; with all of these steps, the chances of data leakage occurring increase. Thankfully, there's a way to quickly and conveniently perform all of these steps: pipelines.

To provide an example of this process, I downloaded a simple dataset on heart disease from Kaggle. The dataset looks at 13 features that contribute to heart disease, which is indicated under the ‘target’ column. Additional information about the features can be found below.

  1. age
  2. sex
  3. cp: chest pain type (4 values)
  4. trestbps: resting blood pressure
  5. chol: serum cholesterol in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl
  7. restecg: resting electrocardiographic results (values 0,1,2)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
  12. ca: number of major vessels (0–3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The dataset contains 14 columns and 303 rows. There were no missing values and the datatypes were floats or integers.

First I imported the libraries I needed and read in my dataset.

Import libraries
Read in dataset
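A minimal sketch of what these two steps might look like (the file name heart.csv is an assumption):

```python
import pandas as pd

# Read in the Kaggle heart disease dataset (file name assumed)
df = pd.read_csv('heart.csv')
df.head()
```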

Then I ran a dummy classifier to get a baseline prediction. This step does not provide any insight about the data, but rather shows us the class distribution: in other words, the percentage of those with heart disease.

Run dummy classifier
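Something along these lines, assuming a most_frequent strategy for the baseline:

```python
from sklearn.dummy import DummyClassifier

# Baseline that always predicts the majority class; its accuracy is
# simply the proportion of the most common class in the target
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(df.drop('target', axis=1), df['target'])
dummy.score(df.drop('target', axis=1), df['target'])
```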

I created two iterations of my model, the second of which incorporates grid search in order to select the best parameters and perform cross validation.

First I created my features variable (X) and my target variable (y). After this, I split the data into training and testing sets. This is where pipelines come in handy, as I only have to do this once. As I mentioned in my previous blog, it’s extremely important to split the data before scaling in order to avoid data leakage.

Create a features variable and target variable, then perform train test split.
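A sketch of this step (the test size and random state are assumptions):

```python
from sklearn.model_selection import train_test_split

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split before any scaling or encoding to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```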

At this point I did some preprocessing to account for the categorical columns in the dataset. While I’m not 100% certain this is correct, I assigned sex, cp (chest pain), restecg (resting electrocardiographic), exang (exercise induced angina), slope and thal to be my categorical columns and assigned the remaining features as my numerical columns.

First I defined the columns.

Define categorical columns
Define numerical columns
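Based on the grouping described above, the lists might look like this:

```python
# Categorical features to be one-hot encoded
cat_cols = ['sex', 'cp', 'restecg', 'exang', 'slope', 'thal']

# Remaining features treated as numerical
num_cols = ['age', 'trestbps', 'chol', 'fbs', 'thalach', 'oldpeak', 'ca']
```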

Then I created two separate pipelines for the categorical and numerical columns. For the categorical column pipeline, I used SimpleImputer to replace any missing values with 0. (This was not necessary in this case because there were no missing values, but in general it's a good idea.) I also used OneHotEncoder to create dummy columns for my categorical features. For the numerical column pipeline, I again used SimpleImputer, as well as StandardScaler to scale my data.

Create pipelines for categorical and numerical columns
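A sketch of the two pipelines; the imputation strategy for the numerical columns is an assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Categorical pipeline: fill missing values with 0, then one-hot encode
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

# Numerical pipeline: impute (median strategy assumed), then scale
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
```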

Next I fit and transformed the training data in the numerical columns.

Fit and transform the training data in the numerical columns
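Something like this, assuming the numerical pipeline is fit on the training set's numerical columns only:

```python
# Fit and transform only the numerical columns of the training data
X_train_num = num_pipe.fit_transform(X_train[num_cols])
```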

And finally I ran ColumnTransformer on a list of tuples containing the names of my pipelines, the actual pipelines and the columns each pipeline applies to, and assigned this to a new variable called ‘preprocess’. ColumnTransformer is particularly useful because it allows different columns to be transformed separately and the features generated by each transformer to be concatenated.

Run ColumnTransformer on my categorical and numerical pipelines in preparation for using them in my model
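A sketch of the ColumnTransformer step:

```python
from sklearn.compose import ColumnTransformer

# Each pipeline is applied to its own columns and the resulting
# feature sets are concatenated into a single output
preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_pipe, cat_cols),
    ('numerical', num_pipe, num_cols)
])
```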

Pipeline Without Grid Search

Now it's time to create my pipeline using scikit-learn's Pipeline module. Notice that the preprocessing step defined above is included in the pipeline, so there is no need to scale the data again.
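A sketch of the full pipeline; the step name ‘classifier’ is an assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Preprocessing and the model chained into a single object
logreg_pipe = Pipeline([
    ('preprocess', preprocess),
    ('classifier', LogisticRegression())
])
```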

After this I fit the training data to the pipeline.

Fit the training data to the pipeline
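For example:

```python
# Fit preprocessing and the model in one call
logreg_pipe.fit(X_train, y_train)
```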

And finally, I ran a prediction on my test set and generated a recall score. I chose recall over accuracy or precision because recall is concerned with false negatives — that is, how many people were not diagnosed with heart disease who actually had heart disease. While a false positive would be inconvenient and cause emotional distress, a false negative would lead an individual to believe they are healthy and therefore not seek proper medical treatment.

Predict on test set and generate recall score
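A sketch of the scoring step:

```python
from sklearn.metrics import recall_score

# Predict on the held-out test set and score on recall
y_pred = logreg_pipe.predict(X_test)
recall_score(y_test, y_pred)
```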

Pipeline With Grid Search

And now for the second pipeline integrating grid search. The first step is the same as before — create the pipeline.

Create a logistic regression pipeline (same as before)

Next I define my grid search parameters. In this case I have selected max_iter, which tells the model the maximum number of iterations to take for the solvers to converge. I also selected solver, which tells the model which optimization algorithm to use.

Define grid search parameters
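The candidate values below are illustrative; the double-underscore prefix ties each parameter to the ‘classifier’ step of the pipeline:

```python
# Hyperparameter grid for the logistic regression step
grid = {
    'classifier__max_iter': [100, 500, 1000],
    'classifier__solver': ['lbfgs', 'liblinear', 'saga']
}
```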

After this I combined my logistic regression pipeline with my grid search parameters. In addition, I specified how many cross validation folds (cv) I want my model to use. Cross validation repeatedly splits the training data into different training and validation folds, so that the model's performance is not tied to any single split and the evaluation is less prone to variance from one lucky or unlucky split. In this case, I told my model to use 10 folds.

Combine logistic regression pipeline with grid search parameters
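A sketch of this step; scoring on recall is an assumption, since that is the metric used throughout:

```python
from sklearn.model_selection import GridSearchCV

# Wrap the pipeline in a grid search with 10-fold cross validation
grid_search = GridSearchCV(logreg_pipe, param_grid=grid, cv=10, scoring='recall')
```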

Now I can fit my new pipeline with the grid search parameters.

Fit new pipeline with grid search parameters
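Fitting is a single call on the grid search object:

```python
# Runs cross validation for every parameter combination
grid_search.fit(X_train, y_train)
```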

Next I checked which parameters from my grid search worked best on my model. In this case it was a max_iter of 100 and the lbfgs (limited-memory Broyden-Fletcher-Goldfarb-Shanno) algorithm.

Display best parameters
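Checking the winning combination:

```python
# Best parameter combination found during the search
grid_search.best_params_
```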

Again, I ran a prediction on my test set and generated a recall score. In this case, the recall score was the same as without grid search, so in the end, there was no need to use grid search here.

Predict on test set and generate recall score
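Something like:

```python
# Predict with the best estimator found by the grid search
y_pred_grid = grid_search.predict(X_test)
recall_score(y_test, y_pred_grid)
```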

Optional Step

An additional step one could take to further streamline the process would be to create a function to evaluate the model, display a confusion matrix and plot the ROC curve.

Create a function to evaluate the model, display a confusion matrix and plot the ROC curve.
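One way such a function could look; the function name, signature and use of scikit-learn's ConfusionMatrixDisplay and RocCurveDisplay are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (recall_score, roc_auc_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)

def evaluate_model(model, X, y, label='set'):
    """Print recall and AUC, then plot a confusion matrix and ROC curve."""
    y_pred = model.predict(X)
    print(f"Recall ({label}): {recall_score(y, y_pred):.3f}")
    print(f"AUC ({label}): {roc_auc_score(y, model.predict_proba(X)[:, 1]):.3f}")
    ConfusionMatrixDisplay.from_estimator(model, X, y)
    RocCurveDisplay.from_estimator(model, X, y)
    plt.show()
```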

First I passed the training data through the function.

Pass training data through function
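Using the hypothetical evaluate_model function from above with the grid search model:

```python
evaluate_model(grid_search, X_train, y_train, label='train')
```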

Then I passed my test data through the function. Note that the recall score for the test set is only marginally lower compared to the training set and the AUC is the same.

Pass test data through function
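And the same call on the test set:

```python
evaluate_model(grid_search, X_test, y_test, label='test')
```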

And that’s how you create a model pipeline and integrate it with grid search!
