Using NumPy Vectorization For Looping
As a baby data science student, I’ve had to write a lot of for loops. The problem is, I hate writing for loops. I hate the syntax, I hate that they span multiple lines, and I hate loop variables. One alternative is Python’s much-vaunted list comprehension, but those can be tricky sometimes too. I thought I was doomed to write for loops for the rest of my life until I learned about vectorization. This method turned out to be the solution to all my woes.
The main reason I like vectorization is that it’s easy to use. The syntax is clear and easy to pick up. It lines up better with my thought process and feels more natural. I’ll admit, I’m not the world’s greatest coder yet, so I welcome anything that simplifies the process.

Vectorization doesn’t just make things easier; it also speeds things up considerably, often by orders of magnitude. Something that takes 15 seconds in a for loop might take 15 milliseconds with vectorization. The speed gains seem like a joke at first. I’m glad I now have this powerful tool at my disposal for when I have to work on large datasets.
Vectorization works with pandas and NumPy, which makes sense: like these two libraries, it makes your code simpler. I remember how excited I was when we first started coding in pandas. Suddenly I could produce charts or perform arithmetic on columns in one line of code. What I didn’t know at the time was that common pandas tools like groupby, filter and pd.to_datetime() rely on vectorized operations under the hood.
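To make the "one line of code" point concrete, here is a minimal sketch of column arithmetic and pd.to_datetime() on a toy dataframe. The data and the column names (price, sqft, date_sold, price_per_sqft) are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are invented for this example.
df = pd.DataFrame({
    "price": [300000, 450000, 520000],
    "sqft": [1500, 1800, 2600],
    "date_sold": ["2014-05-02", "2014-06-23", "2015-01-15"],
})

# Arithmetic on two whole columns in one line, with no explicit loop.
df["price_per_sqft"] = df["price"] / df["sqft"]

# Convert an entire column of strings to datetimes in one call.
df["date_sold"] = pd.to_datetime(df["date_sold"])

print(df)
```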
In base Python, the time a for loop takes is roughly proportional to the number of elements you’re iterating over multiplied by the time each operation takes per element. Instead of working on each element separately, one at a time, vectorization operates on an entire array or Series at once using NumPy, which pushes the calculations from Python down to compiled C code, which in turn is able to perform operations on the array’s data type at much higher speeds. This is a win for those who are new to coding because we don’t have to learn C!
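If you want to see the difference for yourself, a rough sketch like the one below compares a plain for loop with the equivalent NumPy expression. The function names and the array size are my own choices for illustration, and the actual timings will vary from machine to machine, so treat the numbers you get as a ballpark rather than a benchmark.

```python
import timeit

import numpy as np

# Illustrative input: 200,000 floats (the size is an arbitrary choice).
values = np.arange(200_000, dtype=np.float64)


def double_with_loop(arr):
    """Double every element one at a time in a Python for loop."""
    out = []
    for x in arr:
        out.append(x * 2)
    return out


def double_vectorized(arr):
    """Double every element with a single NumPy array operation."""
    return arr * 2


# Time one run of each approach.
loop_seconds = timeit.timeit(lambda: double_with_loop(values), number=1)
vec_seconds = timeit.timeit(lambda: double_vectorized(values), number=1)

print(f"for loop:   {loop_seconds:.4f} s")
print(f"vectorized: {vec_seconds:.4f} s")
```

Both functions produce the same numbers; only the place where the looping happens changes.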
Since I learned how to replace for loops with vectorization, I have been using it almost exclusively. Specifically, I’ve been using the numpy.where() function, which allows you to vectorize an if/else statement. Here’s a rundown of the syntax.


Basically there are three components. First you enter your conditional statement. Second, you enter what you want to happen if the condition is true. Lastly, you enter what you want to happen if the condition is false.
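The three components described above map onto the three arguments of numpy.where(). Here is a small sketch using a made-up array of test scores; the data and the "pass"/"fail" labels are just for illustration.

```python
import numpy as np

# Hypothetical example data.
scores = np.array([45, 82, 67, 90, 58])

# np.where(condition, value_if_true, value_if_false)
labels = np.where(scores >= 60, "pass", "fail")

print(labels)  # ['fail' 'pass' 'pass' 'pass' 'fail']
```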
Recently, I used this method to clean up a dataset from Kaggle on housing prices in King County, Washington. There were two columns, one containing the year a house was built, and one containing the year a house was renovated. I wanted to calculate the amount of time that had passed between a house being built and being renovated. The problem was that if the house was not renovated, the year was listed as zero, which ruined my calculation.

In order to get around this, I wanted to change all of the zeros in the years renovated column to match the values in the year built column. I was able to accomplish this with the simple line of code below.
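A minimal sketch of what that line can look like is shown here on a tiny made-up dataframe. The rows are invented, and the yr_built column name is my assumption about how the year-built column is labeled; only yr_renovated and the zero-means-not-renovated convention come from the description above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not the real Kaggle data.
df = pd.DataFrame({
    "yr_built": [1955, 1987, 2001, 1930],
    "yr_renovated": [0, 2005, 0, 1991],
})

# Where yr_renovated is 0, use yr_built; otherwise keep yr_renovated.
renovated_fixed = np.where(
    df["yr_renovated"] == 0, df["yr_built"], df["yr_renovated"]
)

print(renovated_fixed)  # [1955 2005 2001 1991]
```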

After running this line, I updated the ‘yr_renovated’ column in the dataframe.
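Writing the result back is a plain column assignment. The sketch below repeats the same made-up rows so it can run on its own, and it also shows the follow-up subtraction; the yr_built and yrs_to_renovation names are hypothetical choices for this example.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not the real Kaggle data.
df = pd.DataFrame({
    "yr_built": [1955, 1987, 2001, 1930],
    "yr_renovated": [0, 2005, 0, 1991],
})

# Overwrite the column with the cleaned values.
df["yr_renovated"] = np.where(
    df["yr_renovated"] == 0, df["yr_built"], df["yr_renovated"]
)

# Years between construction and renovation (0 if never renovated).
df["yrs_to_renovation"] = df["yr_renovated"] - df["yr_built"]

print(df)
```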

After updating the column, my dataframe looked like this.

After doing this, I was able to perform my calculations. Now, were these calculations useful to me? Not particularly. But I was able to do them, gosh darn it!