Numba is a widely used library for speeding up computations in Python. It lets us accelerate our code simply by decorating functions with one of the decorators it provides; numba then handles the speed-up without any further work from the developer. Most of the time, numba-decorated functions run considerably faster than plain Python functions. Numba is designed to speed up numpy code as well.
Though numba can speed up numpy code, it does not speed up code involving pandas, the most commonly used data manipulation library, which is built on top of numpy. We have already created a tutorial introducing numba's @jit decorator, where we discussed that numba cannot speed up operations performed on a pandas dataframe.
If you want to check our tutorial on the numba @jit decorator, then please feel free to check it from the below link.
We have created this tutorial to guide developers on how to use numba to speed up code involving pandas dataframes. As a part of this tutorial, we'll explain with examples various ways to speed up pandas operations. There are basically two ways to do so, which we have listed below.
1. Use pandas methods (groupby(), rolling(), etc.) that accept an engine argument and set it to 'numba'.
2. Retrieve numpy arrays from the dataframe and process them with numba-decorated functions.
Below we have highlighted important sections of our tutorial to give an overview of the material covered in this tutorial.
We'll now explain the two ways of speeding up pandas code listed above with simple examples. We have imported the necessary libraries below to start with.
import pandas as pd
print("Pandas Version : {}".format(pd.__version__))
import numpy as np
Below we have created a dataframe with random data that we'll use in our examples. The dataframe has 5 columns of random floats and one column of categorical values.
np.random.seed(123)
data = np.random.rand(int(1e5),5)
df = pd.DataFrame(data=data, columns=list("ABCDE"))
df["Type"] = np.random.choice(["Class1","Class2","Class3","Class4","Class5"], size=(len(df)))
df
In this section, we'll explain pandas dataframe methods that let us use numba for some operations. Pandas generally lets us use numba with methods that work on batches of values, like groupby(), rolling(), etc. These methods group or window entries of the main dataframe and then apply various aggregate functions to those entries. We can instruct them to use numba for performing the aggregate operations by setting the engine argument to 'numba'.
Below we have first created a rolling dataframe with a window size of 1000. We can call various aggregate functions on this rolling dataframe to compute rolling statistics. The majority of functions that we can call on the rolling dataframe accept an engine argument which can be set to 'numba'.
In the next cell below, we have grouped entries of the dataframe based on the Type column. There are two methods (transform() and agg()/aggregate()) that work on grouped dataframes and accept the engine argument.
We'll be using these rolled and grouped dataframes in our examples.
rolling_df = df.rolling(1000)
rolling_df
grouped_by_types = df.groupby("Type")
grouped_by_types
In our first example, we simply call the mean() function on the rolling dataframe to calculate the rolling average. We have called mean() in several ways: without any arguments, with engine set to 'cython', and with engine set to 'numba'.
Cython is a compiled language closely related to Python that is generally faster than the normal Python implementation; pandas' built-in implementations of these functions use cython by default.
When we provide engine='numba', the function uses numba behind the scenes to speed up the operation. It's not guaranteed that engine='numba' will always improve performance, so we need to measure it first.
We are using the jupyter notebook magic command %time to measure the time taken by a particular statement; we'll use it in all our examples. If you are interested in learning about the various magic commands available in jupyter notebook, then please feel free to check our tutorial on the same, which covers the majority of them.
%time out = rolling_df.mean()
%time out = rolling_df.mean(engine='cython')
%time out = rolling_df.mean(engine='numba')
%time out = rolling_df.mean(engine='numba')
In this section, we compute the rolling mean again, but this time through the apply() method with the raw=True argument (aggregation methods like mean() do not accept raw; it belongs to apply()). When raw=True, the applied function receives each window as a numpy array; without it, the values are passed as a pandas Series. We set raw=True because numba works best with functions that operate on numpy arrays; the numba engine in fact requires raw=True for apply().
def window_mean(x):
    return x.mean()
%time out = rolling_df.apply(window_mean, raw=True)
%time out = rolling_df.apply(window_mean, engine='cython', raw=True)
%time out = rolling_df.apply(window_mean, engine='numba', raw=True)
%time out = rolling_df.apply(window_mean, engine='numba', raw=True)
In the below cell, we have called the std() function on our rolling dataframe to calculate the rolling standard deviation, measuring the time taken by each call. We can notice that numba seems to do a bit better this time, though the difference is not dramatic.
%time out = rolling_df.std()
%time out = rolling_df.std(engine='cython')
%time out = rolling_df.std(engine='numba')
%time out = rolling_df.std(engine='numba')
The methods that accept engine='numba' also let us pass an engine_kwargs dictionary whose keys mirror common arguments of the numba @jit decorator: nopython, nogil, and parallel.
In the below cell, we have calculated the standard deviation on our rolling dataframe, providing different engine_kwargs to the numba engine. We can notice that the numba engine seems to do better this time compared to the plain call.
If you want to know in detail about these numba @jit decorator arguments, then please feel free to check our tutorial on it, which covers all arguments in detail with examples.
%time out = rolling_df.std()
%time out = rolling_df.std(engine='cython')
%time out = rolling_df.std(engine='numba', engine_kwargs={'nopython': True})
%time out = rolling_df.std(engine='numba', engine_kwargs={'nopython': True, 'nogil': True})
%time out = rolling_df.std(engine='numba', engine_kwargs={'nopython': True, 'nogil': True, 'parallel': True})
We can also provide custom user-defined functions to perform aggregations that are not available out of the box in pandas.
Below we have created a simple custom function that takes an array as input, squares its values, and then takes the mean of the squared values. We'll use this function as an aggregate function on the rolling dataframe.
def custom_mean(x):
    return (x * x).mean()
In the below cell, we have called the apply() function on our rolling dataframe, asking it to execute the custom mean function we defined in the previous cell. Like our previous examples, we have tried the function without any backend engine, with the cython engine, and with the numba engine.
We can notice from the results that numba does a little better than the other backend engines.
%time out = rolling_df.apply(custom_mean, raw=True)
%time out = rolling_df.apply(custom_mean, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean, engine='numba', raw=True)
%time out = rolling_df.apply(custom_mean, engine='numba', raw=True)
In the below cell, we have created a custom standard deviation function that squares the input array and then calculates the standard deviation of the squared values.
We have then tried this function on our rolling dataframe using the apply() function, recording the time taken by each call for comparison purposes.
def custom_std(x):
    return (x * x).std()
%time out = rolling_df.apply(custom_std, raw=True)
%time out = rolling_df.apply(custom_std, engine='cython', raw=True)
%time out = rolling_df.apply(custom_std, engine='numba', raw=True)
%time out = rolling_df.apply(custom_std, engine='numba', raw=True)
In the below cell, we have created a function that takes as input a group's values and index and calculates the mean of the values. With engine='numba', pandas passes each group's values and index to the function as numpy arrays.
We'll use this function on our grouped dataframe to calculate the mean of grouped entries, comparing the time taken by different engines as usual.
Please make a NOTE that currently only the transform() and agg()/aggregate() functions support the engine argument which can be set to 'numba'. The agg() and aggregate() methods are aliases that perform the same function.
def func(values, index):
    return values.mean()
%time out = grouped_by_types.agg('mean')
%time out = grouped_by_types.agg('mean', engine='cython')
%time out = grouped_by_types.agg(func, engine='numba')
%time out = grouped_by_types.agg(func, engine='numba')
In this section, we'll create @jit-decorated functions to work on our pandas dataframe and compare their performance with non-decorated functions. We'll also try to create functions that replace aggregate functions already provided by the pandas dataframe. Apart from @jit, we'll also use the @vectorize decorator to speed up operations.
As we said earlier, we'll retrieve numpy arrays from our pandas dataframe before giving them to numba functions because numba works well with numpy arrays and python loops.
Please make a NOTE that the difference in performance of numba functions might not be visible with small arrays; it becomes visible as the array size increases. A numba function is also compiled the first time it runs, hence the first execution can take more time, but all subsequent executions are quite fast as they reuse the compiled version from memory.
Below we have again created rolled dataframe and grouped dataframe like our previous section. We'll be trying various numba functions on them this time.
rolling_df = df.rolling(1000)
grouped_by_types = df.groupby("Type")
As part of our first example, we have created two functions that perform the same operation on the input array, one of them decorated with numba's @jit decorator. The functions take an array as input, square its values, and then calculate the mean of the squared values.
Then in the next cell, we have tried these functions on our rolling dataframe using the apply() function. We have called apply() several times with different backend engines (None, cython, and numba), like our previous examples, and recorded the time taken by each execution.
We can notice from the results that the @jit-decorated function takes less time compared to the normal non-decorated function.
from numba import jit

def custom_mean(x):
    return (x * x).mean()

@jit(cache=True)
def custom_mean_jitted(x):
    return (x * x).mean()
%time out = rolling_df.apply(custom_mean, raw=True)
%time out = rolling_df.apply(custom_mean_jitted, raw=True)
%time out = rolling_df.apply(custom_mean, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean_jitted, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean, engine='numba', raw=True)
%time out = rolling_df.apply(custom_mean_jitted, engine='numba', raw=True)
Our code for this example is almost the same as the previous example, with one minor change: we use the @njit decorator instead of @jit. The @njit decorator compiles the function in numba's pure nopython mode, which is generally faster. We can force nopython mode with @jit as well by providing the nopython=True argument.
If you want to know about numba nopython mode then please feel free to check our tutorial that covers it.
We have then executed these functions on our rolled dataframe using apply() function with different backends for comparison purposes.
from numba import njit

def custom_mean(x):
    return (x * x).mean()

@njit(cache=True)
def custom_mean_jitted(x):
    return (x * x).mean()
%time out = rolling_df.apply(custom_mean, raw=True)
%time out = rolling_df.apply(custom_mean_jitted, raw=True)
%time out = rolling_df.apply(custom_mean, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean_jitted, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean, engine='numba', raw=True)
%time out = rolling_df.apply(custom_mean_jitted, engine='numba', raw=True)
We can further speed up our @jit-decorated functions by providing input and output data types. Numba then eagerly compiles a specialized version for those data types, which can improve performance. Below we have specified float64 as the input and output data type of our function.
We have then called these functions on our rolling dataframe using the apply() method with different backend engines to compare performance.
from numba import jit, float64

def custom_mean(x):
    return (x * x).mean()

@jit(float64(float64[:]), nopython=True, cache=True)
def custom_mean_jitted(x):
    return (x * x).mean()
%time out = rolling_df.apply(custom_mean, raw=True)
%time out = rolling_df.apply(custom_mean_jitted, raw=True)
%time out = rolling_df.apply(custom_mean, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean_jitted, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean, engine='numba', raw=True)
%time out = rolling_df.apply(custom_mean_jitted, engine='numba', raw=True)
As numba works really well with python loops, we can also rewrite our function using an explicit loop. In this example, we have modified our @jit-decorated function to calculate the mean of squared values in a loop.
We have then executed these functions on our rolling dataframe with different backend engines to compare performance. We can notice that this version seems to do a little better compared to our previous examples.
from numba import jit, float64

def custom_mean(x):
    return (x * x).mean()

@jit(float64(float64[:]), nopython=True, cache=True)
def custom_mean_loops_jitted(x):
    out = 0.0
    for i in x:
        out += i * i
    return out / len(x)
%time out = rolling_df.apply(custom_mean, raw=True)
%time out = rolling_df.apply(custom_mean_loops_jitted, raw=True)
%time out = rolling_df.apply(custom_mean, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean_loops_jitted, engine='cython', raw=True)
%time out = rolling_df.apply(custom_mean, engine='numba', raw=True)
%time out = rolling_df.apply(custom_mean_loops_jitted, engine='numba', raw=True)
In this example, we'll create a custom @jit-decorated function to replace the existing mean() function available on the pandas dataframe.
Below we have first calculated the mean of the 5 numeric columns of the dataframe using the built-in mean() function and recorded the time taken for the operation.
%time out = df[list("ABCDE")].mean()
In the below cell, we have defined a function that takes a numpy array as input and calculates its mean. We have decorated the function with @jit, specified input/output data types, and provided the nopython=True argument to run numba in strict nopython mode.
from numba import jit, float32, float64

@jit([float32(float32[:]), float64(float64[:])], nopython=True, cache=True)
def custom_mean(x):
    return x.mean()
In the below cell, we have looped through the column names of the pandas dataframe and calculated the mean of each column using our custom mean function, recording the time taken to compute the mean of all columns. We can notice that it takes a little less time compared to pandas' built-in function; we expect this difference to grow with the size of the array and the number of columns.
Please make a NOTE that the difference in performance becomes more visible as the array size increases beyond ~1M values.
%%time
avg_cols = {}
for col in list("ABCDE"):
    avg_cols[col] = custom_mean(df[col].values)
In this section, we'll explain another example where we use the @vectorize decorator to replace an existing pandas operation.
Below we have taken a column of our pandas dataframe, squared its values, and then added the scalar value 2 to it. We have performed this operation by providing a simple function to the apply() method and recorded the time taken.
In the next cell below, we have performed the same computation on the column's values using plain numpy arithmetic, recording the time taken as well.
%time out = df.A.apply(lambda x : x**2 + 2)
%time out = (df.A.values * df.A.values) + 2
In the below cell, we have created a simple function that takes a single value as input, squares it, and adds the scalar value 2 to it. We have then vectorized this function using numba's @vectorize decorator. We'll use this function to perform the same operation we performed with pandas' built-in methods in the previous cells.
If you want to know how the numba @vectorize decorator works, then please feel free to check our tutorial on it from the below link.
from numba import vectorize, float32, float64

@vectorize([float32(float32), float64(float64)])
def square(x):
    return x**2 + 2
In the below cell, we have called our vectorized function 3 times on the values of a column of our dataframe, recording the time taken each time. We can notice that our vectorized function takes noticeably less time compared to the pandas built-in approach.
Please make a NOTE that the difference in performance becomes more visible as the array size increases beyond ~1M values.
%time out = square(df["A"].values)
%time out = square(df["A"].values)
%time out = square(df["A"].values)
This ends our small tutorial explaining how to use numba with pandas dataframes to speed up code involving dataframes. Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.