Updated On: Dec-07, 2021  |  Time Investment: ~20 mins

Numba: Make Your Python Code Run Almost as Fast as C/C++

Python is an interpreted language, hence it is slow compared to compiled languages like C/C++. Because of this, Python was historically avoided for performance-intensive applications. The Numba library was developed to address this problem. Numba is a Just-In-Time (JIT) compiler for Python that can speed up some or all of your Python code by converting it to low-level machine instructions. It uses the LLVM library to perform this conversion. Code compiled with Numba can run almost as fast as equivalent C/C++ code.

The process of using Numba to speed up code is quite simple. Numba provides a set of decorators that we can apply to our functions. A decorated function is compiled to faster machine code the first time it is called, so that first call takes a little extra time. Every subsequent call uses the already compiled version and is therefore much faster.

Numba reads the Python bytecode of the decorated function, converts its input arguments and other data used inside the function to Numba data types, optimizes various parts, and translates the result to machine code using the LLVM library. If a function is designed to work with various data types (a generic function), Numba will compile it again each time it is called with a data type it has not seen before, because a separate compiled version is created for each combination of data types.
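
As a quick illustration of this per-data-type compilation (the add_one() function below is a hypothetical example, not part of the tutorial's main code), a jitted function exposes a signatures attribute that grows by one entry for every new data type it has been compiled for.

import numpy as np
from numba import jit

@jit
def add_one(x):
    return x + 1

add_one(np.int64(1))      # first call: compiles an int64 version
add_one(np.float64(1.0))  # new data type: compiles a separate float64 version

# one compiled version per data type seen so far, e.g. [(int64,), (float64,)]
print(add_one.signatures)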

Please make a NOTE that Numba can only translate a certain subset of Python code, mainly loops and code involving numpy, to faster machine code. Not everything will run faster with Numba. One needs basic knowledge of what can be accelerated and what cannot in order to use Numba efficiently. This tutorial will help you understand how to use Numba better in various situations.

As a part of this tutorial, we'll cover how to speed up Python functions using Numba. We'll explain the @jit and @njit decorators available from Numba. Below we have highlighted the important sections of the tutorial to give an overview of the material covered.

Important Sections of Tutorial

  • Installing Numba
  • Example 1: Introducing @jit (Object Mode)
  • Example 2: @jit & @njit (Strict nopython Mode)
  • Example 3: Specify DataType in Signatures
  • Example 4: Cache Compiled Code to Speed Up Frequent Runs
  • Example 5: Parallelize Code for Multi-Core CPU
  • Example 6: "fastmath" for Faster Mathematical Operations
  • Example 7: Release Python GIL during Multi-Threading on Multi-Core CPU
  • Example 8: Numba does not Improve Pandas Code

Installing Numba

We can easily install Numba using pip or conda.

  • pip install numba
  • conda install numba

Below we have imported Numba and printed the version of it that we have used in this tutorial.

import numba

print("Numba Version : {}".format(numba.__version__))
Numba Version : 0.54.1
import numpy as np

import pandas as pd

Example 1: Introducing @jit (Object Mode)

In this example, we'll introduce the first decorator available from Numba, named @jit, to speed up our Python functions. We can decorate any Python function with the @jit decorator, and it will try to speed that function up.

The @jit decorator compiles the code of the function it decorates. It generally works in one of the two modes mentioned below.

  1. object mode - The @jit decorator tries to convert the whole Python function to low-level machine code, but if it fails to convert everything, it still converts the parts it can handle (such as loops), while the rest keeps running as normal Python. By default, @jit works in object mode, converting some parts of the code or the whole code to low-level machine code.
  2. nopython mode - The @jit decorator can also be used in nopython mode, which strictly tries to convert the entire Python function to low-level machine code. We can enable nopython mode by setting the nopython argument of the decorator to True. If the whole function cannot be compiled to low-level code, compilation fails with an error. This mode is generally preferred over object mode and is considerably faster whenever it can be used.

Users can first test whether their function runs in nopython mode; if it works, use that mode, otherwise fall back to object mode. If you know that your whole function can be converted by Numba, then nopython mode is preferred. If your function is designed so that only some parts can be converted by Numba while the rest must run in pure Python, then object mode is preferred.

When testing the Numba @jit decorator, if it does not seem to improve performance, it's better to remove it, fall back to pure Python, and look for other ways to improve performance. Using @jit on functions that Numba cannot convert can even worsen performance: the first call still pays the compilation cost, yet there is no speed-up afterward, so that one-time compilation simply adds overhead.

Please make a NOTE that Numba generally does not speed up code involving list comprehensions; it is suggested to rewrite such functions using explicit loops for faster performance, as sketched below.
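
As a hypothetical sketch of that suggestion (these two small functions are illustrations only, not part of the tutorial's main examples), the comprehension-based version below typically gains little from Numba, while the explicit loop version is the kind of code Numba compiles well.

import numpy as np
from numba import jit

def squares_comprehension(x):
    # list comprehension: Numba generally cannot speed this form up much
    return [elem ** 2 for elem in x]

@jit(nopython=True)
def squares_loop(x):
    # same computation rewritten as an explicit loop over a numpy array
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] ** 2
    return out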

In this section, we have created two small examples to explain the usage of @jit decorator in object mode.

1.1 Example 1

In our first example, we have created a simple function that takes an array of arbitrary size as input and applies a cube formula to each individual element of the array. The perform_operation() function takes an array as input and executes the cube_formula() function on each element, recording the results.

def cube_formula(x):
    # simple polynomial applied to a single element
    return x**3 + 3*x**2 + 3

def perform_operation(x):
    out = np.empty_like(x)  # output array with same shape/dtype as input
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

After defining the functions, we have executed our main function with two different arrays of numbers, the first consisting of 1M numbers and the second of 10M numbers. We have also recorded the time taken, as we'll compare it against the @jit-decorated functions. We have used the jupyter magic command %time to measure the execution time of a particular statement.

Please make a NOTE that the speed-up provided by Numba @jit-decorated functions will differ between computers, as it depends on the low-level machine instructions available to the LLVM compiler on a particular machine.

If you are interested in learning about magic commands (like %time which we have used in this tutorial) available in jupyter notebook then please feel free to check our tutorial on the same. It covers the majority of jupyter notebook magic commands.

%time out = perform_operation(np.arange(1e6))
CPU times: user 882 ms, sys: 4.24 ms, total: 886 ms
Wall time: 886 ms
%time out = perform_operation(np.arange(1e7))
CPU times: user 7.67 s, sys: 23.1 ms, total: 7.7 s
Wall time: 7.7 s

Below we have re-defined both of our functions, but this time decorated them with the @jit decorator.

from numba import jit

@jit
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

Below we have executed the jit-decorated function with the same two arrays which we had used when testing the plain function.

We can notice from the recorded times that both calls take far less time than they did without @jit. The @jit decorator has improved the performance by quite a big margin.

%time out = perform_operation_jitted(np.arange(1e6))
CPU times: user 127 ms, sys: 146 µs, total: 127 ms
Wall time: 125 ms
%time out = perform_operation_jitted(np.arange(1e7))
CPU times: user 43 ms, sys: 32 ms, total: 75 ms
Wall time: 74.2 ms

1.2 Example 2

In this section, we have defined one more function to explain the usage of the @jit decorator. The function simply executes a loop inside a loop and records the index pair of every combination. The outer loop executes 10,000 times and the inner loop executes 1,000 times.

After defining the function, we have executed it 3 times and recorded the time taken by it each time for comparison purposes later.

def calculate_all_permutations():
    perms = []
    for i in range(int(1e4)):
        for j in range(int(1e3)):
            perms.append((i,j))

    return perms
%time perms = calculate_all_permutations()

%time perms = calculate_all_permutations()

%time perms = calculate_all_permutations()
CPU times: user 1.03 s, sys: 260 ms, total: 1.29 s
Wall time: 1.3 s
CPU times: user 1.07 s, sys: 208 ms, total: 1.28 s
Wall time: 1.28 s
CPU times: user 1.12 s, sys: 232 ms, total: 1.35 s
Wall time: 1.35 s

Now, we have defined our function again but this time decorated it with the @jit decorator. We have then rerun this @jit-decorated function 3 times and recorded the time taken by it. We can notice from the results that it takes much less time compared to the normal function. Also, subsequent calls to the @jit-decorated function take even less time because they use the already compiled version.

@jit
def calculate_all_permutations():
    perms = []
    for i in range(int(1e4)):
        for j in range(int(1e3)):
            perms.append((i,j))

    return perms
%time perms = calculate_all_permutations()

%time perms = calculate_all_permutations()

%time perms = calculate_all_permutations()
CPU times: user 115 ms, sys: 76.1 ms, total: 191 ms
Wall time: 193 ms
CPU times: user 53.3 ms, sys: 52 ms, total: 105 ms
Wall time: 104 ms
CPU times: user 39.1 ms, sys: 60.1 ms, total: 99.2 ms
Wall time: 99 ms

Example 2 @jit & @njit (Strict nopython Mode)

In this section, we have run our examples in nopython mode of Numba @jit decorator. There are two ways in which we can force nopython mode.

  1. @jit(nopython=True)
  2. @njit

We'll be using both in our examples.

2.1 Example 1

In this section, we have redefined our functions and decorated them with the @jit decorator. But this time, we have set the nopython argument of @jit (False by default) to True. This forces Numba to run in strict nopython mode and convert all of the function's code to low-level machine code. This mode is generally preferred as it is faster than object mode.

Our current functions are designed in a way that they can be totally converted to low-level machine code using Numba.

If you use the @jit decorator in nopython mode, Numba will try to compile your function when it is first called and, if it cannot convert some part of it, compilation will fail with an error. If your function fails to compile in nopython mode, it's advisable to either use object mode or split the function into smaller functions and use nopython mode on the sub-parts wherever possible.
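
As a hypothetical illustration of such a failure (the to_series() function below is not one of the tutorial's main examples), the sketch below decorates a function that creates a pandas object, which nopython mode cannot compile, and catches the resulting error.

import numpy as np
import pandas as pd
from numba import jit

@jit(nopython=True)
def to_series(x):
    # pandas objects are not supported in nopython mode, so compilation fails
    return pd.Series(x)

try:
    to_series(np.arange(5))
except Exception as err:  # Numba raises a TypingError at the first call
    print("nopython compilation failed:", type(err).__name__)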

from numba import jit

@jit(nopython=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit(nopython=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

Below we have executed our function two times, once with an array of size 1M and once with an array of size 10M. We have also recorded the time taken by both. We can notice from the results that it takes almost the same time as our previous object mode runs. Though it does not seem to improve the performance of these particular functions much further, nopython mode is generally the preferred mode whenever possible as it can speed code up more.

%time out = perform_operation_jitted(np.arange(1e6))
CPU times: user 174 ms, sys: 7.91 ms, total: 182 ms
Wall time: 181 ms
%time out = perform_operation_jitted(np.arange(1e7))
CPU times: user 39.3 ms, sys: 48 ms, total: 87.3 ms
Wall time: 86 ms

Below we have introduced another way of using nopython mode: decorating our functions with the @njit decorator. We have then run our @njit-decorated function two times, once with an array of size 1M and once with an array of size 10M. We can notice from the results that the time taken is almost the same as when using nopython=True inside the @jit decorator.

from numba import njit

@njit
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@njit
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out
%time out = perform_operation_jitted(np.arange(1e6))
CPU times: user 142 ms, sys: 3.92 ms, total: 146 ms
Wall time: 144 ms
%time out = perform_operation_jitted(np.arange(1e7))
CPU times: user 33.5 ms, sys: 39.7 ms, total: 73.3 ms
Wall time: 71.9 ms

2.2 Example 2

In this section, we have @njit-decorated our second example, which we had run earlier in the object mode section. We have then executed the function three times to check performance. We can notice from the results that the time taken is almost the same or slightly better compared to the object mode runs.

@njit
def calculate_all_permutations():
    perms = []
    for i in range(int(1e4)):
        for j in range(int(1e3)):
            perms.append((i,j))

    return perms
%time perms = calculate_all_permutations()

%time perms = calculate_all_permutations()

%time perms = calculate_all_permutations()
CPU times: user 129 ms, sys: 52.2 ms, total: 182 ms
Wall time: 180 ms
CPU times: user 38.4 ms, sys: 60.1 ms, total: 98.5 ms
Wall time: 98.1 ms
CPU times: user 35.5 ms, sys: 60.1 ms, total: 95.6 ms
Wall time: 95.2 ms

Example 3: Specify DataType in Signatures

When Numba compiles the code, it internally creates a version for each data type combination with which a function is run. Each time a @jit-decorated function is run with a new data type, Numba needs to compile the function first for that new data type and keep that version for future use. All subsequent calls with this recorded data type will be faster.

We can also explicitly specify the input and output data types of our @jit-decorated function. This creates a compiled version for the specified data types when the function is defined, not when the function is first called with those data types.

We can provide data types as the first argument of the decorator, using the format ret_type(param1_type, param2_type, ...). The input parameter data types are specified inside the parentheses and the return type is specified outside, at the beginning. The data types that we use in the @jit decorator need to be imported from Numba. If an input or output value is an array, we represent it using the string '[:]' appended to the data type (e.g., int64[:]).

Please make a NOTE that when declaring functions with explicit data types, Numba will only allow us to execute the functions with those specified data types. Calls with any other data type will fail.

Below we have redefined the functions we have been using for the last few examples, but this time we have provided input/output data types as well. We have declared both functions with the int64 data type for input and output. This creates the compiled version for this data type as soon as we execute the cell below. Now, when we call these functions with int64 data, no compilation is needed and they run immediately.

As we have declared our functions with integer input/output data types, calling the functions below with float data will fail.

from numba import jit, int64, float32, float64

@jit(int64(int64), nopython=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit(int64[:](int64[:]), nopython=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

Below we have run our jit-decorated functions with declared data types, first with an array of 1M int64 numbers and then with an array of 10M int64 numbers. We have also recorded the time taken by both. We can notice from the times that the 1M run has improved further compared to our previous versions.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
CPU times: user 17 ms, sys: 12.1 ms, total: 29.1 ms
Wall time: 27.4 ms
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.int64))
CPU times: user 81.9 ms, sys: 72 ms, total: 154 ms
Wall time: 152 ms

If our function needs to work with several data types, we can specify more than one signature inside the @jit decorator by providing the signatures as a list.

Below we have specified two different data type signatures for our functions. Numba will internally create compiled versions for both. Our functions can now run with these two data types; calls with any other data type will fail.

from numba import jit, int64, float32, float64

@jit([int64(int64), float64(float64)], nopython=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

Below we have executed our functions with arrays of sizes 1M and 10M respectively, first with the int64 data type and then with float64, recording the time taken for each run. We can notice from the times that the 1M runs have improved quite a lot compared to our examples where we had not declared data types.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
CPU times: user 5.23 ms, sys: 8.08 ms, total: 13.3 ms
Wall time: 12.9 ms
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
CPU times: user 13.3 ms, sys: 7.86 ms, total: 21.1 ms
Wall time: 19.2 ms
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.int64))
CPU times: user 86.6 ms, sys: 88 ms, total: 175 ms
Wall time: 174 ms
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
CPU times: user 90.9 ms, sys: 59.6 ms, total: 151 ms
Wall time: 146 ms

Example 4: Cache Compiled Code to Speed Up Frequent Runs

When we call a @jit-decorated function with a particular data type, Numba compiles machine code for it, and this compilation takes time. We can avoid paying that compilation cost again in future runs of the program by setting the cache argument of the @jit decorator to True.

Numba will internally use a file-based cache on disk to store the compiled versions of functions.

Below we have re-defined our functions with cache argument set to True.

from numba import jit, int32, int64, float32, float64

@jit([int32(int32), int64(int64), float64(float64)], nopython=True, cache=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

Below we have executed our functions three times using the same array of 1M integer numbers and recorded the time taken. We can notice that these times are the lowest of all our attempts so far.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
CPU times: user 7.02 ms, sys: 200 µs, total: 7.22 ms
Wall time: 5.7 ms
CPU times: user 2.23 ms, sys: 3.63 ms, total: 5.86 ms
Wall time: 5.28 ms
CPU times: user 4.19 ms, sys: 12 µs, total: 4.21 ms
Wall time: 3.97 ms

Below we have executed our functions three times using the same array of 1M float numbers and recorded the time taken. The times are again the lowest of all our attempts so far.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
CPU times: user 7.41 ms, sys: 150 µs, total: 7.56 ms
Wall time: 5.5 ms
CPU times: user 6.05 ms, sys: 0 ns, total: 6.05 ms
Wall time: 5.16 ms
CPU times: user 4.85 ms, sys: 4 µs, total: 4.86 ms
Wall time: 4.27 ms

Below we have executed our functions three times using the same array of 10M float numbers and recorded the time taken. The times are the lowest of all our runs of these functions so far.

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
CPU times: user 39.5 ms, sys: 36.1 ms, total: 75.6 ms
Wall time: 76.8 ms
CPU times: user 27.5 ms, sys: 32.1 ms, total: 59.6 ms
Wall time: 58.8 ms
CPU times: user 18.6 ms, sys: 27.7 ms, total: 46.3 ms
Wall time: 46.1 ms

Example 5: Parallelize Code for Multi-Core CPU (Uses Multi-Threading to Parallelize)

Numba can also parallelize our code on multi-core CPUs. It uses multi-threading to speed up code by running threads on different CPU cores in parallel. In order to parallelize code, we need to set the parallel parameter of the @jit decorator to True. There are two types of parallelization available in Numba:

  1. Automatic Parallelization - When we decorate our function with @jit(parallel=True), Numba will try to run the function in parallel if possible, else it'll run it normally.
  2. Explicit Parallel Loops - We can explicitly ask Numba to run a loop in parallel by using the prange() function available from Numba, which forces Numba to parallelize that loop (see the short sketch after this list).
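
Here is a small hypothetical sketch of an explicit prange() loop (separate from the perform_operation_jitted() example used below): Numba also recognizes simple accumulations inside a prange() loop and turns them into parallel reductions.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum(x):
    total = 0.0
    for i in prange(x.shape[0]):  # iterations are distributed across CPU cores
        total += x[i]             # Numba treats this accumulation as a parallel reduction
    return total

parallel_sum(np.arange(1e6))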

In our example below, we'll use explicit parallelization via the prange() function.

Please make a NOTE that Python's Global Interpreter Lock (GIL) can prevent multi-threading from providing a speed-up. We'll explain in an upcoming example how to release the GIL and get around this problem.

Below we have re-defined our functions and set the parallel parameter to True inside the @jit decorator. We have also modified the logic of perform_operation_jitted() to use the prange() function: the index returned by prange() is used to index into the array and retrieve each element.

from numba import jit, int64, float32, float64, prange

@jit([int64(int64), float64(float64)], nopython=True, cache=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True, parallel=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i in prange(len(x)):
        res = cube_formula(x[i])
        out[i] = res
    return out

Now, we have run our parallelized function with arrays of size 1M and 10M to test performance, recording the time taken. We have first used an array of 1M integers, then an array of 1M floats, and at last an array of 10M floats.

We can notice from the time taken by these executions that performance has improved compared to the non-parallelized versions.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
CPU times: user 2.72 ms, sys: 3.71 ms, total: 6.43 ms
Wall time: 4.13 ms
CPU times: user 62.2 ms, sys: 0 ns, total: 62.2 ms
Wall time: 24.9 ms
CPU times: user 21.7 ms, sys: 0 ns, total: 21.7 ms
Wall time: 9.51 ms
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
CPU times: user 28.7 ms, sys: 0 ns, total: 28.7 ms
Wall time: 17.8 ms
CPU times: user 40.5 ms, sys: 0 ns, total: 40.5 ms
Wall time: 13.9 ms
CPU times: user 11.4 ms, sys: 0 ns, total: 11.4 ms
Wall time: 3.96 ms
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
CPU times: user 48.1 ms, sys: 55.4 ms, total: 104 ms
Wall time: 56.3 ms
CPU times: user 91.5 ms, sys: 52.1 ms, total: 144 ms
Wall time: 51.9 ms
CPU times: user 80.9 ms, sys: 40.1 ms, total: 121 ms
Wall time: 33.2 ms

Example 6: "fastmath" for Faster Mathematical Operations

Numba can provide some additional performance in certain situations when the fastmath parameter of the @jit decorator is set to True. The fastmath option relaxes some strict numerical (IEEE 754) rules and allows approximate, reordered arithmetic and mathematical functions. If Intel's short vector math library (SVML) is installed on the system, Numba can also utilize it to improve performance when fastmath is set to True.
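
As a small hypothetical sketch of the kind of code that benefits (not one of this tutorial's main examples), the reduction below sums square roots; with fastmath=True, LLVM is allowed to reorder and vectorize the accumulation, which strict IEEE 754 semantics would otherwise forbid.

import numpy as np
from numba import njit

@njit(fastmath=True)
def fast_sum_sqrt(x):
    acc = 0.0
    for i in range(x.shape[0]):
        acc += np.sqrt(x[i])  # fastmath permits reassociating this sum for SIMD
    return acc

fast_sum_sqrt(np.arange(1e6))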

How to Install Intel SVML

We can install Intel's SVML library using the below conda command. Please see this link for more details on SVML.

  • conda install -c numba icc_rt

In this section, we have first used fastmath on its own and then along with the parallel argument of the @jit decorator.

6.1 Only "fastmath"

In this section, we have first re-defined our functions and decorated them with @jit decorator. We have set fastmath parameter to True along with nopython and cache parameters. We have also provided data types for inputs/outputs of functions.

from numba import jit, int64, float32, float64, prange

@jit([int64(int64), float64(float64)], nopython=True, cache=True, fastmath=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True, fastmath=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

Below we have tested our @jit-decorated functions with fastmath enabled, three times each using different inputs.

First, we have executed them with 1M integers three times, followed by 1M floats three times, and at last 10M floats three times. We can notice from the recorded times that fastmath seems to have improved performance a little for the 1M arrays.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
CPU times: user 13.7 ms, sys: 12.1 ms, total: 25.8 ms
Wall time: 24 ms
CPU times: user 7.21 ms, sys: 0 ns, total: 7.21 ms
Wall time: 6.89 ms
CPU times: user 4.4 ms, sys: 18 µs, total: 4.42 ms
Wall time: 3.89 ms
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
CPU times: user 5.69 ms, sys: 0 ns, total: 5.69 ms
Wall time: 4.15 ms
CPU times: user 4.74 ms, sys: 0 ns, total: 4.74 ms
Wall time: 4.24 ms
CPU times: user 4.51 ms, sys: 0 ns, total: 4.51 ms
Wall time: 4.1 ms
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
CPU times: user 35.6 ms, sys: 35.7 ms, total: 71.3 ms
Wall time: 79.1 ms
CPU times: user 86.2 ms, sys: 112 ms, total: 198 ms
Wall time: 202 ms
CPU times: user 116 ms, sys: 164 ms, total: 280 ms
Wall time: 282 ms

6.2 "fastmath" + "parallel"

In this section, we have re-defined the functions that we have been using for the last few examples, decorating them with @jit with the nopython, cache, fastmath, and parallel options all set to True.

from numba import jit, int64, float32, float64, prange

@jit([int64(int64), float64(float64)], nopython=True, cache=True, fastmath=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True, fastmath=True, parallel=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i in prange(len(x)):
        res = cube_formula(x[i])
        out[i] = res
    return out

Below we have tested our fastmath-optimized and parallelized functions by executing them three times with each of the different arrays, recording the time taken for comparison. We can notice from the results that the times are almost the same as those of the parallel section above.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
CPU times: user 27 ms, sys: 7.95 ms, total: 35 ms
Wall time: 14.2 ms
CPU times: user 12.6 ms, sys: 3.56 ms, total: 16.1 ms
Wall time: 9.19 ms
CPU times: user 21 ms, sys: 0 ns, total: 21 ms
Wall time: 10.6 ms
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
CPU times: user 28 ms, sys: 0 ns, total: 28 ms
Wall time: 10.3 ms
CPU times: user 11.7 ms, sys: 0 ns, total: 11.7 ms
Wall time: 3.37 ms
CPU times: user 32.6 ms, sys: 0 ns, total: 32.6 ms
Wall time: 15.4 ms
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
CPU times: user 46.9 ms, sys: 35.1 ms, total: 82 ms
Wall time: 43.5 ms
CPU times: user 92.8 ms, sys: 55.8 ms, total: 149 ms
Wall time: 60 ms
CPU times: user 57 ms, sys: 51.7 ms, total: 109 ms
Wall time: 33.2 ms

Example 7: Release Python GIL during Multi-Threading on Multi-Core CPU

One of Python's drawbacks when using multi-threading is the GIL, which in many situations prevents Python from actually executing more than one thread at a time. To work around this, Numba lets us release the GIL by setting the nogil parameter of the @jit decorator to True. When Numba can convert the majority of the Python code to low-level machine code, it is no longer necessary to hold Python's GIL while that code runs.
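
To make the benefit concrete, here is a minimal hypothetical sketch (the chunking helper below is not part of the tutorial's code) of running a nogil=True compiled function from several Python threads; because the jitted body releases the GIL, the threads can actually execute on different cores at the same time.

from concurrent.futures import ThreadPoolExecutor
import numpy as np
from numba import njit

@njit(nogil=True)
def cube_formula_chunk(x, out):
    # the GIL is released while this compiled loop runs
    for i in range(x.shape[0]):
        out[i] = x[i]**3 + 3*x[i]**2 + 3

def run_in_threads(x, n_threads=4):
    out = np.empty_like(x)
    x_chunks = np.array_split(x, n_threads)      # views into x
    out_chunks = np.array_split(out, n_threads)  # views into out
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(cube_formula_chunk, xc, oc)
                   for xc, oc in zip(x_chunks, out_chunks)]
        for future in futures:
            future.result()
    return out

out = run_in_threads(np.arange(1e7))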

Our functions for this example are exact copies of the functions we defined in example 5 (with one minor change) when explaining how to use multi-threading with the Numba @jit decorator by setting parallel=True. The one change is that we have also set the nogil parameter to True this time so that the GIL is released.

from numba import jit, int64, float32, float64, prange

@jit([int64(int64), float64(float64)], nopython=True, nogil=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, nogil=True, parallel=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i in prange(len(x)):
        res = cube_formula(x[i])
        out[i] = res
    return out

Below we have executed our jit-decorated function three times with each input: first with an array of 1M integers, then with an array of 1M floats, and at last with an array of 10M floats, recording the time taken each time. We can notice from the recorded times that the function seems to be doing better compared to the majority of our previous trials.

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.int64))
CPU times: user 30.8 ms, sys: 0 ns, total: 30.8 ms
Wall time: 13.4 ms
CPU times: user 47.2 ms, sys: 0 ns, total: 47.2 ms
Wall time: 17.3 ms
CPU times: user 3.67 ms, sys: 95 µs, total: 3.76 ms
Wall time: 3.12 ms
%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e6, dtype=np.float64))
CPU times: user 26.6 ms, sys: 0 ns, total: 26.6 ms
Wall time: 11.3 ms
CPU times: user 37.5 ms, sys: 0 ns, total: 37.5 ms
Wall time: 14.5 ms
CPU times: user 2.89 ms, sys: 30 µs, total: 2.92 ms
Wall time: 2.92 ms
%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))

%time out = perform_operation_jitted(np.arange(1e7, dtype=np.float64))
CPU times: user 67.7 ms, sys: 36.1 ms, total: 104 ms
Wall time: 55.5 ms
CPU times: user 98.8 ms, sys: 72.1 ms, total: 171 ms
Wall time: 67.4 ms
CPU times: user 74.7 ms, sys: 35.5 ms, total: 110 ms
Wall time: 33 ms

Example 8: Numba does not Improve Pandas Code

As we have highlighted many times, Numba works well with Python loops and numpy. Although Pandas is built on top of numpy, Numba cannot improve code that operates on pandas data structures through pandas operations. The reason is that Numba does not have access to, and thus cannot optimize, the lower-level code behind the pandas API.

Below we have created a simple function that takes a pandas dataframe as input, performs some operations on its columns, and returns the modified dataframe. We have first run the function normally 3 times and recorded the time of each run.

We have then @jit-decorated the same function and run it again three times, recording the times as well. We can clearly see from the results that the @jit decorator does not improve the results; it even increases the time taken by the first call.

The example below shows that using Numba on code involving only pandas will not improve performance. It can even backfire: the first run takes extra time because Numba tries to compile the code, fails, and finally falls back to pure Python, as the warning messages below demonstrate.

Though decorating functions involving pandas dataframes with the @jit decorator does not improve results, there are ways to speed up such functions. We have discussed how to improve code involving pandas dataframes using Numba and its decorators in a separate tutorial; please feel free to check it. A short hypothetical sketch of the usual pattern is also included at the end of this example.

def work_on_dataframe(df):
    df['Col1'] = (df.Col1 * 100)
    df['Col2'] = (df.Col1 * df.Col3)
    df = df.where((df > 100) & (df < 10000))
    df = df.dropna(how='any')
    return df

data = {'Col1': range(10000), 'Col2': range(10000), 'Col3': range(10000)}
df = pd.DataFrame(data=data)

%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
CPU times: user 5.01 ms, sys: 27 µs, total: 5.04 ms
Wall time: 4.85 ms
CPU times: user 1.94 ms, sys: 421 µs, total: 2.36 ms
Wall time: 2.19 ms
CPU times: user 2.81 ms, sys: 0 ns, total: 2.81 ms
Wall time: 2.62 ms
from numba import jit

@jit
def work_on_dataframe(df):
    df['Col1'] = (df.Col1 * 100)
    df['Col2'] = (df.Col1 * df.Col3)
    df = df.where((df > 100) & (df < 10000))
    df = df.dropna(how='any')
    return df

data = {'Col1': range(1000), 'Col2': range(1000), 'Col3': range(1000)}
df = pd.DataFrame(data=data)

%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
%time df = work_on_dataframe(df)
CPU times: user 123 ms, sys: 3.85 ms, total: 127 ms
Wall time: 130 ms
CPU times: user 116 µs, sys: 3.21 ms, total: 3.32 ms
Wall time: 2.86 ms
CPU times: user 4.43 ms, sys: 88 µs, total: 4.52 ms
Wall time: 4.2 ms
<ipython-input-17-05273aead2fd>:3: NumbaWarning: 
Compilation is falling back to object mode WITH looplifting enabled because Function "work_on_dataframe" failed type inference due to: non-precise type pyobject
During: typing of argument at <ipython-input-17-05273aead2fd> (5)

File "<ipython-input-17-05273aead2fd>", line 5:
def work_on_dataframe(df):
    df['Col1'] = (df.Col1 * 100)
    ^

  @jit
/home/sunny/anaconda3/lib/python3.7/site-packages/numba/core/object_mode_passes.py:152: NumbaWarning: Function "work_on_dataframe" was compiled in object mode without forceobj=True.

File "<ipython-input-17-05273aead2fd>", line 4:
@jit
def work_on_dataframe(df):
^

  state.func_ir.loc))
/home/sunny/anaconda3/lib/python3.7/site-packages/numba/core/object_mode_passes.py:162: NumbaDeprecationWarning: 
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit https://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "<ipython-input-17-05273aead2fd>", line 4:
@jit
def work_on_dataframe(df):
^

  state.func_ir.loc))
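
Below is a small hypothetical sketch of the usual workaround mentioned above (not the separate tutorial's exact code): extract the underlying numpy arrays with .to_numpy() and run the heavy numeric step through an njit function, leaving the dataframe bookkeeping to pandas.

import numpy as np
import pandas as pd
from numba import njit

@njit
def multiply_arrays(a, b):
    # plain numpy arrays, so Numba can compile this loop to machine code
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] * b[i]
    return out

def work_on_dataframe_numba(df):
    df['Col1'] = df.Col1 * 100
    # numeric work goes through Numba; pandas handles the rest
    df['Col2'] = multiply_arrays(df.Col1.to_numpy(), df.Col3.to_numpy())
    df = df.where((df > 100) & (df < 10000))
    return df.dropna(how='any')
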
Sunny Solanki
