wonderful language for rapid prototyping and code development, but one thing I often hear people say about using it is that it’s slow to execute. This is a particular pain point for data scientists and ML engineers, as they often perform computationally intensive operations, such as matrix multiplication, gradient descent calculations or image processing.
Over time, Python has evolved internally to address some of these issues by introducing new features to the language, such as multi-threading, or by rewriting existing functionality for improved performance. However, Python’s use of the Global Interpreter Lock (GIL) has often hamstrung efforts like this.
Many external libraries have also been written to bridge this perceived performance gap between Python and compiled languages such as Java. Perhaps the most used and well-known of these is the NumPy library. Implemented largely in C, NumPy was designed from the ground up for super-fast numerical and array processing, and some of its linear-algebra routines can use multiple CPU cores through the underlying BLAS library.
There are alternatives to NumPy, and in a recent TDS article, I introduced the numexpr library, which, in many use cases, can even outperform NumPy. If you’re interested in learning more, I’ll include a link to that story at the end of this article.
Another external library that is very effective is Numba. Numba utilises a Just-in-Time (JIT) compiler for Python, which translates a subset of Python and NumPy code into fast machine code at runtime. It’s designed to accelerate numerical and scientific computing tasks by leveraging the LLVM (Low-Level Virtual Machine) compiler infrastructure.
In this article, I would like to discuss another runtime-enhancing external library, Cython. It’s one of the most performant Python libraries but also one of the least understood and used. I think this is at least partially because you have to get your hands a little bit dirty and make some changes to your original code. But if you follow the simple four-step plan I’ll outline below, the performance benefits you can achieve will make it more than worthwhile.
What’s Cython?
If you haven’t heard of Cython, it’s a superset of Python designed to provide C-like performance with code written mostly in Python. It allows for converting Python code into C code, which can then be compiled into shared libraries that can be imported into Python just like regular Python modules. This process results in the performance benefits of C while maintaining the readability of Python.
I’ll showcase the exact benefits you can achieve by converting your code to use Cython, examining three use cases and providing the four steps required to convert your existing Python code, along with comparative timings for each run.
Setting up a development environment
Before continuing, we should set up a separate development environment for coding to keep our project dependencies separate. I’ll be using WSL2 Ubuntu for Windows and a Jupyter Notebook for code development. I use the UV package manager to set up my development environment, but feel free to use whatever tools and methods suit you.
$ uv init cython-test
$ cd cython-test
$ uv venv
$ source .venv/bin/activate
(cython-test) $ uv pip install cython jupyter numpy scipy pillow matplotlib
Now, type ‘jupyter notebook’ into your command prompt. You should see a notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after running the jupyter notebook command. Near the bottom of that, there will be a URL you should copy and paste into your browser to launch the Jupyter Notebook.
Your URL will be different to mine, but it should look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d
Example 1 – Speeding up for loops
Before we start using Cython, let’s begin with a regular Python function and time how long it takes to run. This will be our base benchmark.
We’ll code a simple double-for-loop function that takes a few seconds to run, then use Cython to speed it up and measure the differences in runtime between the two methods.
Here is our baseline standard Python code.
# sum_of_squares.py
import timeit
# Define the standard Python function
def slow_sum_of_squares(n):
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total
# Benchmark the Python function
print("Python function execution time:")
print("timeit:", timeit.timeit(
    lambda: slow_sum_of_squares(20000),
    number=1))
On my system, the above code produces the following output.
Python function execution time:
13.135973724005453
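As a quick sanity check on the benchmark, the double loop has a closed form. Each i * i (and each j * j) term is added n times, so the total is 2n times the sum of squares below n, which simplifies to n²(n − 1)(2n − 1) / 3. The sketch below is not part of the benchmark itself, and the helper names are just illustrative choices. It confirms the formula against the loop for small n and shows how large the n = 20000 result is, which is worth keeping in mind once fixed-width C integer types come into play.

```python
def loop_sum_of_squares(n):
    """Same double loop as the baseline benchmark."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total


def closed_form_sum_of_squares(n):
    """Closed form of the double loop: n^2 * (n - 1) * (2n - 1) / 3."""
    return n * n * (n - 1) * (2 * n - 1) // 3


# The loop and the formula agree for small inputs
for n in range(0, 50):
    assert loop_sum_of_squares(n) == closed_form_sum_of_squares(n)

expected = closed_form_sum_of_squares(20000)
print(expected)                # exact value the benchmark should return
print(expected > 2**31 - 1)    # True: too large for a 32-bit signed integer
print(expected < 2**63 - 1)    # True: fits in a 64-bit signed integer
```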
Let’s see how much of an improvement Cython makes.
The four-step plan for effective Cython use.
Using Cython to boost your code run-time in a Jupyter Notebook is a simple 4-step process.
Don’t worry if you’re not a Notebook user, as I’ll show how to convert regular Python .py files to use Cython later on.
1/ In the first cell of your notebook, load the Cython extension by typing this command.
%load_ext Cython
2/ For any subsequent cells that contain Python code that you wish to run using Cython, add the %%cython magic command before the code. For example,
%%cython
def myfunction():
    etc ...
    ...
3/ Function definitions that contain parameters should be correctly typed. Cython will compile untyped code, but most of the speed-up comes from static C types.
4/ Finally, all variables should be typed appropriately by using the cdef directive. Also, where it makes sense, use functions from the standard C library (available in Cython via cimport, for example from libc.stdlib).
Taking our original Python code as an example, this is what it needs to look like in order to run in a notebook using Cython after applying all four steps above.
%%cython
def fast_sum_of_squares(int n):
    # 64-bit accumulator: for n = 20000 the total (about 1.07e17) overflows a 32-bit C int
    cdef long long total = 0
    cdef int i, j
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total
import timeit
print("Cython function execution time:")
print("timeit:", timeit.timeit(
    lambda: fast_sum_of_squares(20000),
    number=1))
As I hope you can see, the reality of converting your code is much easier than the four procedural steps might suggest.
The runtime of the above code was impressive. On my system, this new Cython code produces the following output.
Cython function execution time:
0.15829777799808653
That’s a speed-up of over 80x.
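A single timeit run can be noisy, so when comparing two implementations it can help to take the best of several runs and compute the ratio explicitly. The following is a minimal sketch using only the standard library. The names best_time and speedup are hypothetical helpers, and the two timings plugged in are the ones reported above.

```python
import timeit


def best_time(func, repeats=5):
    """Return the fastest of `repeats` single-call timings, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeats))


def speedup(baseline_seconds, optimised_seconds):
    """How many times faster the optimised run is than the baseline."""
    return baseline_seconds / optimised_seconds


# Timings reported above for the Python and Cython versions
python_seconds = 13.135973724005453
cython_seconds = 0.15829777799808653
print(f"Speed-up: {speedup(python_seconds, cython_seconds):.1f}x")

# Example use of best_time on a small stand-in workload
workload_seconds = best_time(lambda: sum(i * i for i in range(10_000)))
print(f"Best of 5: {workload_seconds:.6f} s")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs usually reflect interference from other processes rather than the code itself.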
Example 2 – Calculate pi using Monte Carlo
For our second example, we’ll examine a more complex use case, the foundation of which has numerous real-world applications.
An area where Cython can provide significant performance improvement is in numerical simulations, particularly those involving heavy computation, such as Monte Carlo (MC) simulations. Monte Carlo simulations involve running many iterations of a random process to estimate the properties of a system. MC applies to a wide variety of research fields, including climate and atmospheric science, computer graphics, AI search and quantitative finance. It’s almost always a very computationally intensive process.
As an example, we’ll use Monte Carlo in a simplified way to calculate the value of Pi. This is a well-known example where we take a square with a side length of 1 unit and inscribe a quarter circle inside it with a radius of 1 unit, as shown here.

The ratio of the area of the quarter circle to the area of the square is, clearly, (Pi/4).
So, if we consider many random (x,y) points that all lie inside or on the bounds of the square, as the total number of those points tends to infinity, the ratio of points that lie on or inside the quarter circle to the total number of points tends towards Pi/4. We then multiply this value by 4 to obtain the value of Pi itself.
Here is some typical Python code you might use to model this.
import random
import time
def monte_carlo_pi(num_samples):
    inside_circle = 0
    for _ in range(num_samples):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        if (x**2) + (y**2) <= 1:
            inside_circle += 1
    return (inside_circle / num_samples) * 4
# Benchmark the standard Python function
num_samples = 100000000
start_time = time.time()
pi_estimate = monte_carlo_pi(num_samples)
end_time = time.time()
print(f"Estimated Pi (Python): {pi_estimate}")
print(f"Execution Time (Python): {end_time - start_time} seconds")
Running this produced the following timing result.
Estimated Pi (Python): 3.14197216
Execution Time (Python): 20.67279839515686 seconds
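How close should we expect the estimate to be? Each sample lands inside the quarter circle with probability p = Pi/4, so the standard error of the estimate 4 × (hits / N) is 4 × sqrt(p(1 − p) / N). The sketch below evaluates that formula. The name pi_standard_error is a hypothetical helper, and the comparison uses the estimate printed above.

```python
import math


def pi_standard_error(num_samples):
    """Standard error of the Monte Carlo Pi estimate for num_samples points."""
    p = math.pi / 4  # probability that a random point lands in the quarter circle
    return 4 * math.sqrt(p * (1 - p) / num_samples)


num_samples = 100_000_000
std_err = pi_standard_error(num_samples)
observed_error = abs(3.14197216 - math.pi)  # estimate printed above

print(f"Standard error for N={num_samples}: {std_err:.2e}")
print(f"Observed error: {observed_error:.2e}")
print(f"Observed error in standard errors: {observed_error / std_err:.1f}")
```

Because the error shrinks only as 1/sqrt(N), each extra decimal digit of accuracy needs roughly 100 times more samples, which is why raw loop speed matters so much for this kind of simulation.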
Now, here is the Cython implementation we get by following our four-step process.
%%cython
import cython
from libc.stdlib cimport rand, RAND_MAX
@cython.boundscheck(False)
@cython.wraparound(False)
def monte_carlo_pi(int num_samples):
    cdef int inside_circle = 0
    cdef int i
    cdef double x, y
    for i in range(num_samples):
        # Cast to double so the division is floating-point at any language level
        x = rand() / <double>RAND_MAX
        y = rand() / <double>RAND_MAX
        if (x**2) + (y**2) <= 1:
            inside_circle += 1
    return (<double>inside_circle / num_samples) * 4
import time
num_samples = 100000000
# Benchmark the Cython function
start_time = time.time()
pi_estimate = monte_carlo_pi(num_samples)
end_time = time.time()
print(f"Estimated Pi (Cython): {pi_estimate}")
print(f"Execution Time (Cython): {end_time - start_time} seconds")
And here is the new output.
Estimated Pi (Cython): 3.1415012
Execution Time (Cython): 1.9987852573394775 seconds
Once again, that’s a pretty impressive 10x speed-up for the Cython version.
One thing we did in this code example that we didn’t in the other is import some functions from the C standard library. That was the line,
from libc.stdlib cimport rand, RAND_MAX
The cimport command is a Cython keyword used to import C functions, variables, constants, and types. We used it to import the C library’s rand() function and RAND_MAX constant, which together stand in for Python’s random.uniform() function.
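The expression rand() / RAND_MAX relies on floating-point division: rand() returns an integer between 0 and RAND_MAX, and dividing by RAND_MAX rescales it to the range [0, 1]. The pure-Python sketch below mimics this so you can see why integer division would break the estimate. It is an illustration under assumptions: RAND_MAX is set to 2147483647 (the glibc value; the C standard only guarantees at least 32767), and fake_rand is a hypothetical stand-in for the C rand() function.

```python
import random

RAND_MAX = 2147483647  # assumed value (glibc); platform-dependent in C


def fake_rand():
    """Hypothetical stand-in for C's rand(): an integer in [0, RAND_MAX]."""
    return random.randint(0, RAND_MAX)


samples = [fake_rand() for _ in range(1000)]

# Floating-point division spreads the values across [0, 1]
scaled = [r / RAND_MAX for r in samples]
print(min(scaled), max(scaled))

# Integer (floor) division collapses almost everything to 0
floored = [r // RAND_MAX for r in samples]
print(set(floored))
```

If every x and y collapsed to 0, every point would count as inside the circle and the estimate would come out as 4.0, so a suspiciously round result is a good hint that integer division has crept in.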
Example 3 – Image manipulation
For our final example, we’ll do some image manipulation. Specifically, some image convolution, which is a common operation in image processing. There are many use cases for image convolution. We’re going to use it to try to sharpen the slightly blurry image shown below.
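To make the idea concrete before working on a real image: a sharpening convolution replaces each pixel with a weighted sum of itself and its neighbours. The pure-Python sketch below applies a common 3x3 sharpening kernel to a tiny grayscale grid. It is only an illustration. The function name sharpen_gray and the test grids are hypothetical, and border pixels are simply copied through unchanged.

```python
KERNEL = [[0, -1, 0],
          [-1, 5, -1],
          [0, -1, 0]]


def sharpen_gray(gray):
    """Apply the 3x3 sharpening kernel to a 2D list of 0-255 pixel values."""
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]  # borders are copied unchanged
    for i in range(1, height - 1):
        for j in range(1, width - 1):
            acc = 0
            for ki in range(-1, 2):
                for kj in range(-1, 2):
                    acc += KERNEL[ki + 1][kj + 1] * gray[i + ki][j + kj]
            out[i][j] = min(max(acc, 0), 255)  # clip to the valid pixel range
    return out


flat = [[100] * 5 for _ in range(5)]                   # uniform grey patch
edge = [[50, 50, 50, 200, 200] for _ in range(5)]      # vertical dark-to-light edge

print(sharpen_gray(flat)[2])  # [100, 100, 100, 100, 100]
print(sharpen_gray(edge)[2])  # [50, 50, 0, 255, 200]
```

Flat regions come out unchanged because the kernel weights sum to 1, while the step from 50 to 200 is exaggerated (and clipped to the 0 to 255 range), which is what makes edges look crisper.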

First, here is the regular Python code.
from PIL import Image
import numpy as np
from scipy.signal import convolve2d
import time
import os
import matplotlib.pyplot as plt
def sharpen_image_color(image):
    # Start timing
    start_time = time.time()
    # Convert image to RGB in case it's not already
    image = image.convert('RGB')
    # Define a sharpening kernel
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]])
    # Convert image to numpy array
    image_array = np.array(image)
    # Debugging: Check input values
    print("Input array values: Min =", image_array.min(), "Max =", image_array.max())
    # Prepare an empty array for the sharpened image
    sharpened_array = np.zeros_like(image_array)
    # Apply the convolution kernel to each channel (assuming RGB image)
    for i in range(3):
        channel = image_array[:, :, i]
        # Perform convolution
        convolved_channel = convolve2d(channel, kernel, mode='same', boundary='wrap')
        # Clip values to be in the range [0, 255]
        convolved_channel = np.clip(convolved_channel, 0, 255)
        # Store back in the sharpened array
        sharpened_array[:, :, i] = convolved_channel.astype(np.uint8)
    # Debugging: Check output values
    print("Sharpened array values: Min =", sharpened_array.min(), "Max =", sharpened_array.max())
    # Convert array back to image
    sharpened_image = Image.fromarray(sharpened_array)
    # End timing
    duration = time.time() - start_time
    print(f"Processing time: {duration:.4f} seconds")
    return sharpened_image
# Correct path for WSL2 accessing Windows filesystem
image_path = '/mnt/d/photos/taj_mahal.png'
image = Image.open(image_path)
# Sharpen the image
sharpened_image = sharpen_image_color(image)
if sharpened_image:
    # Show using PIL's built-in show method (for debugging)
    #sharpened_image.show(title="Sharpened Image (PIL Show)")
    # Display the original and sharpened images using Matplotlib
    fig, axs = plt.subplots(1, 2, figsize=(15, 7))
    # Original image
    axs[0].imshow(image)
    axs[0].set_title("Original Image")
    axs[0].axis('off')
    # Sharpened image
    axs[1].imshow(sharpened_image)
    axs[1].set_title("Sharpened Image")
    axs[1].axis('off')
    # Show both images side by side
    plt.show()
else:
    print("Failed to generate sharpened image.")
The output is this.
Input array values: Min = 0 Max = 255
Sharpened array values: Min = 0 Max = 255
Processing time: 0.1034 seconds
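The clipping step in the code above matters because a convolved value can fall outside 0 to 255, and converting such a value to an 8-bit unsigned integer wraps around rather than saturating. The sketch below imitates that wrap-around in plain Python using modulo 256, which is my understanding of how an integer-to-uint8 cast behaves. The names to_uint8_wrapping and clip are hypothetical helpers.

```python
def to_uint8_wrapping(value):
    """Imitate casting an integer to an 8-bit unsigned value (wraps modulo 256)."""
    return value % 256


def clip(value, low=0, high=255):
    """Clamp value to the inclusive range [low, high]."""
    return max(low, min(high, value))


for raw in (-100, 128, 350):
    wrapped = to_uint8_wrapping(raw)
    clipped = to_uint8_wrapping(clip(raw))
    print(f"raw={raw:5d}  cast only={wrapped:3d}  clip then cast={clipped:3d}")
```

Without the clip, a very bright result such as 350 would turn into a dark 94, producing speckled artefacts along sharp edges.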

Let’s see if Cython can beat that run time of 0.1034 seconds.
%%cython
# cython: language_level=3
# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def sharpen_image_cython(np.ndarray[np.uint8_t, ndim=3] image_array):
    # Define sharpening kernel
    cdef int kernel[3][3]
    kernel[0][0] = 0
    kernel[0][1] = -1
    kernel[0][2] = 0
    kernel[1][0] = -1
    kernel[1][1] = 5
    kernel[1][2] = -1
    kernel[2][0] = 0
    kernel[2][1] = -1
    kernel[2][2] = 0
    # Declare variables outside of loops
    cdef int height = image_array.shape[0]
    cdef int width = image_array.shape[1]
    cdef int channel, i, j, ki, kj
    cdef int value
    # Prepare an empty array for the sharpened image
    cdef np.ndarray[np.uint8_t, ndim=3] sharpened_array = np.zeros_like(image_array)
    # Convolve each channel separately
    for channel in range(3):  # Iterate over RGB channels
        # Border pixels are skipped, so they stay at zero
        for i in range(1, height - 1):
            for j in range(1, width - 1):
                value = 0  # Reset value at each pixel
                # Apply the kernel
                for ki in range(-1, 2):
                    for kj in range(-1, 2):
                        value += kernel[ki + 1][kj + 1] * image_array[i + ki, j + kj, channel]
                # Clip values to be between 0 and 255
                sharpened_array[i, j, channel] = min(max(value, 0), 255)
    return sharpened_array
# Python part of the code
from PIL import Image
import numpy as np
import time as py_time  # Renaming the Python time module to avoid conflict
import matplotlib.pyplot as plt
# Load the input image
image_path = '/mnt/d/photos/taj_mahal.png'
image = Image.open(image_path).convert('RGB')
# Convert the image to a NumPy array
image_array = np.array(image)
# Time the sharpening with Cython
start_time = py_time.time()
sharpened_array = sharpen_image_cython(image_array)
cython_time = py_time.time() - start_time
# Convert back to an image for displaying
sharpened_image = Image.fromarray(sharpened_array)
# Display the original and sharpened image
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(image)
plt.title("Original Image")
plt.subplot(1, 2, 2)
plt.imshow(sharpened_image)
plt.title("Sharpened Image")
plt.show()
# Print the time taken for Cython processing
print(f"Processing time with Cython: {cython_time:.4f} seconds")
The output is,

Both programs performed well, but Cython was nearly 25 times faster.
What about running Cython outside a Notebook environment?
So far, everything I’ve shown you assumes you’re running your code inside a Jupyter Notebook. The reason I did this is that it’s the easiest way to introduce Cython and get some code up and running quickly. While the Notebook environment is extremely popular among Python developers, a huge amount of Python code is still contained in regular .py files and run from a terminal using the python command.
If that’s your primary mode of coding and running Python scripts, the %load_ext and %%cython IPython magic commands won’t work since these are only understood by Jupyter/IPython.
So, here’s how to adapt my four-step Cython conversion process if you’re running your code as a regular Python script.
Let’s take my first sum_of_squares example to showcase this.
1/ Create a .pyx file instead of using %%cython
Move your Cython-enhanced code into a file named, for example:
sum_of_squares.pyx
# sum_of_squares.pyx
def fast_sum_of_squares(int n):
    # 64-bit accumulator: for n = 20000 the total (about 1.07e17) overflows a 32-bit C int
    cdef long long total = 0
    cdef int i, j
    for i in range(n):
        for j in range(n):
            total += i * i + j * j
    return total
All we did was remove the %%cython directive and the timing code (which will now be in the calling script).
2/ Create a setup.py file to compile your .pyx file
# setup.py
from setuptools import setup
from Cython.Build import cythonize
setup(
    name="cython-test",
    ext_modules=cythonize("sum_of_squares.pyx", language_level=3),
    py_modules=["sum_of_squares"],  # Explicitly state the module
    zip_safe=False,
)
3/ Run the setup.py file using this command,
$ python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-cpython-311/sum_of_squares.cpython-311-x86_64-linux-g
4/ Create a regular Python module to call our Cython code, as shown below, and then run it.
# main.py
import timeit
from sum_of_squares import fast_sum_of_squares
# Call once up front so the result is available to inspect
result = fast_sum_of_squares(20000)
print("timeit:", timeit.timeit(
    lambda: fast_sum_of_squares(20000),
    number=1))
$ python main.py
timeit: 0.14675087109208107
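One practical pattern when shipping a compiled module alongside a script is to fall back to a pure-Python implementation if the extension hasn’t been built yet. The sketch below shows one way to do that. The USING_CYTHON flag is a hypothetical name, and the fallback simply repeats the baseline loop from the first example.

```python
try:
    # Compiled extension built with: python setup.py build_ext --inplace
    from sum_of_squares import fast_sum_of_squares
    USING_CYTHON = True
except ImportError:
    USING_CYTHON = False

    def fast_sum_of_squares(n):
        """Pure-Python fallback with the same signature (much slower)."""
        total = 0
        for i in range(n):
            for j in range(n):
                total += i * i + j * j
        return total


print("Using compiled Cython module:", USING_CYTHON)
print(fast_sum_of_squares(100))
```

This keeps the calling code identical in both cases, so the script still runs on a machine without a C compiler, just more slowly.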
Summary
Hopefully, I’ve convinced you of the efficacy of using the Cython library in your code. Although it may seem a bit complicated at first sight, with a little effort, you can get incredible performance improvements to your run times over using regular Python, even when using fast numerical libraries such as NumPy.
I provided a four-step process to convert your regular Python code to use Cython for running within Jupyter Notebook environments. Additionally, I explained the steps required to run Cython code from the command line outside a Notebook environment.
Finally, I reinforced the above by showcasing examples of converting regular Python code to use Cython.
In the three examples I showed, we achieved speed-ups of 80x, 10x and 25x, which isn’t too shabby at all.
As promised, here is a link to my earlier TDS article on utilising the numexpr library to accelerate Python code.