Speeding up Python using Cython

If you've worked with Python before, you probably know that it makes a trade-off between developer productivity and speed of execution. To keep you as productive as possible, it does a lot of stuff behind the scenes. But this "stuff" comes at a cost.

From within the language, there are mainly two ways of speeding up performance-critical pieces of code. You can use the Python/C API to write your code as a C extension, compile it down to machine code, and then import it from Python. Alternatively, you can write the performance-critical functions in pure C (or in another compiled language), compile them into a shared library, and use ctypes to access them from Python.

Both options work really well. The only problem is that you need to write C. And when you write C, you immediately lose the productivity you gained by switching to Python.

Fortunately, there's a third option called Cython that lets you write Python-ish code without sacrificing performance.

This blog post will focus on how I used Cython to improve the performance of an open source package I recently wrote by up to 95%.

Background

The package in question is streaming-form-data. It's a small library that lets you parse multipart/form-data encoded content in a streaming manner, mostly intended for use in Python web frameworks - specifically in request handlers (views) that accept file uploads. This means that if you're accepting file uploads from an HTML <form>, you don't need to wait for the entire file to be loaded into memory before doing any work. You can basically process the input chunk by chunk.

Problem

The parser works by looking at each byte in the request body, searching for occurrences of the boundary string, which is specified in the Content-Type HTTP header (there's a W3C article that describes this in much more detail). An occurrence of the boundary string means that the data for one input field has ended; after it, either the data for the next input field starts, or it signifies the end of the content (in which case the boundary is slightly different, but for now this level of abstraction should work).
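To make this concrete, here's roughly what a multipart/form-data body looks like. The boundary value, field name, and file contents below are illustrative, not taken from the library:

```python
# An illustrative multipart/form-data body. The boundary ("abc123" here) is
# declared by the client in the Content-Type header, e.g.
#   Content-Type: multipart/form-data; boundary=abc123
body = (
    b'--abc123\r\n'
    b'Content-Disposition: form-data; name="file"; filename="hello.txt"\r\n'
    b'Content-Type: text/plain\r\n'
    b'\r\n'
    b'Hello, World!\r\n'
    b'--abc123--\r\n'  # the trailing "--" marks the end of the content
)

# The parser scans for the delimiter to find where a field's data ends; the
# final delimiter carries an extra "--" suffix.
print(body.count(b'--abc123'))  # → 2
```

The middle delimiter separates fields; the final one (with the `--` suffix) terminates the whole body.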

The way this "boundary finding" works is through a utility Finder class. Its usage looks something like the following.

from streaming_form_data._parser import Finder

finder = Finder(b'needle')

for byte in haystack:
    finder.feed(byte)

    if finder.found:
        print('Found!')
        finder.reset()
        break

It takes the string to be found as a parameter, defines a feed(byte) method for feeding in a single byte from the haystack, and a found property which tells you whether or not it has found what it was looking for.
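A minimal pure-Python class matching that interface could look like the sketch below. This mirrors the usage above but is my own illustration, not the library's actual implementation (which lives in streaming_form_data._parser):

```python
class Finder:
    """Incremental matcher: feed bytes one at a time, then check `found`."""

    def __init__(self, needle):
        self.needle = needle
        self.index = 0  # number of needle bytes matched so far

    def feed(self, byte):
        # Consider the bytes matched so far plus the new byte, and keep the
        # longest suffix of that window that is still a prefix of the needle.
        # (A naive "restart from zero on mismatch" would miss matches when
        # the needle has a repeated prefix, e.g. b'aab' in b'aaab'.)
        buf = self.needle[:self.index] + bytes([byte])
        for k in range(len(buf), 0, -1):
            if self.needle.startswith(buf[-k:]):
                self.index = k
                return
        self.index = 0

    @property
    def found(self):
        return self.index == len(self.needle)

    def reset(self):
        self.index = 0
```

This drops straight into the loop shown above: feed each byte, check found, and reset once a match has been handled.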

Since the parser uses objects of this class to determine whether or not a boundary has been found, you can imagine that the feed function gets called for every single byte of the input. If the input is large, that's a lot of calls! And of course, it shows: my initial implementation took more than 3 seconds to parse a ~300kb file, maxing out the CPU the entire time. 🤒

Cython

Cython is an optimizing static compiler that compiles your Python code to C, which you can then compile down to machine code to achieve essentially native performance. The Cython language is a superset of Python - this means that you can either write your performance-critical code in plain Python and compile it with Cython, or write it using the additional C-like constructs that Cython offers and then do the "Python -> C -> machine code" dance. Either way, it's fascinating to see the amount of C code that Cython saves you from writing.
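To give a flavour of those C-like constructs, here's a small hypothetical .pyx sketch (not code from the library) that searches a bytes object for a single byte using statically typed C variables:

```cython
# find.pyx -- illustrative only; build it with cythonize
cpdef int find_byte(bytes haystack, int needle):
    cdef char target = <char> needle
    cdef char* data = haystack   # direct pointer into the bytes buffer
    cdef int i                   # a plain C int, no Python object overhead
    for i in range(len(haystack)):
        if data[i] == target:
            return i
    return -1
```

Because the loop variable and comparisons are plain C types, the generated C runs the loop without touching Python objects at all, which is where most of the speedup comes from.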

Since Cython is a superset of Python, I decided to try it out. I moved the performance critical pieces of code to Cython, and wrote a wrapper class calling the Cython code to do the actual parsing. And the results were stunning.

For the purposes of this blog post, I used the following function to benchmark the Python code vs. the Cython code.

from requests_toolbelt import MultipartEncoder

def _parse(Parser, filename, content_type):
    with open(filename, 'rb') as fd:
        fields = {
            filename: (filename, fd, content_type)
        }

        encoder = MultipartEncoder(fields=fields)

        parser = Parser(headers={'Content-Type': encoder.content_type})
        parser.data_received(encoder.to_string())

def py_parse(filename, content_type):
    _parse(PyParser, filename, content_type)

def cython_parse(filename, content_type):
    _parse(CythonParser, filename, content_type)

On a 200-byte text file, running timeit gives the following results.

In [9]: %timeit py_parse('file.txt', 'application/text')
1.62 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit cython_parse('file.txt', 'application/text')
219 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Nice. That's already a roughly 86% reduction in run time.

What happens on an 11kb image?

In [7]: %timeit py_parse('image-11k.png', 'image/png')
63.7 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit cython_parse('image-11k.png', 'image/png')
2.35 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's a 96% reduction in time - also very cool!

Now let's increase the file size by a bit more and try out a 0.5mb image.

In [11]: %timeit py_parse('image-500k.png', 'image/png')
2.33 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit cython_parse('image-500k.png', 'image/png')
80.8 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

While the pure Python version takes more than 2 seconds, the Cython version finishes the job in less than 100ms. This is very impressive.

Out of curiosity, I tried throwing a 10mb file at it. I waited for a minute before killing the %timeit py_parse('image-10m.jpg', 'image/jpg') call, while the Cythonized version showed the following results:

In [17]: %timeit cython_parse('image-10m.jpg', 'image/jpg')
1.43 s ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sure, 1.43s for a single 10mb file is still way too much. But I feel this has nothing to do with Cython, and is a shortcoming of the library itself. Like any other piece of code, there's lots of work to be done before calling it a day (pull requests welcome!).

Besides, the point of this blog post is to show what kind of speedups Cython can bring to your code. 🙂

Summary

I'm fairly impressed by Cython. Not just because it's a superset of Python and hence very easy to write and integrate, but also because it's fascinating how much C code it saves you from writing. To get an idea, create a Python file called hello.py, define a function that prints out "Hello, World!", run cython hello.py, and open the generated hello.c in an editor.

💯/💯 would use again. 🙂