Apr 7, 2015
Python has grown to be my preferred choice when it comes to programing. It is just so simple to import from its huge selection of modules and just build upon them to get things working. But at some point, when all the design phase is over, it just feels so slow when it has to work.
In this case, I’m referring to numerical calculations. Indeed the numpy and scipy modules are of great help, letting all the numerical calculations to run faster in their C or Fortran coded engines. But what of the algorithm I code in Python, it is just going to be the slow part of my work. So I have to avoid writing too much python and prefer calling all this C coded modules. YES!!, this is just it. BUT is not enough, because numpy and scipy only provide very general functions and it is up to the user to make a use out of them.
So once the user has its algorithm working, he needs to speed it up. Coding it all into C might be too much of a hassle and error prone. He also might loose many of the python modules he depends on in the first place and the code will become less readable. The alternatives for now are to use Cython and the not so new project numba, this two are designed to establish the link between python ease of use and C level performance. My problem is that all the simple examples and tutorials you get from them are to highlight the huge performance boost you can get don’t really match my real life situations. Their examples are based to situations where the complete algorithm your is stand alone and the python interpreter is slowing this down. So a quick call to numba or Cython static typesetting will just solve the problem and give you 3 orders of magnitude performance boost.
My story begins because I was already calling the fast C functions of numpy and scipy, and those, should be the most time consuming functions of my algorithm. But they weren’t, there was enough python code to slow it down. So my first hope was to call numba, but there was no performance gain as it can’t deal with my calls to scipy functions. So Cython was the way to go, I’m certainly not aware of the many possibilities Cython permits to speed up my code, I still believe there is room for improvement, and I would be happy to hear from them in the comments.
So I now present you how I solved my problem and gained a performance boost of 4x. Indeed it is not so glamorous or amazing as 1000x you get from the advertised Cython version of the code, but it is a new start in performance gains over my python code.
So before I start I will stress: HAVE TESTS FOR YOUR CODE! It doesn’t help at all to get faster execution times just to arrive to the wrong results faster. Debugging is a pain, and it is more painful every time your new modifications break old code. Testing lets you test over smaller pieces of your code and make it easier to track what broke your code. I love python so much because there are very simple frameworks to write your tests, so just use them. When is a good time to write your test? If you can before your own module great. After you wrote the module is also good, and even if it gave you the right results from the beginning you have to keep track it continues to do so as your code evolves.
The goal of this work is to implement a Quantum Monte Carlo algorithm to solve the Anderson impurity model and use it in the context of DMFT. There are already many implementations of this code spread around the world many of them in Fortran and C/C++ so why not use them? Well, first they are more complicated to understand, modify, and some times to big packages. Second, I needed to learn and modify it on my own style.
The complete software is freely available on github at learn-dmft, matching the first release of the code(because the code itself will evolve as new ideas come in bugs appear and get fixed. I will focus in this post on the local changes to speed up the code. This code snippets are not aimed to run independently but are give in the aim to provide some insight to the work done.
Lets start by profiling the code. Here you see only the relevant parts
I want to focus on for this short example. Here one sees that most of time
is spent inside the
imp_solver which performs the Monte Carlo
that needs to update a dense matrix each time,
gnew. Almost all of the time
would need to be spend in the matrix update, but because I’m using scipy
interface to BLAS it just doesn’t take that much time and it is the Metropolis
update scheme which is manually implemented in python that takes all the time.
Based on this example for the Ising model I wanted to test numba, but it just did not help, because numba can’t, to my knowledge, compile the calls I do to scipy. Static typing in Cython didn’t work either as the main delay was presented by calling still the scipy module of BLAS. So I need to use Cython to perform the direct call into BLAS. It was thanks to tokyo, a Cython wrapper to BLAS that I could learn this. Although scipy seems to be inserting in its development branch the Cython wrappers to BLAS, but that will be for a later time.
Recoding this into Cython syntax it becomes
It is important to keep in mind that numpy normally uses C ordered arrays,
but if you happen to use
scipy.linalg.blas functions in some steps, your numpy arrays will
become Fortran ordered. Then it is important to keep that ordering when
you call BLAS. (This was my big bug, and why you need tests). So pay special
attention of the keyword of
CblasColMajor to keep the Fortran Column ordering
provided by the numpy arrays. The copy instruction is also necessary, and there
should be better ways of data copying and not creating new numpy arrays.
This first action dropped the amount of python calls almost by half. The new Cython function is faster by almost half. The changing number of calls is because I’m a bad profiler and I’m not using the same seed for the random number generator, but in this case is has quite negligible effects. The total execution time dropped by a bit more than a second.
Now it’s time to redo the
update function that is in charge of the Metropolis
algorithm. It has a very simple structure. It uses most of its time calling
the random number generator and doing the algebra operations over simple floats
But once rewritten into Cython it becomes:
Notice that in this case I’m again calling external functions available in C
like the ones of the GSL libraries. After this changes the end result
is a completely cythonized Metropolis Monte Carlo update function that
runs entirely on C. As it is shown by the profiler.
The amount of python function calls dropped again this time by 5 times. And the
update function internally calling
gnew requires in total 0.841s. Droping
the total time used by the impurity solver from 3.591s to 0.841s that is
a 4.12x speed gain. And most of the time would be spend as expected in the
matrix update function