Python
has become my preferred language when it comes to programing. It is so
simple to import from its vast library of packages and then build upon them to
get things working. Yet, at some point, when all the design phase is over, and
the software has to perform it is so slow.
In this case, I’m referring to numerical calculations. Indeed the numpy
and
scipy
modules are of great help, letting all the numerical calculations to run
faster in their C
or Fortran
coded engines. But what of the algorithm I
develop in Python
? It is going to be the slow part of my work. Thus I have to
avoid writing too much Python
and prefer calling all this C
coded modules.
BUT is not enough, because numpy
and scipy
only provide very general
functions and it is up to the user to make a use out of them.
Once the user has its algorithm working, he needs to speed it up. Coding it all
into C
might be too much of a hassle and error prone. He also might loose many
of the Python
packages he depended on in the first place and the code will
become less readable. The alternatives for now are to use Cython
or the not so
new project numba
. These two are designed to establish the bridge between
Python
’s ease of use and C
’s performance. My problem is that all the simple
examples and tutorials you get from them are to highlight the huge performance
boost you can get, yet they don’t really match my real life situations. Their
examples are based on situations where the complete algorithm is stand alone and
the Python
interpreter is slowing everything down. Thus a quick call to
numba
or Cython
’s static typesetting will just solve the problem and give
you 3 orders of magnitude performance boost.
My story begins because I was already calling the fast C
functions
of numpy
and scipy
, and those, should be the most time consuming
functions of my algorithm. But they weren’t, there was enough Python
code
to slow it down. So my first hope was to call numba
, but there was no
performance gain as it can’t deal with my calls to scipy
functions. So
Cython
was the way to go, I’m certainly not aware of the many possibilities
Cython
permits to speed up my code, I still believe there is room for
improvement, and I would be happy to hear from them in the comments.
So I now present you how I solved my problem and gained a performance
boost of 4x. Indeed it is not so glamorous or amazing as 1000x you get
from the advertised Cython
version of the code, but it is a new start
in performance gains over my python code.
So before I start I will stress: HAVE TESTS FOR YOUR CODE!
It doesn’t help at all to get faster execution times just to arrive to
the wrong results faster. Debugging is a pain, and it is more painful
every time your new modifications break old code. Testing lets you test
over smaller pieces of your code and make it easier to track what broke
your code. I love Python
so much because there are very simple frameworks
to write your tests, so just use them. When is a good time to write your
test? If you can, before every function you develop. After you wrote your package
is also good. Even if it produces the correct results from the beginning
you have to keep track it continues to do so as your code evolves.
The goal of this work is to implement a Quantum Monte Carlo algorithm
to solve the Anderson impurity model and use it in the context of DMFT.
There are already many implementations of this code spread around the
world, many of them in Fortran
and C/C++
so why not use them? Well, first
they are more complicated to understand, modify, and some times to big
packages. Second, I needed to learn and modify it on my own style.
The complete software is freely available on github at learn-dmft
, matching
the first release of the code(because the code itself will evolve as new ideas
come in, bugs appear and get fixed. I will focus in this post on the local
changes to speed up the code. These code snippets are not aimed to run
independently but are give in the aim to provide some insight to the work
done.
Lets start by profiling the code. Here you see only the relevant parts
I want to focus on for this short example. Here one sees that most of time
is spent inside the imp_solver
which performs the Monte Carlo update
,
that needs to update a dense matrix each time, gnew
. Almost all of the time
would need to be spend in the matrix update, but because I’m using scipy
’s
interface to BLAS
it just doesn’t take that much time and it is the Metropolis
update scheme which is manually implemented in Python
that takes all the time.
1 1214757 function calls (1212402 primitive calls) in 3.903 seconds
2
3 Ordered by: cumulative time
4
5 ncalls tottime percall cumtime percall filename:lineno(function)
6 1 0.029 0.029 3.591 3.591 hirschfye.py:52(imp_solver)
7 9000 1.603 0.000 3.557 0.000 hirschfye.py:83(update)
8 260114 1.635 0.000 1.879 0.000 hirschfye.py:142(gnew)
9 520228 0.244 0.000 0.244 0.000 {method 'copy' of 'numpy.ndarray' objects
Based on this example for the Ising model
I wanted to test numba
, but it did
not help, because numba
can’t, to my knowledge, compile the calls I do to
scipy
. Static typing in Cython
didn’t work either as the main bottleneck
still originates from calling the scipy
modules of BLAS
. So I need to use
Cython
to perform the direct call into BLAS
. It was thanks to tokyo
, a
Cython
wrapper to BLAS
that I could learn this. Although scipy
seems to be
inserting in its development branch the Cython
wrappers to BLAS
, that will
only become usable at a later time.
1from scipy.linalg.blas import dger
2def gnew(g, v, k, sign):
3 """Quick update of the interacting Green function matrix after a single
4 spin flip of the auxiliary field. It calculates
5
6 .. math:: \alpha = \frac{\exp(2\sigma v_j) - 1}
7 {1 + (1 - G_{jj})(\exp(2\sigma v_j) - 1)}
8 .. math:: G'_{ij} = G_{ij} + \alpha (G_{ik} - \delta_{ik})G_{kj}
9
10 no sumation in the indexes"""
11 dv = sign*v*2
12 ee = np.exp(dv)-1.
13 a = ee/(1. + (1.-g[k, k])*ee)
14 x = g[:, k].copy()
15 x[k] -= 1
16 y = g[k, :].copy()
17 g = dger(a, x, y, 1, 1, g, 1, 1, 1)
Recoding this into Cython syntax it becomes
1cdef extern from "cblas.h":
2 enum CBLAS_ORDER: CblasRowMajor, CblasColMajor
3 void lib_dger "cblas_dger"(CBLAS_ORDER Order, int M, int N, double alpha,
4 double *x, int dx, double *y, int dy,
5 double *A, int lda)
6cimport numpy as np
7import cython
8from libc.math cimport exp
9
10@cython.boundscheck(False)
11@cython.wraparound(False)
12@cython.cdivision(True)
13cpdef gnew(np.ndarray[np.float64_t, ndim=2] g, double v, int k, double sign):
14 cdef double dv, ee, alpha
15 cdef int N = g.shape[0]
16 dv = sign*v*2
17 ee = exp(dv)-1.
18 alpha = ee/(1. + (1.-g[k, k])*ee)
19 cdef np.ndarray[np.float64_t, ndim=1] x = g[:, k].copy()
20 cdef np.ndarray[np.float64_t, ndim=1] y = g[k, :].copy()
21
22 x[k] -= 1.
23 lib_dger(CblasColMajor, N, N, alpha,
24 &x[0], 1, &y[0], 1, &g[0,0], N)
It is important to keep in mind that numpy
normally uses C ordered arrays,
but if you happen to use scipy.linalg.blas
functions in some steps, your
numpy
arrays will become Fortran ordered. Then it is important to keep that
ordering when you call BLAS
. (This was my big bug, and why you need tests). So
pay special attention of the keyword of CblasColMajor
to keep the Fortran
Column ordering provided by the numpy
arrays. The copy instruction is also
necessary, and there should be better ways of data copying and not creating new
numpy
arrays.
1 695841 function calls (693486 primitive calls) in 2.704 seconds
2
3 Ordered by: cumulative time
4
5 ncalls tottime percall cumtime percall filename:lineno(function)
6 1 0.028 0.028 2.389 2.389 hirschfye.py:52(imp_solver)
7 9000 1.273 0.000 2.357 0.000 hirschfye.py:83(update)
8 261426 1.026 0.000 1.026 0.000 {hffast.gnew}
This first action dropped the amount of python calls almost by half. The new
Cython
function is faster by almost half. The changing number of calls is
because I’m a bad profiler and I’m not using the same seed for the random
number generator, but in this case is has quite negligible effects. The total
execution time dropped by a bit more than a second.
Now it’s time to redo the update
function that is in charge of the Metropolis
algorithm. It has a very simple structure. It uses most of its time calling
the random number generator and doing the algebra operations over simple floats
1def update(gup, gdw, v):
2 for j in range(v.size):
3 dv = 2.*v[j]
4 ratup = 1. + (1. - gup[j, j])*(np.exp(-dv)-1.)
5 ratdw = 1. + (1. - gdw[j, j])*(np.exp( dv)-1.)
6 rat = ratup * ratdw
7 rat = rat/(1.+rat)
8 if rat > np.random.rand():
9 v[j] *= -1.
10 gnew(gup, v[j], j, 1.)
11 gnew(gdw, v[j], j, -1.)
But once rewritten into Cython
it becomes:
1cdef extern from "gsl/gsl_rng.h":
2 ctypedef struct gsl_rng_type:
3 pass
4 ctypedef struct gsl_rng:
5 pass
6 gsl_rng_type *gsl_rng_mt19937
7 gsl_rng *gsl_rng_alloc(gsl_rng_type * T)
8 double uniform "gsl_rng_uniform"(gsl_rng *r)
9
10cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)
11
12@cython.boundscheck(False)
13@cython.wraparound(False)
14@cython.cdivision(True)
15cpdef update(np.ndarray[np.float64_t, ndim=2] gup,
16 np.ndarray[np.float64_t, ndim=2] gdw,
17 np.ndarray[np.float64_t, ndim=1] v):
18 cdef double dv, ratup, ratdw, rat
19 cdef int j, i, up, dw, pair, N=v.shape[0]
20 for j in range(N):
21 dv = 2.*v[j]
22 ratup = 1. + (1. - gup[j, j])*(exp(-dv)-1.)
23 ratdw = 1. + (1. - gdw[j, j])*(exp( dv)-1.)
24 rat = ratup * ratdw
25 rat = rat/(1.+rat)
26 if rat > uniform(r):
27 v[j] *= -1.
28 gnew(gup, v[j], j, 1.)
29 gnew(gdw, v[j], j, -1.)
Notice that in this case I’m again calling external functions available in C
like the ones of the GSL
libraries. After these changes the end result
is a completely cythonized Metropolis Monte Carlo update function that
runs entirely on C
. As it is shown by the profiler.
1 137415 function calls (135060 primitive calls) in 1.182 seconds
2
3 Ordered by: cumulative time
4
5 ncalls tottime percall cumtime percall filename:lineno(function)
6 1 0.024 0.024 0.870 0.870 hirschfye.py:52(imp_solver)
7 9000 0.841 0.000 0.841 0.000 {hffast.update}
The amount of Python
function calls dropped again this time by 5 times. And the
update
function internally calling gnew
requires in total 0.841s
. Dropping
the total time used by the impurity solver from 3.591s
to 0.841s
that is
a 4.12x
speed gain. And most of the time would be spend as expected in the
matrix update function gnew
.