or, Adventures in the Python Bytecode Compiler, episode 1

It’s been quiet around here lately…almost too quiet. I’ve spent the last 2.5 weeks (minus time spent battling the flu) investigating possibilities for optimising Python’s bytecode compiler and interpreter. One of the areas uncovered was the bytecode-level implementation of list comprehensions, which were being translated to some pretty hideous stuff.

[x for x in range(10) if x % 2]

translates to

      0 BUILD_LIST               0
      3 DUP_TOP
      4 LOAD_ATTR                0 (append)
      7 STORE_NAME               1 (_[1])
     10 LOAD_NAME                2 (range)
     13 LOAD_CONST               0 (10)
     16 CALL_FUNCTION            1
     19 GET_ITER
>>   20 FOR_ITER                31 (to 54)
     23 STORE_NAME               3 (x)
     26 LOAD_NAME                3 (x)
     29 LOAD_CONST               1 (2)
     32 BINARY_MODULO
     33 JUMP_IF_FALSE           14 (to 50)
     36 POP_TOP
     37 LOAD_NAME                1 (_[1])
     40 LOAD_NAME                3 (x)
     43 CALL_FUNCTION            1
     46 POP_TOP
     47 JUMP_ABSOLUTE           20
>>   50 POP_TOP
     51 JUMP_ABSOLUTE           20
>>   54 DELETE_NAME              1 (_[1])

The part I didn’t like was how the list was being built up. At the beginning, a bound append method on an empty list is stowed away, then invoked with each succcessive new element. That means that every element incurs the overhead of some extra stack manipulation and a method resolution+invocation (which isn’t cheap).

My solution was to resurrect an unused opcode, LIST_APPEND. Rather than take its list argument off the stack, it now takes an integer argument, indicating the number of items to look down the stack. This avoids a lot of unnecessary stack manipulation and function call overhead, since we end up calling PyList_Append directly.

The example from above, rewritten to use the new LIST_APPEND opcode:

      0 BUILD_LIST               0
      3 LOAD_NAME                0 (range)
      6 LOAD_CONST               0 (10)
      9 CALL_FUNCTION            1
     12 GET_ITER
>>   13 FOR_ITER                27 (to 43)
     16 STORE_NAME               1 (x)
     19 LOAD_NAME                1 (x)
     22 LOAD_CONST               1 (2)
     25 BINARY_MODULO
     26 JUMP_IF_FALSE           10 (to 39)
     29 POP_TOP
     30 LOAD_NAME                1 (x)
     33 LIST_APPEND              2
     36 JUMP_ABSOLUTE           13
>>   39 POP_TOP
     40 JUMP_ABSOLUTE           1

So, what does this change buy us? Besides shorter generated bytecode, a vastly cleaner C implementation, list comprehensions are now over 16% faster [1]. I’ve submitted this in patch form to SourceForge; hopefully it can make it into Python before 2.5 is released.


[1] - Benchmarking was done by taking all the listcomp examples from Lib/test/test_grammar.py and looping over them a few tens of thousands of times. The old version took 1.69 seconds to run; the new version took 1.41. The testing loop was executed 10000 times each trial.

Update - the patch was rejected. The compiler was apparently supposed to be doing something similar to this already, but that functionality had been left out when the new AST branch was merged in; it’s been reinstated.