testing


I know what you’re thinking: “what the hell? You can’t subclass modules!” Conventional wisdom == wrong.

import os

class MyOS(os):
    __metaclass__ = ModuleMeta

    def lstat(self, arg):
        return 6

    def rmdir(self, arg):
        raise self.error("No such file or directory: %r" % arg)

Notice that we’re apparently subclassing a module. The metaclass will allow us to override whichever of the module’s functions we desire, leaving the others intact.

class ModuleMeta(type):
    def __new__(cls, name, bases, d):
        d["__getattr__"] = lambda x, y: getattr(bases[0], y)
        return type.__new__(cls, name, (object,), d)

This is the little beauty that makes the whole thing possible. Here, we stick a custom __getattr__() function into the class’s namespace, then replace the incoming bases tuple with our own. The bases we were passed will contain a module, and that will cause the runtime to complain if the module reaches type.__new__().

Some client code:

os = MyOS()
print os.lstat("foo")
print os.times()
os.rmdir("foo")

Our custom os-alike provides its own rmdir() and lstat() functions while using the times() function from the real os module. This works in Python 2.3, 2.4 and 2.5.
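For anyone reading this on Python 3: the __metaclass__ attribute is no longer consulted there, and as far as I can tell a class-based metaclass trips a metaclass-conflict check before its __new__() ever runs. Here’s a minimal sketch of the same trick under those assumptions, using a plain function as the metaclass. The name module_meta is mine, made up for this example, and I use getcwd() rather than times() purely for the demonstration.

```python
import os


def module_meta(name, bases, namespace):
    """Build an ordinary class that falls back to the "base" module.

    Assumption: Python 3, where any callable may be passed as metaclass=.
    """
    module = bases[0]  # the module we are pretending to subclass
    # __getattr__ only runs when normal lookup fails, so methods defined
    # in the class body win and everything else comes from the module.
    namespace["__getattr__"] = lambda self, attr: getattr(module, attr)
    # Drop the module from the bases; type() would reject it.
    return type(name, (object,), namespace)


class MyOS(os, metaclass=module_meta):
    def lstat(self, arg):
        return 6

    def rmdir(self, arg):
        raise self.error("No such file or directory: %r" % arg)


fake_os = MyOS()
print(fake_os.lstat("foo"))   # our override
print(fake_os.getcwd())       # falls through to the real os module
```

The function never hands the module to type(), which is the same dodge the ModuleMeta class performs by swapping in (object,) as the bases.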

I see requests for this fairly regularly from people who want to stub out certain functions in a module for testing purposes. Of course, the easy way to do this isn’t to subclass the module at all: just create a class that does what you want.

class MyOS:
    def lstat(self, arg):
        return 6

    def rmdir(self, arg):
        raise self.error("No such file or directory: %r" % arg)

    def __getattr__(self, attr):
        return getattr(os, attr)

No fuss, no muss, and it’s fully equivalent to the above magic metaclass incantations. I’ll talk more about this in a future post.

So Tyler says to me, he says:

I went to the Nashville PHP group last night. The conversation turned to which languages are on the rise, and I threw Python into the mix. Problem is, I had very little ammo to arm myself with. Got a list of bullet points as to why Python is better?

Well, yes and no.

In terms of functionality, there’s very little difference between Perl 5, Python, PHP and Ruby. The reasons to choose one over the other are typically very domain-specific (hence subtle and of little use when fighting religious wars): Perl 5 makes text munging simple by having, e.g., regular expressions as first-class citizens; PHP makes web applications more natural because, well, that’s what it was designed to do.

I have nothing really positive (or negative) to say about Ruby. I can’t think of any special niche that it fills. Anonymous blocks? Perl 5 has them. Pure OO? Python has it. call/cc? If you think you need continuations, you probably don’t. You could argue that Ruby serves a purpose by combining all these things, but the number of people who sincerely need a pure OO language with anonymous blocks and continuations is probably around five.

The negative thing I can think of with respect to Perl 5 and PHP is that it’s hard to do dependency injection-based testing in these languages. It’s so hard in Java, for example, that even Google has invented a tool to make Java DI easier. Python, on the other hand, makes this dead-simple, making it so much easier to test your code from all perspectives. Hell, it’s so easy in Python, I didn’t even know there was a name for it until I came to Google. I don’t know how easy DI is in Ruby, but if it’s not Python-easy, Ruby loses.
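To make that concrete, here’s a minimal sketch of what DI-style testing tends to look like in Python. The names is_expired and FixedClock are made up for illustration; the only assumption is that the code under test takes its collaborator as a parameter instead of reaching for a global.

```python
import time


def is_expired(deadline, clock=time):
    """Return True once the clock has reached the deadline.

    `clock` is anything with a time() method; production code just
    uses the default, which is the real time module.
    """
    return clock.time() >= deadline


class FixedClock:
    """Test double: a clock frozen at a chosen moment."""

    def __init__(self, now):
        self.now = now

    def time(self):
        return self.now


print(is_expired(100, clock=FixedClock(99)))
print(is_expired(100, clock=FixedClock(100)))
```

No container, no annotations, no framework: duck typing means any object with the right method can be injected.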

That’s one criterion for programming languages that I don’t see discussed much: ranking languages by how easy the code is to test. One frequent example is mocking a global resource like a time source. C, C++ and Java all require you to come up with unnatural function signatures or link against special libraries when testing in order to gain control over time. It’s easier in Perl 5, but it still requires a good deal of specialized knowledge of how namespaces and module lookups work. Assuming the target library does something like import time at the top, here’s how you take control of a given module’s time source in Python:

>>> import some_module
>>> class StubTime:
...     def time(self):
...         return 3634634
...
>>> some_module.time = StubTime()

Done. No specialized knowledge of interpreter details, no crazy setup, just done. If mocking global resources isn’t that easy in PHP, Ruby or any other language, I have little use for it beyond toy projects. Testing is where I feel Python really stands out.
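If you want to try this without a real library on hand, here’s a self-contained sketch. It fabricates a throwaway some_module in memory (the stamp() function is a stand-in I made up), then swaps its time source the same way as above.

```python
import types

# Stand-in for a real library: a module that does `import time` at the top.
some_module = types.ModuleType("some_module")
exec(
    "import time\n"
    "\n"
    "def stamp():\n"
    "    return 'created at %d' % time.time()\n",
    some_module.__dict__,
)


class StubTime:
    def time(self):
        return 3634634


# Rebind the name `time` inside some_module only; the real time module
# and every other importer are left alone.
some_module.time = StubTime()
print(some_module.stamp())
```

The stub works because stamp() looks up the name time in its own module’s namespace at call time, so rebinding that one name is enough.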

Two weeks or so ago, I brought up my unittest redesign on the new testing-in-python mailing list. A number of people were upset that in redesigning unittest, I had rejected nose and py.test; Titus Brown even wrote a few blog posts on the subject, in particular taking me to task for ignoring nose.

I’ll be honest: when I started redesigning unittest, I did ignore nose and py.test. I remembered looking at them a long time ago, when I was first getting frustrated with unittest, casting around for a better, more flexible alternative. py.test has no support for extensions and depends on the rest of the py library, so that’s out. nose has plugins, but my general impression was that it’s just a nice test discovery tool; since that wasn’t what I was looking for, I didn’t care. Thinking that perhaps the project has changed significantly since the last time I looked at it, I took another, closer look at nose’s infrastructure. Verdict: it’s still a nice test discovery tool, but since that’s still not what I’m looking for, I still don’t care.

And now we will have a brief intermezzo, and I will explain exactly why I’m redesigning unittest.

First of all, I didn’t start off with the intention of rewriting the whole module. I began by trying to change the existing design so that it would be easier to compose extensions. So I poked and I tweaked and prodded and twisted unittest until it was unrecognizable, until I was left with something that resembled the old version in name only. That is to say: this didn’t start out as a rewrite — it just ended up that way.

Now, what do I mean when I say “composing extensions”? Yes, unittest as-shipped allows you to extend its functionality by way of subclassing this bit and that bit, but the problem comes when trying to mash two extensions together: you can’t. You can’t put your unittest extensions — say, one that does refcount checking for C extensions or one that writes test results to a database — up on PyPI and have people be able to mix and match to create just the right testing environment for their project.

This all has one major design implication for your testing framework: extensions must operate without knowing anything about what other extensions might be running. The framework has to be designed so that extensions can operate by themselves just as well as they do with 15 others.

nose doesn’t come anywhere close to supporting this.

(Note: the following is based on my best understanding of nose’s codebase and on conversations with others. If I’ve gotten anything wrong, please let me know and I’ll gladly retract it.)

“That’s crap,” you say, “nose has plugins!” Ha. nose plugins don’t come anywhere close to achieving this level of independence. If I want to add a plugin to allow tests to be marked as TODO, there’s no way for this new kind of test-status to make its way into the various reporting plugins. As far as I can tell, just to get TODO tests not to show up as failures in the default console output, I’d have to:

  • Subclass nose.result.TextTestResult, overriding addError() so that it picks up the TODO-ness of the test.

  • Subclass nose.core.TextTestRunner, overriding _makeResult() so that it uses my TextTestResult subclass.

  • Subclass nose.core.TestProgram, overriding runTests() so that it uses my TextTestRunner subclass.

  • Replace nose.core.run() with a function that uses my TestProgram subclass.

    Of course, by the time my plugin is running and trying to do all this subclassing/replacing malarkey, nose.core.run() has already been called, so it’s too late.

By contrast, adding this kind of support to my unittest redesign is trivial. Omitting the TODO() decorator and exception classes (which you’d need for the nose version, too):

class TodoRunner(TestRunner):
  categories = ['todo pass', 'todo fail']

  def handle_exception(self, test, exc_info):
    exc_type = exc_info[0]
    if issubclass(exc_type, TodoPassed):
      self.log_exception('todo pass', test, exc_info)
    elif issubclass(exc_type, TodoFailed):
      self.log_exception('todo fail', test, exc_info)
    else:
      super(TodoRunner, self).handle_exception(test, exc_info)

  def was_successful(self):
    parent_success = super(TodoRunner, self).was_successful()
    return parent_success and not self.still_todo()

  def still_todo(self):
    return (self.exceptions['todo pass']
            or self.exceptions['todo fail'])

  def failure_label(self):
    if self.still_todo():
      return 'TODO'
    return super(TodoRunner, self).failure_label()

With those lines of code, all output extensions — console, database, XML, etc — will automatically recognize TODO tests and treat them as such. No fuss, no muss.
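To show why this composes, here’s a self-contained sketch in Python 3. The TestRunner base below is a toy stand-in I’m assuming for illustration, not the real redesign; SkipRunner, Skipped, ProjectRunner and raiser are likewise made up. The point is only that each extension handles what it recognises and hands everything else to super().

```python
import sys


class TodoPassed(Exception):
    pass


class TodoFailed(Exception):
    pass


class Skipped(Exception):
    pass


class TestRunner:
    """Toy base runner (an assumption, not the real redesigned API)."""
    categories = ['failure']

    def __init__(self):
        # Gather categories from every class in the MRO, so extensions
        # never need to know which other extensions are mixed in.
        self.exceptions = {}
        for klass in type(self).__mro__:
            for category in vars(klass).get('categories', ()):
                self.exceptions.setdefault(category, [])

    def run(self, test):
        try:
            test()
        except Exception:
            self.handle_exception(test, sys.exc_info())

    def handle_exception(self, test, exc_info):
        self.log_exception('failure', test, exc_info)

    def log_exception(self, category, test, exc_info):
        self.exceptions[category].append((test, exc_info))

    def was_successful(self):
        return not self.exceptions['failure']


class TodoRunner(TestRunner):
    categories = ['todo pass', 'todo fail']

    def handle_exception(self, test, exc_info):
        exc_type = exc_info[0]
        if issubclass(exc_type, TodoPassed):
            self.log_exception('todo pass', test, exc_info)
        elif issubclass(exc_type, TodoFailed):
            self.log_exception('todo fail', test, exc_info)
        else:
            super().handle_exception(test, exc_info)

    def was_successful(self):
        return super().was_successful() and not self.still_todo()

    def still_todo(self):
        return bool(self.exceptions['todo pass']
                    or self.exceptions['todo fail'])


class SkipRunner(TestRunner):
    categories = ['skipped']

    def handle_exception(self, test, exc_info):
        if issubclass(exc_info[0], Skipped):
            self.log_exception('skipped', test, exc_info)
        else:
            super().handle_exception(test, exc_info)


class ProjectRunner(TodoRunner, SkipRunner):
    """Two independent extensions, combined with no glue code."""


def raiser(exc_type):
    def test():
        raise exc_type()
    return test


runner = ProjectRunner()
for exc_type in (TodoFailed, Skipped, ValueError):
    runner.run(raiser(exc_type))
counts = {name: len(logged) for name, logged in runner.exceptions.items()}
print(counts)
```

Neither TodoRunner nor SkipRunner mentions the other, yet ProjectRunner sorts each exception into the right category. That independence is the property I’m after.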

Now, all this isn’t to say that nose is crap. What I said earlier is still true: nose is a good test discovery tool. I even hope to borrow some of its discovery strategies for the new design. What nose is not, however, is an ultra-flexible test environment framework where extensions can be shared easily and openly, and that’s what I’m going for.

Following up on an earlier post, I’ve just submitted a trio of patches for Python’s unittest module to SourceForge:

  • Patch #1550272 is the test suite itself. It comprises 128 tests for the mission-critical parts of unittest.

  • Patch #1550273 fixes 6 issues uncovered while writing the test suite. Several other items that I raised earlier were judged to be either non-issues or behaviours that, while suboptimal, people have come to rely on.

  • Patch #1550263 follows up on an earlier patch I submitted for unittest’s docs. This new patch corrects and clarifies numerous sections of the module’s documentation.

I’m hopeful that these changes will make it into Python 2.5-final or 2.5.1 at the latest.

Here’s a list of the issues I uncovered while writing the test suite:

  1. TestLoader.loadTestsFromName() failed to return a suite when resolving a name to a callable that returns a TestCase instance.

  2. Fix a bug in both TestSuite.addTest() and TestSuite.addTests() concerning a lack of input checking on the input test case(s)/suite(s).

  3. Fix a bug in both TestLoader.loadTestsFromName() and TestLoader.loadTestsFromNames() that had ValueError being raised instead of TypeError. The problem occurred when the given name resolved to a callable and the callable returned something of the wrong type.

  4. When a name resolves to a method on a TestCase subclass, TestLoader.loadTestsFromName() did not return a suite as promised.

  5. TestLoader.loadTestsFromName() would raise a ValueError (rather than a TypeError) if a name resolved to an invalid object. This has been fixed so that a TypeError is raised.

  6. TestResult.shouldStop was being initialised to 0 in TestResult.__init__. Since this attribute is always used in a boolean context, it’s better to use the False spelling.
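
For the curious, here’s a small sketch of the behaviour described in items 1 through 5, written against a current Python 3 unittest rather than the patched 2.5 module. The demo module and every name inside it are made up for the example, as is the error_type helper.

```python
import types
import unittest


class DemoTest(unittest.TestCase):
    def test_ok(self):
        pass


def make_case():
    return DemoTest("test_ok")


def make_garbage():
    return 42


# A throwaway module object to resolve names against.
demo = types.ModuleType("demo")
demo.DemoTest = DemoTest
demo.make_case = make_case
demo.make_garbage = make_garbage
demo.not_a_test = 42

loader = unittest.TestLoader()

# Items 1 and 4: both of these come back wrapped in a suite.
from_callable = loader.loadTestsFromName("make_case", demo)
from_method = loader.loadTestsFromName("DemoTest.test_ok", demo)


def error_type(func, *args):
    """Return the name of the exception func(*args) raises, or None."""
    try:
        func(*args)
    except Exception as exc:
        return type(exc).__name__
    return None


# Items 3 and 5: bad results and bad objects raise TypeError.
bad_result = error_type(loader.loadTestsFromName, "make_garbage", demo)
bad_object = error_type(loader.loadTestsFromName, "not_a_test", demo)

# Item 2: addTest() rejects things that are not tests.
bad_add = error_type(unittest.TestSuite().addTest, "not a test")

print(type(from_callable).__name__, type(from_method).__name__)
print(bad_result, bad_object, bad_add)
```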

As promised, and prompted in part by a recent post by Brett Cannon, here’re my thoughts on why unittest sucks.

Reading the docs for unittest, you’d think it would be easily extensible. You see things like TestCase.defaultTestResult(), the many overridable methods on TestResult objects, the apparent flexibility of TextTestRunner, and you get it in your head that it should be pretty easy to make it do whatever you want.

Armed with this impression and your Python-fu, you set off to write a unittest extension that will let you mark certain tests as “TODO”. You want the test harness to count these tests differently than normal tests: TODO tests are supposed to fail, and you want to be notified when they start unexpectedly passing.

You tinker a bit, you poke and you prod, and you wind up with your extension, and the whole thing works great. You just wish you hadn’t had to subclass _TextTestResult, TestCase and TextTestRunner to get the job done. You feel like it could have been easier, but you don’t pay it much mind. After all, you only needed the one extension.

A few months later, a different project has a need to run reference-count checks around each test case for a C extension module. Confident from your first experience extending unittest, you head back into the code. A little later, you emerge, bearing the shiny new reference count-checking extension to unittest. You again ended up subclassing _TextTestResult, TestCase and TextTestRunner, but again, it’s just one extension.

An hour later, your boss walks by and says that the ref-counting extension and the TODO extension need to be combined so they can be used together on a new project. No problem, you say; composing the two should be cake.

That thought lasts about as long as it takes to load the extensions in your editor of choice.

unittest might have been intended to be extended, but only in simple ways, and only by one extension at a time. I’ll save you the suspense; to combine the above extensions, you have to write a completely different third extension, which attempts to merge the two functionality sets as much as possible. You want to incorporate another extension, say one that logs the test results to a database? Tough luck.

unittest’s design is fundamentally broken. Little or no attempt was made to separate the different concerns at work here: TestCase instances can determine what result logger to use and how exceptions are to be interpreted. Making TextTestRunner use a subclass of _TextTestResult means subclassing the runner object. TestResult is responsible for converting tracebacks to a textual representation, even though this means that any result classes that want to introspect the tracebacks end up completely rewriting much of TestResult in the process.

That’s the problem; next time, the solution.

In developing the common test harness for my functional and svnmock projects, I’ve run into trouble with the standard Python module unittest, mainly with respect to how hard it is to combine different extensions to unittest’s TestCase, TestRunner and TestResult classes.

So, I head off to start poking around in unittest’s internals, to see what mucking around I can do in order to be able to compose extensions the way I want to. (Exactly what I’m trying to achieve will be the subject of another post.) I find — to my great horror — that unittest’s test suite consists of the following code:

import unittest

def test_TestSuite_iter():
    '''
    >>> test1 = unittest.FunctionTestCase(lambda: None)
    >>> test2 = unittest.FunctionTestCase(lambda: None)
    >>> suite = unittest.TestSuite((test1, test2))
    >>> tests = []
    >>> for test in suite:
    ...     tests.append(test)
    >>> tests == [test1, test2]
    True
    '''

How’s that for irony.

So: I’ve now spent three or four days going over unittest’s documentation and code, building up a test suite as I go. Work is proceeding approximately like so:

  1. Clean up the documentation. The old docs were full of typos and grammatical problems, not to mention the blatant factual errors and omissions.

    The bulk of this work is already done: a patch for the docs was accepted and applied to Python’s SVN repository as r51123.

  2. Write the test suite. As it stands, I’m up to 121 tests, giving me 60% test coverage, according to figleaf (which proved gratifyingly easy to integrate into the test suite). The bulk of those tests are for unittest.TestLoader, particularly its loadTestsFromName() and loadTestsFromNames() methods.

  3. Fix the bugs. Thus far, I’ve uncovered 23 bugs in unittest; some of these are clear-cut and easy to fix, while others will require discussion on python-dev. In addition, I’ve got 14 test cases for functionality that unittest should have had from the beginning, but doesn’t; these will have to wait until Python 2.6.