unittest


Two weeks or so ago, I brought up my unittest redesign on the new testing-in-python mailing list. A number of people were upset that in redesigning unittest, I had rejected nose and py.test; Titus Brown even wrote a few blog posts on the subject, in particular taking me to task for ignoring nose.

I’ll be honest: when I started redesigning unittest, I did ignore nose and py.test. I remembered looking at them a long time ago, when I was first getting frustrated with unittest, casting around for a better, more flexible alternative. py.test has no support for extensions and depends on the rest of the py library, so that’s out. nose has plugins, but my general impression was that it’s just a nice test discovery tool; since that wasn’t what I was looking for, I didn’t care. Thinking that perhaps the project had changed significantly since the last time I looked at it, I took another, closer look at nose’s infrastructure. Verdict: it’s still a nice test discovery tool, but since that’s still not what I’m looking for, I still don’t care.

And now we will have a brief intermezzo, and I will explain exactly why I’m redesigning unittest.

First of all, I didn’t start off with the intention of rewriting the whole module. I began by trying to change the existing design so that it would be easier to compose extensions. So I poked and I tweaked and prodded and twisted unittest until it was unrecognizable, until I was left with something that resembled the old version in name only. That is to say: this didn’t start out as a rewrite — it just ended up that way.

Now, what do I mean when I say “composing extensions”? Yes, unittest as-shipped allows you to extend its functionality by way of subclassing this bit and that bit, but the problem comes when trying to mash two extensions together: you can’t. You can’t put your unittest extensions — say, one that does refcount checking for C extensions or one that writes test results to a database — up on PyPI and have people be able to mix and match to create just the right testing environment for their project.

This has one major design implication for your testing framework: extensions must operate without knowing anything about what other extensions might be running. The framework has to be designed so that extensions can operate by themselves just as well as they do with 15 others.

nose doesn’t come anywhere close to supporting this.

(Note: the following is based on my best understanding of nose’s codebase and on conversations with others. If I’ve gotten anything wrong, please let me know and I’ll gladly retract it.)

“That’s crap,” you say, “nose has plugins!” Ha. nose plugins don’t come anywhere close to achieving this level of independence. If I want to add a plugin to allow tests to be marked as TODO, there’s no way for this new kind of test-status to make its way into the various reporting plugins. As far as I can tell, just to get TODO tests not to show up as failures in the default console output, I’d have to:

  • Subclass nose.result.TextTestResult, overriding addError() so that it picks up the TODO-ness of the test.

  • Subclass nose.core.TextTestRunner, overriding _makeResult() so that it uses my TextTestResult subclass.

  • Subclass nose.core.TestProgram, overriding runTests() so that it uses my TextTestRunner subclass.

  • Replace nose.core.run() with a function that uses my TestProgram subclass.

    Of course, by the time my plugin is running and trying to do all this subclassing/replacing malarkey, nose.core.run() has already been called, so it’s too late.

By contrast, adding this kind of support to my unittest redesign is trivial. Omitting the TODO() decorator and exception classes (which you’d need for the nose version, too):

class TodoRunner(TestRunner):
  categories = ['todo pass', 'todo fail']

  def handle_exception(self, test, exc_info):
    exc_type = exc_info[0]
    if issubclass(exc_type, TodoPassed):
      self.log_exception('todo pass', test, exc_info)
    elif issubclass(exc_type, TodoFailed):
      self.log_exception('todo fail', test, exc_info)
    else:
      super(TodoRunner, self).handle_exception(test, exc_info)

  def was_successful(self):
    parent_success = super(TodoRunner, self).was_successful()
    return parent_success and not self.still_todo()

  def still_todo(self):
    # Parentheses keep the two-line expression a single valid statement.
    return (self.exceptions['todo pass']
            or self.exceptions['todo fail'])

  def failure_label(self):
    if self.still_todo():
      return 'TODO'
    return super(TodoRunner, self).failure_label()

With those lines of code, all output extensions — console, database, XML, etc — will automatically recognize TODO tests and treat them as such. No fuss, no muss.

Now, all this isn’t to say that nose is crap. What I said earlier is still true: nose is a good test discovery tool. I even hope to borrow some of its discovery strategies for the new design. What nose is not, however, is an ultra-flexible test environment framework where extensions can be shared easily and openly, and that’s what I’m going for.

A long time ago, in a blog post a few pages back in the archives, I spent a few paragraphs bemoaning Python’s unittest module and how it can’t be readily extended, nor can its extensions be easily composed. I gave as examples an extension that allows you to mark tests as “todo” and an extension that did reference counting around each test case (for C modules). While writing the extensions themselves was a little harder than I would have liked, the biggest problem was composing them — using both at the same time. Specifically, you can’t compose them, not without writing all-new code to merge the two functionalities. Consider:

TODO support:
        140 lines (5 core classes, 4 support classes/funcs)

Refcounting support:
        117 lines (4 core classes)

Composition:
        197 lines (6 core classes, 4 support classes/funcs)
        105 lines (3 classes of entirely new/rewritten code)

(All code snippets can be found in this directory. Code related to the old unittest design is in the before/ subdir, that related to the new design is in after/.)

test_harness, my new unittest package, was designed with flexibility and extensibility in mind. Using the same todo/refcounting examples from above:

TODO support:
        61 lines (1 core class, 4 support classes/funcs)

Refcounting support:
        36 lines (1 core class)

Composition:
        5 lines (1 core class, 3 imports)

That’s right: todo and refcounting support, with results written to stdout in five lines. And one of those lines is blank.

Where the new design really shines is in output. Unlike the old design, where you’d have to rewrite everything, changing your logging scheme from to-console to XML means changing this:

    from test_harness import TextRunner
    from refcounting import RefcountRunner
    from todo import TodoRunner, TODO

    class OurRunner(TextRunner, RefcountRunner, TodoRunner):
        pass

to this:

    from xmlrunner import XmlTestRunner
    from refcounting import RefcountRunner
    from todo import TodoRunner, TODO

    class OurRunner(XmlTestRunner, RefcountRunner, TodoRunner):
        pass

That’s a two-line change, where the old system would have required a complete rewrite. Want both XML and to-console logging? Stick with the old unittest design and you’re looking at yet another rewrite. test_harness allows you to do this:

from test_harness import TextRunner
from xmlrunner import XmlTestRunner
from refcounting import RefcountRunner
from todo import TodoRunner, TODO

class OurRunner(TextRunner, XmlTestRunner, RefcountRunner, TodoRunner):
    pass

The biggest problem with the old unittest design is that, in trying to separate out the various concerns, it left the different components interconnected. TestCase objects depend on TestResult objects having certain methods; TestLoaders depend on your test case classes subclassing TestCase; TestRunners control which TestResult is used; etc. test_harness does away with this menagerie in favor of a single class: TestRunner. TestRunner objects are responsible for test suite iteration, running each individual test, collecting and categorizing any exceptions, and summarizing the results of the test run. Test loading/discovery is orthogonal to this process and as such is left to other packages, though rudimentary solutions are provided with the new package.
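As a rough sketch of why a single-class design composes so easily: if every extension overrides a hook and passes anything it doesn't recognise along via super(), Python's method resolution order does the merging. The TestRunner below is a toy stand-in written for this illustration, not the real test_harness code; the run_test hook, CountingRunner (standing in for a refcount checker) and the MRO-based category collection are all assumptions.

```python
import sys


class TestRunner:
    """Toy stand-in for a one-class harness (not the real test_harness code)."""

    categories = ["failure", "error"]

    def __init__(self):
        # Collect the categories declared by every class in the MRO, so an
        # extension only lists the categories it introduces itself.
        self.exceptions = {}
        for klass in type(self).__mro__:
            for name in vars(klass).get("categories", ()):
                self.exceptions.setdefault(name, [])

    def run(self, tests):
        for test in tests:
            self.run_test(test)
        return self.was_successful()

    def run_test(self, test):  # hypothetical per-test hook
        try:
            test()
        except Exception:
            self.handle_exception(test, sys.exc_info())

    def handle_exception(self, test, exc_info):
        is_failure = issubclass(exc_info[0], AssertionError)
        self.log_exception("failure" if is_failure else "error", test, exc_info)

    def log_exception(self, category, test, exc_info):
        self.exceptions[category].append((test, exc_info))

    def was_successful(self):
        return not (self.exceptions["failure"] or self.exceptions["error"])


class TodoFailed(Exception):
    """Hypothetical marker: a TODO test failed, as expected."""


class TodoRunner(TestRunner):
    categories = ["todo fail"]

    def handle_exception(self, test, exc_info):
        if issubclass(exc_info[0], TodoFailed):
            self.log_exception("todo fail", test, exc_info)
        else:
            # Not ours: let the next class in the MRO deal with it.
            super().handle_exception(test, exc_info)


class CountingRunner(TestRunner):
    """Stand-in for a refcount checker: wraps every individual test run."""

    def __init__(self):
        super().__init__()
        self.tests_run = 0

    def run_test(self, test):
        self.tests_run += 1
        super().run_test(test)


class OurRunner(CountingRunner, TodoRunner):
    pass


def passing():
    pass


def failing():
    raise AssertionError("boom")


def unfinished():
    raise TodoFailed("not written yet")


runner = OurRunner()
all_passed = runner.run([passing, failing, unfinished])
```

Neither extension mentions the other, yet OurRunner counts all three tests and files the unfinished one under 'todo fail' rather than 'failure'.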

The biggest gripes about unittest I heard while researching its problems are that you a) have to subclass TestCase, and b) have to use TestCase methods to indicate test success/failure. In test_harness, there is no requirement to subclass TestCase (nor is there a TestCase class to subclass). Also, the usage of TestCase methods to signal failure, a consequence of the old TestCase/TestResult linkage, has been replaced with a test_harness.assertion submodule that contains functions like ok(), are_equal(), etc. Mapping old spellings to new:

    self.failUnless()               < = >    ok()
    self.assertEqual()              < = >    are_equal()
    self.failIfEqual()              < = >    are_not_equal()
    self.failUnlessAlmostEqual()    < = >    are_almost_equal()
    self.assertRaises()             < = >    raises()
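To make the idea concrete, here is a minimal sketch of what free-standing assertion functions like these could look like. Only the function names come from the mapping above; the signatures, default arguments and messages are assumptions made for illustration, not the actual test_harness.assertion code.

```python
def ok(value, message=None):
    """Fail unless value is true."""
    if not value:
        raise AssertionError(message or "%r is not true" % (value,))


def are_equal(first, second, message=None):
    """Fail unless the two objects compare equal."""
    if not first == second:
        raise AssertionError(message or "%r != %r" % (first, second))


def are_not_equal(first, second, message=None):
    """Fail if the two objects compare equal."""
    if first == second:
        raise AssertionError(message or "%r == %r" % (first, second))


def are_almost_equal(first, second, places=7, message=None):
    """Fail unless the difference rounds to zero at the given precision."""
    if round(abs(second - first), places) != 0:
        raise AssertionError(
            message or "%r != %r within %d places" % (first, second, places))


def raises(exc_type, func, *args, **kwargs):
    """Fail unless calling func(*args, **kwargs) raises exc_type."""
    try:
        func(*args, **kwargs)
    except exc_type:
        return
    raise AssertionError("%s not raised" % exc_type.__name__)


def test_division():
    # A test is just a function; no TestCase subclass, no self.
    are_equal(6 / 3, 2)
    are_almost_equal(0.1 + 0.2, 0.3)
    raises(ZeroDivisionError, lambda: 1 / 0)
```

Because the functions signal failure by raising an ordinary exception, a test needs nothing from the harness beyond the import.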

Anyone interested is encouraged to play around with the new design. Comments to collinw at gmail point com

Following up on an earlier post, I’ve just submitted a trio of patches for Python’s unittest module to SourceForge:

  • Patch #1550272 is the test suite itself. It comprises 128 tests for the mission-critical parts of unittest.

  • Patch #1550273 fixes 6 issues uncovered while writing the test suite. Several other items that I raised earlier were judged to be either non-issues or behaviours that, while suboptimal, people have come to rely on.

  • Patch #1550263 follows up on an earlier patch I submitted for unittest’s docs. This new patch corrects and clarifies numerous sections of the module’s documentation.

I’m hopeful that these changes will make it into Python 2.5-final or 2.5.1 at the latest.

Here’s a list of the issues I uncovered while writing the test suite:

  1. TestLoader.loadTestsFromName() failed to return a suite when resolving a name to a callable that returns a TestCase instance.

  2. Fix a bug in both TestSuite.addTest() and TestSuite.addTests() concerning a lack of input checking on the input test case(s)/suite(s).

  3. Fix a bug in both TestLoader.loadTestsFromName() and TestLoader.loadTestsFromNames() that had ValueError being raised instead of TypeError. The problem occurred when the given name resolved to a callable and the callable returned something of the wrong type.

  4. When a name resolves to a method on a TestCase subclass, TestLoader.loadTestsFromName() did not return a suite as promised.

  5. TestLoader.loadTestsFromName() would raise a ValueError (rather than a TypeError) if a name resolved to an invalid object. This has been fixed so that a TypeError is raised.

  6. TestResult.shouldStop was being initialised to 0 in TestResult.__init__. Since this attribute is always used in a boolean context, it’s better to use the False spelling.
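Several of the behaviours in the list can be checked against the unittest that ships with a current Python 3. The sketch below does that; the module and attribute names (fake_tests, make_test and so on) are made up for the example, and a throwaway module object stands in for a real test module.

```python
import types
import unittest

loader = unittest.TestLoader()
test = unittest.FunctionTestCase(lambda: None)


class Sample(unittest.TestCase):
    def test_nothing(self):
        pass


# A throwaway module object standing in for a real test module.
module = types.ModuleType("fake_tests")
module.make_test = lambda: test   # callable returning a TestCase instance
module.make_junk = lambda: 42     # callable returning the wrong type
module.not_a_test = 42            # not a test and not callable
module.Sample = Sample


def raises_type_error(func, *args):
    """Report whether func(*args) raises TypeError."""
    try:
        func(*args)
    except TypeError:
        return True
    return False


# Issue 1: a callable that returns a TestCase instance yields a suite.
from_callable = loader.loadTestsFromName("make_test", module)

# Issue 4: a method on a TestCase subclass yields a suite as well.
from_method = loader.loadTestsFromName("Sample.test_nothing", module)

# Issues 3 and 5: wrong return types and invalid objects raise TypeError.
junk_rejected = raises_type_error(loader.loadTestsFromName, "make_junk", module)
invalid_rejected = raises_type_error(
    loader.loadTestsFromName, "not_a_test", module)

# Issue 2: addTest() checks its input instead of accepting anything.
bad_add_rejected = raises_type_error(unittest.TestSuite().addTest, 42)
```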

As promised, and prompted in part by a recent post by Brett Cannon, here’re my thoughts on why unittest sucks.

Reading the docs for unittest, you’d think it would be easily extensible. You see things like TestCase.defaultTestResult(), the many overridable methods on TestResult objects, the apparent flexibility of TextTestRunner, and you get it in your head that it should be pretty easy to make it do whatever you want.

Armed with this impression and your Python-fu, you set off to write a unittest extension that will let you mark certain tests as “TODO”. You want the test harness to count these tests differently than normal tests: TODO tests are supposed to fail, and you want to be notified when they start unexpectedly passing.

You tinker a bit, you poke and you prod, and you wind up with your extension, and the whole thing works great. You just wish you hadn’t had to subclass _TextTestResult, TestCase and TextTestRunner to get the job done. You feel like it could have been easier, but you don’t pay it much mind. After all, you only needed the one extension.

A few months later, a different project has a need to run reference-count checks around each test case for a C extension module. Confident from your first experience extending unittest, you head back into the code. A little later, you emerge, bearing the shiny new reference count-checking extension to unittest. You again ended up subclassing _TextTestResult, TestCase and TextTestRunner, but again, it’s just one extension.

An hour later, your boss walks by and says that the ref-counting extension and the TODO extension need to be combined so they can be used together on a new project. No problem, you say; composing the two should be cake.

That thought lasts about as long as it takes to load the extensions in your editor of choice.

unittest might have been intended to be extended, but only in simple ways, and only by one extension at a time. I’ll save you the suspense; to combine the above extensions, you have to write a completely different third extension, which attempts to merge the two functionality sets as much as possible. You want to incorporate another extension, say one that logs the test results to a database? Tough luck.

unittest’s design is fundamentally broken. Little or no attempt was made to separate the different concerns at work here: TestCase instances can determine what result logger to use and how exceptions are to be interpreted. Making TextTestRunner use a subclass of _TextTestResult means subclassing the runner object. TestResult is responsible for converting tracebacks to a textual representation, even though this means that any result classes that want to introspect the tracebacks end up completely rewriting much of TestResult in the process.

That’s the problem; next time, the solution.

In developing the common test harness for my functional and svnmock projects, I kept running into problems with the standard Python module unittest, mainly with respect to how hard it is to combine different extensions to unittest’s TestCase, TestRunner and TestResult classes.

So, I head off to start poking around in unittest’s internals, to see what mucking around I can do in order to be able to compose extensions the way I want to. (Exactly what I’m trying to achieve will be the subject of another post.) I find — to my great horror — that unittest’s test suite consists of the following code:

import unittest

def test_TestSuite_iter():
    '''
    >>> test1 = unittest.FunctionTestCase(lambda: None)
    >>> test2 = unittest.FunctionTestCase(lambda: None)
    >>> suite = unittest.TestSuite((test1, test2))
    >>> tests = []
    >>> for test in suite:
    ...     tests.append(test)
    >>> tests == [test1, test2]
    True
    '''

How’s that for irony.

So: I’ve now spent three or four days going over unittest’s documentation and code, building up a test suite as I go. Work is proceeding approximately like so:

  1. Clean up the documentation. The old docs were full of typos and grammatical problems, not to mention the blatant factual errors and omissions.

    The bulk of this work is already done: a patch for the docs was accepted and applied to Python’s SVN repository as r51123.

  2. Write the test suite. As it stands, I’m up to 121 tests, giving me 60% test coverage, according to figleaf (which proved gratifyingly easy to integrate into the test suite). The bulk of those tests are for unittest.TestLoader, particularly its loadTestsFromName() and loadTestsFromNames() methods.

  3. Fix the bugs. Thus far, I’ve uncovered 23 bugs in unittest; some of these are clear-cut and easy to fix, while others will require discussion on python-dev. In addition, I’ve got 14 test cases for functionality that unittest should have had from the beginning, but doesn’t; these will have to wait until Python 2.6.