I just hand-drew a dependency chart of HTML Purifier, and let me say, it ain't pretty. Unfortunately, I don't have any pictures for you folks (not that "you folks" exist, heh, heh), since I did the diagram in pencil and it needs cleaning up, perhaps some SVG-izing, before it gets released into the wild. Still, there's some interesting information to be picked up from this draft diagram.
First of all, in many areas of HTML Purifier, there are "sub-systems" which are self-contained and don't have any dependencies outside of their group. Examples:
- Encoder, EntityParser and EntityLookup
- PercentEncoder
- AttrDef_URI_Host, IPv4, IPv6
These tend to be independent components that could easily have been shipped in some other library, but for some reason or another were needed by HTML Purifier. They are also used in very few places (one or two at most).
There are two primary areas in the graph that are "dependency hell": the Strategy classes, and the HTMLDefinition sub-system. The Strategy classes have a ridiculously large number of dependencies on all manner of things, which one might suppose is sensible behavior, so I won't worry too much (even though it's like a matrix of lines criss-crossing in and out).
In the HTMLDefinition sub-system, however, you start seeing some strange things. Theoretically speaking, HTMLDefinition is the "facade", so to speak, but Strategies usually also need to interface directly with ElementDef, which in turn results in direct interfacing with ChildDef, AttrDef and AttrTransform, making for a not-so-effective facade. At least it provides a central access point.
Behind it all is HTMLModuleManager. It has dependencies on every object in the compound: HTMLModule, AttrTypes, AttrCollections and ContentSets. This makes sense, as this class is responsible for coordinating the actions of all these sub-objects. So how do these sub-objects work together?
Extremely strange ways, it turns out. Imagine an HTMLModule, which is the bread and butter of HTMLModuleManager. It essentially is a container of lots of ElementDef objects. The ElementDef object, however, is not always fully formed. It relies on HTMLModuleManager, AttrCollections and ContentSets (and, indirectly, AttrTypes) to kick it into shape. AttrCollections and ContentSets, however, need to know about all the modules to know what to do. Therefore, circular dependencies!
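To make the loop concrete, here's a rough sketch in Python (not PHP, and not run against the real code base). The edges are simply my reading of the paragraphs above, typed in by hand, and `find_cycle` is a throwaway helper I made up for the illustration.

```python
# Edges transcribed by hand from the prose above. This is an illustration,
# not the output of any tool run against the HTML Purifier source.
DEPENDS_ON = {
    "HTMLModuleManager": ["HTMLModule", "AttrTypes", "AttrCollections", "ContentSets"],
    "HTMLModule": ["ElementDef"],
    "ElementDef": ["HTMLModuleManager", "AttrCollections", "ContentSets"],
    "AttrCollections": ["HTMLModule", "AttrTypes"],
    "ContentSets": ["HTMLModule"],
    "AttrTypes": [],
}


def find_cycle(graph, start):
    """Depth-first search; return one cycle reachable from start, or None."""
    path = []        # nodes on the current DFS path, in order
    on_path = set()  # same nodes, for O(1) membership checks
    done = set()     # nodes fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # We walked back into our own path: slice out the loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    return visit(start)


cycle = find_cycle(DEPENDS_ON, "HTMLModule")
print(" -> ".join(cycle))
```

Starting from HTMLModule, the search runs into a loop almost immediately, which is about what the pencil diagram looks like.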
It's not as bad as it seems, for practical purposes. AttrCollections and ContentSets actually depend on different aspects of HTMLModule for their operation, so the dependency chain is broken. However, for unit-testing purposes, this is hellish.
To get to the bottom of the problem, consider unit testing an HTMLModule. All a module does is set up a half-baked representation of its elements, as well as some global attributes. How does one determine if the HTMLModule has operated correctly? By comparing it against a fully formed module that's known to be correct... but producing that is the whole point of HTMLModule! So, unless we figure out how to parse DTDs or XML Schemas (which are not guaranteed to be correct), we have no way of directly unit testing the class.
I tried to get around this by testing not for structure, but for the behavior of the ensuing HTMLModule. But this causes problems. Here's an analogy: we're able to test that 2 + 2 = 4. We're not, however, able to test that our worksheet says 2 + 2 (because if we did, we would have simply copied our worksheet). We could try testing the worksheet's behavior (sic our math evaluator on it and check the answers), but we end up redundantly expressing what our tests of 2 + 2 = 4 already had shown us. If Big Brother becomes our client and 2 + 2 = 5, we have to change the assertions for both the low-level 2 + 2 = 4 check, and the corresponding integration test.
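Here's the analogy as a toy Python sketch. Everything in it (`evaluate`, `WORKSHEET`) is made up for the illustration; the worksheet stands in for an HTMLModule, and the evaluator for the rest of the machinery.

```python
def evaluate(problem):
    """Toy 'math evaluator': handles only 'a + b' with integers."""
    left, op, right = problem.split()
    if op != "+":
        raise ValueError("only addition is supported in this sketch")
    return int(left) + int(right)


# The "worksheet" is pure data with no behavior of its own.
WORKSHEET = ["2 + 2"]

# Low-level test: checks the evaluator itself.
assert evaluate("2 + 2") == 4

# Structural test of the worksheet: the expected value is just a copy
# of the data, so it verifies nothing.
assert WORKSHEET == ["2 + 2"]

# Behavioral test of the worksheet: it restates the low-level assertion,
# so a change of rules (2 + 2 = 5) means editing both places.
assert [evaluate(p) for p in WORKSHEET] == [4]
```

The second assertion is a tautology and the third is a duplicate, which is the bind I keep ending up in.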
The logical recourse would be to mock the subsystem. But as our dependency chart shows us, this is easier said than done.
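For a sense of what mocking would involve, here's a minimal Python sketch using the standard `unittest.mock`. The function `finish_element`, the `expand` methods and the dictionary layout are all hypothetical; they are not HTML Purifier's API, just a stand-in for "take a half-baked element and kick it into shape".

```python
from unittest.mock import Mock


def finish_element(element, attr_collections, content_sets):
    """Hypothetical step that completes a half-baked element definition."""
    element["attr"] = attr_collections.expand(element["attr"])
    element["children"] = content_sets.expand(element["children"])
    return element


# Each collaborator has to be stubbed with a canned answer.
attr_collections = Mock()
attr_collections.expand.return_value = {"id": "ID", "class": "NMTOKENS"}
content_sets = Mock()
content_sets.expand.return_value = ["#PCDATA", "em", "strong"]

half_baked = {"attr": ["Core"], "children": ["Inline"]}
finished = finish_element(half_baked, attr_collections, content_sets)

# The test can now check the interaction in isolation.
attr_collections.expand.assert_called_once_with(["Core"])
content_sets.expand.assert_called_once_with(["Inline"])
assert finished["children"] == ["#PCDATA", "em", "strong"]
```

With two collaborators this is tolerable. The trouble is that every extra line on the dependency chart is another stub with another canned answer to keep in sync.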
It quite possibly boils down to this: the whole point of testing is that, after we've finished our work, we can go through a much simpler algorithm to verify our results. For a mathematical equation, this is easy: we can just make up some problems and an answer key. For an HTML definition, it is not: we cannot make "problems", for they are more complicated than the definition itself.
Maybe a smoketest is the way to go. Grumble grumble.