Monday, May 28, 2007

Interesting Developments

These have been interesting times for the HTML Purifier codebase. A number of bleeding-edge features were checked in and refactored over the past few days. Highlights:

  1. ConfigForm: Configuration forms are a new way of looking at library configuration. Instead of having to consult the manual and type in every directive you want to use, an HTML interface allows you to tweak and instantly see the results. When you are done, you can generate a PHP stub that will implement this configuration for you. Right now, only the forms are done, but they are already playing a large role in several of the smoketests.
  2. ConfigDoc: Configuration documentation is evolving into something of its own sub-project. It has its own class tree (under the namespace ConfigDoc) and the refactoring will only continue. Steps have been taken to make it easier to use alternative schemas with the codebase.
  3. HTMLDefinition: The amount of convenience functionality packed into this subsystem is unbelievable. There's caching, string conversions to objects, anonymous modules, etc. The flagship product of 1.7 should revolutionize the way end-users specify custom tags and attributes.

Oh, and don't forget the new Tidy modules.

HTML output by DOM: the Path to Enlightenment

The more I work with the DOM (Document Object Model) and XSLT (Extensible Stylesheet Language Transformations), the more I become convinced that DOM to HTML is the way to go. Conceptually speaking, this makes sense: once the HTML reaches the browser it gets converted back into a DOM. The HTML is simply a string "serialization" format, an easy-to-transfer representation of something a little more abstract. Why muddy the waters? Create what you wish to receive.

But also, it's the only method that's really going to get you standards compliance. At the very least, your documents will be well-formed, but if you're smart, you'll be loading the DTD and getting real-time feedback on whether or not your documents validate (this is the way the HTML Purifier website is set up). Instead of going to the W3C validator, the validator comes to you. Concatenating strings is so retro.

It is quite disappointing, then, to have to revert to traditional styles of generating HTML when coding in PHP 4. I really like the DOM extension, and I really like the fact that it is deployed on almost every PHP 5 installation, but in the interest of portability I cannot use it in HTML Purifier. This cuts off a lot of interesting design paths. Well, what can you do...
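To make the "create what you wish to receive" idea concrete, here is a minimal sketch. It is written in Python rather than PHP, using the standard library's xml.dom.minidom as a stand-in for PHP 5's DOM extension; the function name build_page and the sample content are invented for illustration. The point is only that text goes in as text nodes and the serializer takes care of escaping and well-formedness.

```python
from xml.dom import minidom


def build_page(title, paragraphs):
    """Build a tiny XHTML-like document as a DOM tree, then serialize it."""
    impl = minidom.getDOMImplementation()
    doc = impl.createDocument(None, "html", None)
    html = doc.documentElement

    # <head><title>...</title></head>
    head = doc.createElement("head")
    title_el = doc.createElement("title")
    title_el.appendChild(doc.createTextNode(title))
    head.appendChild(title_el)
    html.appendChild(head)

    # <body><p>...</p>...</body>
    body = doc.createElement("body")
    for text in paragraphs:
        p = doc.createElement("p")
        # Text is added as a node, never spliced into a markup string,
        # so special characters are escaped by the serializer.
        p.appendChild(doc.createTextNode(text))
        body.appendChild(p)
    html.appendChild(body)

    return html.toxml()


if __name__ == "__main__":
    print(build_page("Demo", ["Fish & <chips>", "Second paragraph"]))
```

Because the output comes from a tree, unbalanced tags and unescaped ampersands cannot happen by construction. Validating against a DTD is a separate step that this sketch does not attempt.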

Friday, May 25, 2007

The verdict on the SVN extension

Little did I know what I would be getting myself into when I volunteered to write documentation for the SVN extension for PHP. It was originally supposed to be a learning exercise to get myself familiar with the extension so I could use it with a project or two of my own (its being a completely non-standard extension did not deter me: I simply compiled it and dynamically linked it in myself).

Of course, things are rarely that simple. I ended up:

  • Wrangling with and contributing patches for livedocs
  • Learning how to write Docbook
  • Getting a PHP.net CVS account
  • Submitting multiple bug reports and patches for the SVN extension

After working with and prodding the SVN extension for some time now, I've come to the verdict that it is good, but not good enough. Why?

  1. Missing functionality: it's not called a Beta version for nothing. Many functions have not yet been implemented, and virtually none of the switches are supported. It looks like I won't be able to scrap my SVN binaries yet.
  2. Quirky behavior: there are a multitude of subtle logic bugs in the extension. For example, relative paths are resolved relative to the PHP binary directory, not the current working directory. At first it's a showstopper, then it's a nuisance, since you now have to call realpath() on all your filenames. These bugs need to be fixed and more of them need to be found, but this will only happen as usage of the SVN extension goes up.
  3. Outdated client libraries for pecl4win: alright, this may not be a showstopper for you, but it certainly is quite a nuisance for me. Subversion 1.4 upgraded all working copies in a way that made them incompatible with version 1.3 clients. And, the lord be praised, pecl4win is still chugging along and using the version 1.3 client headers! Which means it can't read any of my Subversion checkouts. Since compiling PHP on Windows was an utter disaster, I can only hope that they upgrade it soon. I have filed a bug accordingly.
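The realpath() workaround from item 2 boils down to making every path absolute against the current working directory before handing it to the extension. A small sketch of that idea, in Python rather than PHP; the helper name absolutize is made up, and nothing here touches the SVN extension itself:

```python
import os


def absolutize(path, base=None):
    """Resolve `path` against `base` (default: the current working directory).

    The result is absolute, so a library that resolves relative paths
    against some other directory can no longer misinterpret it.
    """
    if base is None:
        base = os.getcwd()
    # os.path.join leaves `path` alone if it is already absolute.
    return os.path.realpath(os.path.join(base, path))


if __name__ == "__main__":
    print(absolutize("trunk/library"))
```

Every filename would then pass through the helper before reaching the library call, which is tedious but mechanical.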

In the meantime, I will try to get docs for the SVN extension out as soon as possible anyway, so more people will use the extension, and Alan and Wez will have more incentive to release another version. :-)

Monday, May 21, 2007

Dependency Chart and Testing Data Implementations

I just hand-drew a dependency chart of HTML Purifier, and let me say, it ain't pretty. Unfortunately, I don't have any pictures for you folks (not that "you folks" exist, heh, heh) since I did the diagram in pencil and it needs cleaning up, perhaps some SVG-izing, before it gets released into the wild. Still, there's some interesting information to be picked up from this draft diagram.

First of all, in many areas of HTML Purifier, there are "sub-systems" which are self-contained and don't have any dependencies outside of their group. Examples:

  • Encoder, EntityParser and EntityLookup
  • PercentEncoder
  • AttrDef_URI_Host, IPv4, IPv6

These tend to be independent components that could easily have been shipped in some other library, but for some reason or another were needed by HTML Purifier. They are also used in very few places (one or two at most).

There are two primary areas in the graph that are "dependency hell": the Strategy classes, and the HTMLDefinition sub-system. The Strategy classes have a ridiculously large number of dependencies on all manner of things, which one might suppose is sensible behavior, so I won't worry too much (even though it's like a matrix of lines criss-crossing in and out).

In the HTMLDefinition sub-system, however, you start seeing some strange things. Theoretically speaking, HTMLDefinition is the "facade", so to speak, but Strategies usually also need to interface directly with ElementDef, which in turn results in direct interfacing with ChildDef, AttrDef and AttrTransform, making for a not-so-effective facade. At least it provides a central access point.

Behind it all is HTMLModuleManager. It has dependencies on every object in the compound: HTMLModule, AttrTypes, AttrCollections and ContentSets. Which makes sense, as this class is responsible for coordinating the actions of all these sub-objects. So how do these sub-objects work together?

In extremely strange ways, it turns out. Imagine an HTMLModule, which is the bread and butter of HTMLModuleManager. It essentially is a container of lots of ElementDef objects. The ElementDef object, however, is not always fully formed. It relies on HTMLModuleManager, AttrCollections and ContentSets (and, indirectly, AttrTypes) to kick it into shape. AttrCollections and ContentSets, however, need to know about all the modules to know what to do. Therefore, circular dependencies!

It's not as bad as it seems, for practical purposes. AttrCollections and ContentSets actually depend on different aspects of HTMLModule for their operation, so the dependency chain is broken. However, for unit-testing purposes, this is hellish.
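A toy sketch of how such a two-phase arrangement can break the cycle. This is Python, not HTML Purifier's actual code: the dict layout, the function names merge_content_sets and finish_elements, and the sample modules are all invented for illustration. Phase one reads only the content-set contributions of every module; phase two uses the merged sets to finish the half-baked element definitions.

```python
def merge_content_sets(modules):
    """Phase 1: collect each named content set across ALL modules."""
    merged = {}
    for module in modules:
        for name, members in module["content_sets"].items():
            merged.setdefault(name, []).extend(members)
    return merged


def finish_elements(modules, content_sets):
    """Phase 2: expand content-set names inside each element's content model."""
    finished = {}
    for module in modules:
        for element, allowed in module["elements"].items():
            expanded = []
            for token in allowed:
                # A token is either a content-set name or a literal child.
                expanded.extend(content_sets.get(token, [token]))
            finished[element] = expanded
    return finished


if __name__ == "__main__":
    text = {
        "content_sets": {"Inline": ["span", "em"]},
        "elements": {"span": ["#PCDATA", "Inline"], "em": ["#PCDATA", "Inline"]},
    }
    hypertext = {
        "content_sets": {"Inline": ["a"]},
        "elements": {"a": ["#PCDATA", "Inline"]},
    }
    modules = [text, hypertext]
    print(finish_elements(modules, merge_content_sets(modules)))
```

The two phases look at different parts of a module, so neither needs the other's output from the same module at the same time. The testing headache remains, though: the only way to know the finished elements are right is to already know what they should be.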

To get to the bottom of the problem, consider unit testing an HTMLModule. All a module does is set up a half-baked representation of its elements, as well as some global attributes. How does one determine if the HTMLModule has operated correctly? By comparing it against a module that is known to be correctly formed... but producing that is the whole point of HTMLModule! So, unless we figure out how to parse DTDs or XML Schemas (which are not guaranteed to be correct), we have no way of directly unit testing the class.

I tried to get around this by testing not for structure, but for the behavior of the ensuing HTMLModule. But this causes problems. Here's an analogy: we're able to test that 2 + 2 = 4. We're not, however, able to test that our worksheet says 2 + 2 (because if we did, we would have simply copied our worksheet). We could try testing the worksheet's behavior (sic our math evaluator on it and check the answers), but we end up redundantly expressing what our tests of 2 + 2 = 4 already had shown us. If Big Brother becomes our client and 2 + 2 = 5, we have to change the assertions for both the low-level 2 + 2 = 4 check, and the corresponding integration test.

The logical recourse would be to mock the subsystem. But as our dependency chart shows us, this is easier said than done.

It quite possibly boils down to this: the whole point of testing is that, after we've finished our work, we can go through a much simpler algorithm to verify our results. For a mathematical equation, this is simple: we can simply make some problems and an answer key. For an HTML definition, this is not simple: we cannot make "problems", for they are more complicated than the definitions themselves.

Maybe a smoketest is the way to go. Grumble grumble.

Saturday, May 19, 2007

It's... It's ALIVE! (Tidy)

After an afternoon of intense, heavy refactoring, I'm happy to announce that the Tidy module setup has been committed to the trunk! Instead of ugly, inflexible TransformThis and TransformThat modules, cleaning and fixing up of poorly formed HTML has been refactored into Tidy_XHTML, Tidy_XHTMLStrict and Tidy_XHTMLAndHTML4. Documentation will be coming soon.

For the average Joe user, this architectural change will now allow me to implement deprecated attributes and, when in Transitional mode, not transform them unless you want me to.

Addendum: The Tidy module has nothing to do with Dave Raggett's HTML Tidy. The name was, however, borrowed from his excellent program.

Update: Documentation in FAQ form available for Tidy

Friday, May 18, 2007

The necessity of entity-izing quotes

I was quite surprised today to find that a document with un-escaped double quotes in character data passed validation. Classically, I've always believed that double quotes had to be escaped with &quot;, just as the ampersand needs to be escaped. Apparently, this is not the case:

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form

The spec goes on to say:

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

I have previously stated that escaping quotes is required as per the specification. My bad!
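The distinction is easy to see in code. A minimal sketch using Python's standard html.escape (Python rather than PHP; the wrapper names escape_text and escape_attr are made up): in character data only the markup characters need escaping, while inside a quoted attribute value the quotes must be escaped too.

```python
from html import escape


def escape_text(s):
    """Escape for character data: &, < and > only; quotes stay literal."""
    return escape(s, quote=False)


def escape_attr(s):
    """Escape for a quoted attribute value: quotes are escaped as well."""
    return escape(s, quote=True)


if __name__ == "__main__":
    print("<p>" + escape_text('He said "hi" & left') + "</p>")
    print('<p title="' + escape_attr('He said "hi"') + '">...</p>')
```

Escaping quotes in character data is harmless, just not required.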

Thursday, May 17, 2007

Blogger Linebreaking

I found myself turning off Blogger's automatic line-breaking functionality, mainly because it was getting in the way of me writing semantic, well-spaced out markup in the HTML editor. Normally, I consider auto-line-breaking to be quite a useful feature: after all, who wants to slog out a <p> tag every time they want to create a paragraph? There are, however, two problems with Blogger's line-breaking algorithm:

  1. It uses br tags, which are extremely semantically incorrect. The correct tag should be the p tag.
  2. It is extremely dumb, inserting br tags where there should not be any. This effectively makes writing raw HTML impossible, as any time you try to add whitespace it gets converted into an onslaught of br tags.

Fortunately, they do allow it to be turned off, which is where it shall be staying! It's a pity, though, that the Preview feature doesn't recognize the setting and convert the HTML accordingly (which makes it effectively useless too, meh). I ought to file a bug report, but it'll probably get swept under the rug.

This is all the more reason to step up development of HTML Purifier's own projected smart auto-line-break functionality, which will only add paragraph tags when it makes sense to! Yippee!
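For a rough idea of what "only when it makes sense" could mean, here is a naive sketch in Python. It is not HTML Purifier's algorithm, which did not exist yet at the time of writing; the function name auto_paragraph and the list of block-level tags are assumptions. Blank lines separate chunks, each chunk is wrapped in a p tag, and chunks that already start with a block-level tag are left alone. Single line breaks are never touched.

```python
import re

# Assumed (incomplete) set of tags that should not be wrapped in <p>.
BLOCK_START = re.compile(r"<(p|div|ul|ol|pre|blockquote|table|h[1-6])\b", re.I)


def auto_paragraph(text):
    """Wrap blank-line-separated chunks in <p>, skipping block-level chunks."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    out = []
    for chunk in chunks:
        if BLOCK_START.match(chunk):
            out.append(chunk)  # already block-level markup: leave as-is
        else:
            out.append("<p>" + chunk + "</p>")
    return "\n\n".join(out)


if __name__ == "__main__":
    print(auto_paragraph("Hello\nworld\n\n<ul><li>x</li></ul>\n\nBye"))
```

A real implementation would have to work on parsed tokens rather than regular expressions, since a chunk can contain block-level markup in the middle.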