Friday, February 12, 2010

An update on JTidy and XSLT

A couple of months back I blogged about JTidy, and I have an update to that story. If you plan to run your XHTML through an XSLT transformation, don't use tidy.setXHTML(true). I found the reason after a lot of debugging: there are named entities, such as &acirc; (and many others), that are declared in the XHTML DTD. And guess what? They are NOT valid in plain XML.

Quick refresher: XML supports only five named entities out of the box: &lt;, &gt;, &amp;, &quot;, and &apos;.
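To see the difference for yourself, here is a minimal sketch using the XML parser that ships with the JDK. The class and method names (PredefinedEntities, rootText) are made up for illustration. It parses one snippet that uses only the five predefined entities and one that uses &acirc; with no DTD to declare it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JTidy or any library.
class PredefinedEntities {

    // Returns the text of the root element, or null if the parser
    // rejects the input as not well-formed.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // e.g. a reference to an undeclared entity
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only the five predefined entities: parses fine.
        System.out.println(rootText("<p>&lt;&gt;&amp;&quot;&apos;</p>"));
        // An XHTML-only entity with no DTD: the parser gives up.
        System.out.println(rootText("<p>&acirc;</p>"));
    }
}
```

The second call should come back as null, because nothing in that document declares the entity.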

So you're thinking, no biggie, I don't think my documents will ever have an &acirc; named entity in them, so what do I care? Well, they can creep in by accident if you are allowing users to save inputs into a database. For example, characters outside 7-bit ASCII are encoded as different bytes in ISO-8859-1 (Latin-1), Windows-1252, and UTF-8. One that I have repeatedly seen over the years from European users is the "left quote". It doesn't even exist on an American keyboard, and when it gets posted to a system expecting a different encoding it wreaks havoc, causing a run of several incorrectly decoded characters.
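Here is a small sketch of one way that mangling can happen, assuming the left quote arrives as UTF-8 bytes and something downstream decodes them as Latin-1. The class and method names are hypothetical, and your own pipeline may go wrong in the opposite direction.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Mojibake {

    // Encodes the text as UTF-8, then (wrongly) decodes the bytes as Latin-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+201C is the left double quotation mark; in UTF-8 it is E2 80 9C.
        String garbled = utf8ReadAsLatin1("\u201C");
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X%n", (int) garbled.charAt(i));
        }
    }
}
```

One quote goes in and three characters come out. The first is U+00E2, which is exactly the character that has an &acirc; entity in the XHTML DTD.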

And so, you can get "inadvertent" named entities in your XHTML output due to this sort of character mangling, where the byte stream gets decoded with the wrong encoding. So now, instead of just getting some gibberish in your XSLT output, the entire transformation crashes, and you get no output at all except a complaint that the input document had an unsupported entity reference. Hence the solution: tell JTidy NOT to produce XHTML, but instead to produce plain old XML, like this: tidy.setXmlOut(true); When you do this, JTidy doesn't put these named entities in anymore.

One other surprise I ran into is that XML actually *DOES* support numeric character entities (strictly speaking, numeric character references). So you will still get numeric character entities in the XML output, such as &#128;, which although specifically forbidden by HTML still render properly in most browsers. JTidy happily outputs these in its XML output. You are now safe to apply XSLT transformations, since the input is valid XML. Just be aware that the XHTML output of your transformation may still contain the aforementioned forbidden, but practically allowed, numeric entities.
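As a quick check of that claim, here is a sketch that pushes a document containing &#128; through the JDK's built-in identity transform (a Transformer created with no stylesheet). The class and method names are invented for the example, and the identity transform is only a stand-in for whatever real stylesheet you would apply.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical demo class, for illustration only.
class NumericReferenceDemo {

    // Runs the XML through an identity transform and returns the
    // text content of the resulting root element.
    static String transformedText(String xml) {
        try {
            // newTransformer() with no stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            identity.transform(new StreamSource(new StringReader(xml)), result);
            Document doc = (Document) result.getNode();
            return doc.getDocumentElement().getTextContent();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = transformedText("<p>&#128;</p>");
        System.out.printf("length=%d, code point=U+%04X%n",
                text.length(), text.codePointAt(0));
    }
}
```

The transform goes through without complaint, and the reference comes out as the single character U+0080. How it gets written back out as markup depends on the serializer and output encoding you pick.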

Confused yet? Me too. Let's all speak Esperanto and adopt 7-bit ASCII as the only allowed character set.

