Thursday, December 03, 2009

JTidy and handling HTML embedded in database strings

I've been dealing with an interesting issue. A customer's app needs to store documents created by end users. The documents are really just paragraphs created by the end user, using the YUI Rich Text/HTML editor. The user might also use the "plain text" editing mode, in which case they would actually enter tags like <h1> manually. With this freedom comes the inevitable possibility that they will enter invalid HTML. Since these documents will be stored in NextDB, the user might also apply an XSLT transformation over the default HTML presentation of the result set.

The problem is that if the user enters invalid XHTML, like they just throw <blort> into their document, then the XSLT will crash. The problem is made gnarlier by the fact that the documents themselves are likely to be fragments (and in fact they have to be fragments, so that when they appear in the context of the default HTML presentation they get treated as a valid portion of the overall document). In other words, if the user puts <body> tags around their document, you have to strip those body tags off, so that the default HTML presentation doesn't end up with a nested body.

Enter JTidy. The JTidy library is all about handling crapped-up HTML, and fixing it on the fly. The journey to getting JTidy working was longer than I expected due to an unpleasant interaction with Maven. Despite the fact that I had the most recent version of JTidy on my NetBeans project's classpath, when I used Maven to start up Jetty using the mvn Jetty plugin, I would get runtime errors complaining that the method I was trying to call (tidy.setPrintBodyOnly(true);) didn't exist. So, like any good bughunt, the fun began. I knew that an older Tidy.class was somehow sneaking into my runtime classpath. The first place I looked was my local .m2 repository -- no joy. It finally occurred to me that this must be an internal inclusion of an old Tidy jar by Maven. When I ran 'grep -r Tidy.class' on my Maven directory, I found that Maven's 'uber jar' (maven-core-2.0.7-uber.jar) did in fact contain the older version of JTidy. Turns out if you look at the internal dependencies for Maven, you find that it depends on an old version of JTidy. So I unzipped the uber-jar, replaced all the JTidy classes with the latest and greatest, rezipped the jar, and ...badabing badaboom...problem solved. Bing bang boom, very good have a drink.
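For anyone chasing a similar classpath ghost, here is a rough Java equivalent of that grep: a minimal sketch that walks a directory and lists every jar containing a given class entry. The class name JarClassFinder and the method findJarsContaining are made up for illustration, and the default entry path assumes JTidy's usual org.w3c.tidy package, so adjust it if your copy lives somewhere else.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of JTidy or Maven): finds jars that bundle a given class.
class JarClassFinder {

    /**
     * Returns every .jar file under root that contains an entry with exactly
     * this name, e.g. "org/w3c/tidy/Tidy.class".
     */
    static List<Path> findJarsContaining(Path root, String entryName) throws IOException {
        List<Path> jars;
        try (Stream<Path> walk = Files.walk(root)) {
            jars = walk
                    .filter(p -> Files.isRegularFile(p)
                            && p.getFileName().toString().endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<Path> hits = new ArrayList<>();
        for (Path jar : jars) {
            try (ZipFile zip = new ZipFile(jar.toFile())) {
                // getEntry returns null when the jar has no entry with that name
                if (zip.getEntry(entryName) != null) {
                    hits.add(jar);
                }
            } catch (IOException unreadable) {
                // Corrupt or unreadable jar: skip it and keep scanning
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: scan the current directory for JTidy's main class
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String entry = args.length > 1 ? args[1] : "org/w3c/tidy/Tidy.class";
        for (Path jar : findJarsContaining(root, entry)) {
            System.out.println(jar);
        }
    }
}
```

Pointed at the Maven install directory, something like this should list the uber jar alongside any other jar that bundles its own copy of Tidy.class.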

The actual coding took less than 2 minutes. Argghh, talk about an 80/20 rule. Anyhow, JTidy does exactly what I want: it fixes any crufted HTML that the user might enter, and then extracts only the content of the body element (and even better, if the user doesn't include a body, JTidy fixes that first). So it all boils down to this (just to print the tidied content):

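import java.io.ByteArrayInputStream;
import org.w3c.tidy.Tidy;

Tidy tidy = new Tidy();
tidy.setXHTML(true);          // emit the cleaned-up markup as XHTML
tidy.setPrintBodyOnly(true);  // print only what is inside <body>, so the result stays a fragment
tidy.parse(new ByteArrayInputStream(String.valueOf(val).getBytes()), System.out);  // val is the stored document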

Epilogue: maybe blogger.com should start using Jquery. Ironically, this rich text editor threw up on some tags I typed into this post, and I had to spend a few minutes cleaning up the mess!!!
