JTidy not handling some characters correctly

Certain characters get mangled after I call Tidy.parse. Two examples are ’ instead of ' and ∼ instead of ~. I'm guessing that these came from Word or something similar, but Tidy handles them very badly. Specifically, it converts them to the individual entity representations for the diacritics, which then get converted to meaningless junk later in my process. I'm sure there are others, but these are the ones I have found so far. Is there any known way to convert these beforehand or ignore them as part of the tidy? Tidy t...
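
The "convert these beforehand" idea from the question could be sketched with the standard library alone: replace the typographic characters with ASCII equivalents before the text reaches Tidy. The class name `PunctuationNormalizer` and the replacement table are illustrative assumptions, not part of JTidy. If the junk actually comes from the input being read with the wrong character encoding, fixing the encoding would be the better route and this sketch would not help.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of JTidy): maps a few Word-style
// typographic characters to plain ASCII before the text is tidied.
public class PunctuationNormalizer {

    // Illustrative table only; extend it with whatever characters show up.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();

    static {
        REPLACEMENTS.put("\u2018", "'");   // left single quotation mark
        REPLACEMENTS.put("\u2019", "'");   // right single quotation mark
        REPLACEMENTS.put("\u201C", "\"");  // left double quotation mark
        REPLACEMENTS.put("\u201D", "\"");  // right double quotation mark
        REPLACEMENTS.put("\u223C", "~");   // tilde operator
    }

    // Returns a copy of the input with every mapped character replaced.
    public static String normalize(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("it\u2019s \u223C5 minutes"));
    }
}
```

The normalized string would then be passed to the parser in place of the raw input.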

Comments getting escaped with NekoHTML (or JTidy) + XOM

I'm using NekoHTML to clean up some HTML, and then feeding it to XOM to get an object model. Somewhere in the course of this, comments are getting escaped.

Here's a relevant example of the input HTML (most of the <head> cut for clarity):

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html lang="en">
    <head>
    <script type="text/JavaScript">
    <!-- // Hide the JS
    startTimeout(6000000, "/");
    // -->
    </script>

Here's the code:

    // XOMSa...

Run the jtidy tests

I'm trying to run the unit tests in the jtidy source, but I'm getting this exception. Does anyone know how to fix this? I'm guessing the package folder is not set up right.

    java.lang.Error: java.util.MissingResourceException: Can't find bundle for base name org/w3c/tidy/TidyMessages, locale en_US
        at org.w3c.tidy.Report.(Report.java:649)
        at org.w3c.tidy.Tidy.(Tidy.java:135)
        at org.w3c.tidy.TidyTestCase.setUp(TidyTestCase.java:153)
        at junit.framework.TestCase.runBare(TestCase.java:128)
        at junit.framework.TestResult$1.protect(T...
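
A `MissingResourceException` of this kind means `ResourceBundle` could not locate the bundle through the class loader. A quick diagnostic, sketched below, is to ask the class loader directly whether the resource is visible. The class name `ClasspathCheck` is hypothetical, and the `.properties` suffix assumes the bundle is stored as a properties file, which the excerpt does not confirm.

```java
import java.net.URL;

// Hypothetical diagnostic helper: reports whether a resource path
// is visible to the class loader that loaded this class.
public class ClasspathCheck {

    // Returns true if the class loader can find the given resource path.
    public static boolean isOnClasspath(String resourcePath) {
        URL url = ClasspathCheck.class.getClassLoader().getResource(resourcePath);
        return url != null;
    }

    public static void main(String[] args) {
        // Assumed file name, derived from the base name in the exception message.
        String bundle = "org/w3c/tidy/TidyMessages.properties";
        System.out.println(bundle + " visible: " + isOnClasspath(bundle));
    }
}
```

If the check reports that the file is not visible when run with the same classpath as the tests, the resources directory is probably not being copied to, or included in, the test classpath.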

Can I prevent JTidy from converting an apostrophe in an attribute value to an entity

My input HTML has a line similar to this:

    <div class="image" style="background:url('/images/someImage.jpg') no-repeat;"/>

which JTidy is converting to

    <div class="image" style="background:url(&apos;/images/someImage.jpg&apos;) no-repeat;"/>

Is there a way to suppress that entity conversion? There appears to be a config method for preventing double quotes from being converted (setQuoteMarks()), but I don't see anything similar for apostrophes....