Overriding DTD files in Java


By chri - Posted on 19 January 2008

Speed increase

Some people know I'm working on an application called mac2date. For that application I need to parse XML files stored on the hard drive. As every XML file should, they declare their DTD (or a schema, for newer systems). But many of the files I read declare it with a URL on the internet.
This means the XML parser has to fetch the online DTD every time it reads an XML file, which is really slow. I finally found a way to override the declared DTD with a local one.

The difference? Parsing 100 XML files went from 33 seconds to 2 seconds.

The code

First, register a custom entity resolver on the parser:

// construct the DocumentBuilder (the XML parser)
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// override the DTD declared in the pList file and use a local one
documentBuilder.setEntityResolver(new PListEntityResolver());
// add the rest of your parsing code here

And the needed class:

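package org.mac2date;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Answers every external entity request with the local property-list DTD. */
public class PListEntityResolver implements EntityResolver {
  // InputSource takes a system identifier (a URI), so use a file: URI rather than a bare path
  private static final String LOCAL_DTD = "file:///System/Library/DTDs/PropertyList.dtd";

  @Override
  public InputSource resolveEntity(String publicId, String systemId) {
    return new InputSource(LOCAL_DTD);
  }
}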
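The resolver above hands back the same DTD for every entity the parser asks about. If your files might reference other entities too, a slightly more selective variant could look like the sketch below. It is only an illustration: the class name SelectiveResolverDemo, the in-memory stand-in DTD, and the sample document are made up for this example, and I'm assuming the files declare Apple's usual system ID (http://www.apple.com/DTDs/PropertyList-1.0.dtd). The stand-in DTD defines an entity so you can see that the local copy, and not the online one, was really used.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class SelectiveResolverDemo {
    // Assumption: the system ID that the XML files declare in their DOCTYPE.
    static final String REMOTE_DTD = "http://www.apple.com/DTDs/PropertyList-1.0.dtd";

    // Hypothetical stand-in for the local DTD, kept in memory so the sketch runs anywhere.
    static final String LOCAL_DTD =
        "<!ELEMENT plist (#PCDATA)>\n<!ENTITY source \"local-dtd\">";

    // Hypothetical sample document; &source; is only defined by the local DTD.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"" + REMOTE_DTD + "\">\n"
        + "<plist>&source;</plist>";

    // Counts how often the local copy was served instead of the online DTD.
    static int localHits = 0;

    // Only the known system ID is redirected; returning null lets the parser
    // fall back to its default resolution for anything else.
    static final EntityResolver RESOLVER = (publicId, systemId) -> {
        if (REMOTE_DTD.equals(systemId)) {
            localHits++;
            InputSource source = new InputSource(new StringReader(LOCAL_DTD));
            source.setSystemId(systemId);
            return source;
        }
        return null;
    };

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(RESOLVER);
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println("root text: " + doc.getDocumentElement().getTextContent());
        System.out.println("local DTD served " + localHits + " time(s)");
    }
}
```

In a real application you would return an InputSource pointing at the DTD file on disk instead of the in-memory string, as in the class above.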
Chris, you are onto something. Could you explain how this works in standard browsers (Firefox, for example) reading an XHTML page? Do they also always go and fetch "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" and xmlns="http://www.w3.org/1999/xhtml", which are at the top of every page that your (and my) Drupal set-up serves? I have always assumed that browsers would cache these. Thanks, Peter
They seem to keep it in memory. I sniffed the traffic while loading the page and the browser didn't request those DTDs.

We should check the sources, though. I'm downloading the Firefox sources right now...
The files content/base/src/nsNameSpaceManager.cpp and extensions/schema-validation/src/nsSchemaValidator.cpp look the most interesting.

Just run grep -R w3.org * | less on the Firefox sources.




XML Catalogs are another way of achieving this. They can be better if you don't want to change the code itself each time the location changes. A catalog is basically a mapping file between the URL of the DTD/schema and a local copy.

See http://en.wikipedia.org/wiki/XML_Catalog for more details and a Java example
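For what it's worth, here is a rough sketch of the catalog approach using the javax.xml.catalog API, which only exists in Java 9 and later (an assumption about your setup; older JDKs need a separate resolver library). The class name CatalogDemo, the temporary files, the stand-in DTD, and the sample document are all invented for the example, and the system ID is assumed to be Apple's usual property-list one. The catalog is written to a temp directory only to keep the sketch self-contained; normally it would be a file you maintain by hand.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CatalogDemo {
    // Assumption: the system ID that the XML files declare in their DOCTYPE.
    static final String REMOTE_DTD = "http://www.apple.com/DTDs/PropertyList-1.0.dtd";

    // Hypothetical sample document; &source; is only defined by the local stand-in DTD.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"" + REMOTE_DTD + "\">\n"
        + "<plist>&source;</plist>";

    static Document parseWithCatalog(String xml) throws Exception {
        // Write a stand-in DTD and a catalog that maps the remote URL to it.
        Path dir = Files.createTempDirectory("catalog-demo");
        Path dtd = dir.resolve("PropertyList.dtd");
        Files.write(dtd, "<!ELEMENT plist (#PCDATA)>\n<!ENTITY source \"local-dtd\">"
            .getBytes(StandardCharsets.UTF_8));

        Path catalog = dir.resolve("catalog.xml");
        String catalogXml =
            "<?xml version=\"1.0\"?>\n"
            + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
            + "  <system systemId=\"" + REMOTE_DTD + "\" uri=\"" + dtd.toUri() + "\"/>\n"
            + "</catalog>\n";
        Files.write(catalog, catalogXml.getBytes(StandardCharsets.UTF_8));

        // CatalogResolver implements EntityResolver, so it plugs into the same hook.
        CatalogResolver resolver =
            CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog.toUri());

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWithCatalog(SAMPLE);
        System.out.println("root text: " + doc.getDocumentElement().getTextContent());
    }
}
```

When the local copy moves, only the uri attribute in the catalog file changes and the Java code stays as it is.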
