xml-general mailing list archives

From Andy Clark <an...@apache.org>
Subject Re: [ANNOUNCE] Xerces HTML Parser
Date Thu, 14 Feb 2002 21:52:52 GMT

It was bugging me that the first version of the NekoHTML parser 
could only handle the character encoding "Cp1252" (which is the
basic Windows encoding), so I updated the code to be able to
automatically handle UTF-8 (w/ BOM) and UTF-16. In addition,
it can detect the presence of a <meta http-equiv='content-type'
content='text/html; charset=XXX'> tag and scan the remaining
document using charset "XXX", assuming that Java has an
appropriate decoder available.
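
A minimal sketch of the BOM check described above, assuming the first few bytes of the document are already in hand. The class and method names (BomSniffer, sniff) are hypothetical and are not taken from the NekoHTML source; the fallback encoding is passed in by the caller.

```java
/** Hypothetical helper, not NekoHTML's actual code. */
final class BomSniffer {

    /**
     * Guesses an encoding name from the leading bytes of a document.
     * Returns the fallback when no byte order mark is recognized.
     */
    static String sniff(byte[] head, String fallback) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // UTF-16 BOM: FE FF (big-endian) or FF FE (little-endian).
        // Java's "UTF-16" decoder uses the BOM itself to pick the byte order.
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return "UTF-16";
            }
        }
        return fallback;
    }
}
```

A caller would pass "Cp1252" as the fallback to match the default mentioned above. Whether the BOM bytes are then skipped or left for the decoder is not covered by this sketch.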

You can download the latest code from the following URL:


I am very interested in hearing from people to see if the 
code is useful and if they think it should be a standard part 
of Xerces-J. 

Solving the problem of changing the character decoder in the
middle of the stream when the <meta> tag is detected was
rather interesting. If you want to know the technical 
details, read on...

The code isn't that complicated, but it turned out to be less
straightforward than I thought. First, the Java decoders have
a nasty habit of reading 8K of bytes even when you ask for
as little as a single character! This is annoying, at best:
you can't swap in a new decoder, because the original
decoder has already consumed more bytes than it should have.
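
The read-ahead is easy to observe with a small experiment. The sketch below (ReadAheadDemo is a hypothetical name) asks an InputStreamReader for one character and reports how many bytes left the underlying stream; the exact count depends on the JDK's internal buffer size, so the code only measures it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical demo class, for illustration only. */
final class ReadAheadDemo {

    /** Returns how many bytes the reader pulled from the stream to deliver one character. */
    static int bytesConsumedForOneChar(int totalBytes) {
        byte[] data = new byte[totalBytes];
        Arrays.fill(data, (byte) 'a');
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
        try {
            reader.read(); // ask for a single character
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Whatever is no longer available was swallowed by the decoder.
        return totalBytes - in.available();
    }

    public static void main(String[] args) {
        System.out.println("bytes consumed: " + bytesConsumedForOneChar(20000));
    }
}
```

Any result larger than one byte means a second decoder attached to the same stream would start in the wrong place.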

Then, even if the Java decoders were written to consume only
as many bytes as needed to return the requested characters,
there's still a problem caused by buffering. Since I buffer
a block of characters to improve performance, the scanner again
consumes bytes *past* the <meta> tag, which destroys any
chance of changing the decoder mid-stream.

So to solve this problem, I wrote a "playback" input stream
which buffers all of the bytes read on the underlying input
stream. If the scanner detects a <meta> tag that changes
the encoding, then the stream is played back again. And if
the <body> tag is found (or a tag whose parent should be
the <body> tag), then the buffer is cleared. So at worst,
just the beginning of the document is buffered, which isn't
too bad.
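
One way such a "playback" stream could look is sketched below. This is an illustration under assumptions, not the actual NekoHTML class: the names PlaybackInputStream, playback() and clear() are hypothetical, and skip() and mark() are left unhandled.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch of a recording stream that can be rewound to its start. */
final class PlaybackInputStream extends FilterInputStream {
    private ByteArrayOutputStream recorded = new ByteArrayOutputStream(); // null once cleared
    private byte[] replay = new byte[0]; // bytes being played back
    private int replayPos = 0;

    PlaybackInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (replayPos < replay.length) {
            return replay[replayPos++] & 0xFF;
        }
        int b = in.read();
        if (b != -1 && recorded != null) {
            recorded.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (replayPos < replay.length) {
            // Serve previously recorded bytes first.
            int n = Math.min(len, replay.length - replayPos);
            System.arraycopy(replay, replayPos, buf, off, n);
            replayPos += n;
            return n;
        }
        int n = in.read(buf, off, len);
        if (n > 0 && recorded != null) {
            recorded.write(buf, off, n);
        }
        return n;
    }

    /** Rewinds to the first byte ever read, e.g. after a <meta> charset is found. */
    void playback() {
        if (recorded == null) {
            throw new IllegalStateException("buffer already cleared");
        }
        replay = recorded.toByteArray();
        replayPos = 0;
    }

    /** Stops recording, e.g. once <body> is reached; pending replay bytes are still served. */
    void clear() {
        recorded = null;
    }
}
```

After playback(), a fresh InputStreamReader with the new charset can be wrapped around the same stream, and it will see the document from its first byte.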

You may notice that if the stream is played back, then the
parser will scan document contents that it has already 
seen. This was simple enough to fix, though. When the
character encoding is changed, I note how many elements I
have already seen. Then, when the stream is re-scanned, I
ignore the events until the number of elements is back to
where I was when I detected the <meta> tag.
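
The element-counting trick can be sketched roughly as follows. ReplayFilter and its methods are hypothetical names, and the sketch only gates start-element events; a real scanner would presumably need to gate the other events the same way.

```java
/** Hypothetical sketch: suppress already-reported elements during a re-scan. */
final class ReplayFilter {
    private int elementsSeen = 0;   // elements counted in the current scan
    private int elementsToSkip = 0; // elements already reported before the re-scan

    /** Call when the <meta> tag changes the encoding and the stream is about to be replayed. */
    void encodingChanged() {
        elementsToSkip = elementsSeen;
        elementsSeen = 0;
    }

    /** Call for each start-element; returns true if the event should be forwarded. */
    boolean startElement() {
        elementsSeen++;
        return elementsSeen > elementsToSkip;
    }
}
```

For a document that starts with html, head and meta, the three elements are forwarded on the first pass, suppressed on the second, and the fourth element is the first one forwarded after the re-scan.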

So there's got to be an easier way to change the decoder
of the stream than to go through all of this trouble,
right? Not unless I want to re-write every known character
decoder. So I'm stuck with this kind of a solution. But
it seems to work very well.

Andy Clark * andyc@apache.org
