xml-general mailing list archives

From "Paulo Gaspar" <paulo.gas...@krankikom.de>
Subject RE: [ANNOUNCE] Xerces HTML Parser
Date Sat, 16 Feb 2002 02:14:37 GMT
Make it two persons!

Paulo Gaspar

> -----Original Message-----
> From: Scott Sanders [mailto:ssanders@nextance.com]
> Sent: Thursday, February 14, 2002 11:12 PM
> To: general@xml.apache.org
> Subject: RE: [ANNOUNCE] Xerces HTML Parser
> 
> 
> I personally find this to be greatly helpful, after having completely
> hacked JTidy to take care of most of the 'edge' conditions in malformed
> HTML that we could find, just to get a DOM, just to be able to use XSLT.
> If this was part of Xerces, or even an add-in, it would be greatly
> appreciated by at least one person.
> 
> Scott Sanders
> 
> > -----Original Message-----
> > From: Andy Clark [mailto:andyc@apache.org] 
> > Sent: Thursday, February 14, 2002 1:53 PM
> > To: xerces-j-dev@xml.apache.org; 
> > xerces-j-user@xml.apache.org; general@xml.apache.org
> > Subject: Re: [ANNOUNCE] Xerces HTML Parser
> > 
> > 
> > It was bugging me that the first version of the NekoHTML parser 
> > could only handle the character encoding "Cp1252" (which is 
> > the basic Windows encoding), so I updated the code to be able 
> > to automatically handle UTF-8 (w/ BOM) and UTF-16. In 
> > addition, it can detect the presence of a <meta 
> > http-equiv='content-type' content='text/html; charset=XXX'> 
> > tag and scan the remaining document using charset "XXX", 
> > assuming that Java has an appropriate decoder available.
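
A minimal sketch of the two detection steps described above, not the NekoHTML code itself. The class and method names (`EncodingSniffer`, `detectBom`, `charsetFromContentType`) are hypothetical, and the sketch assumes the caller wraps the input in a `PushbackInputStream` with room for at least three bytes.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class EncodingSniffer {
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    private EncodingSniffer() {}

    /**
     * Reads up to three leading bytes. Returns "UTF-8", "UTF-16BE" or
     * "UTF-16LE" when a byte order mark is found (the BOM stays consumed),
     * otherwise null. Bytes that are not part of a BOM are pushed back.
     */
    static String detectBom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before three bytes arrived
            n += r;
        }
        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;

        String name = null;
        int bomLength = 0;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            name = "UTF-8";
            bomLength = 3;
        } else if (b0 == 0xFE && b1 == 0xFF) {
            name = "UTF-16BE";
            bomLength = 2;
        } else if (b0 == 0xFF && b1 == 0xFE) {
            name = "UTF-16LE";
            bomLength = 2;
        }
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return name;
    }

    /**
     * Extracts the charset from a content value such as
     * "text/html; charset=XXX". Returns null when no charset is given
     * or when this JVM has no decoder for it.
     */
    static String charsetFromContentType(String content) {
        Matcher m = CHARSET.matcher(content);
        if (!m.find()) return null;
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalArgumentException e) {
            return null; // syntactically illegal charset name
        }
    }
}
```

The `Charset.isSupported` check corresponds to the "assuming that Java has an appropriate decoder available" condition.
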
> > 
> > You can download the latest code from the following URL:
> > 
> >   http://www.apache.org/~andyc/
> > 
> > I am very interested in hearing from people to see if the 
> > code is useful and if they think it should be a standard part 
> > of Xerces-J. 
> > 
> > Solving the problem of changing the character decoder in the 
> > middle of the stream when the <meta> tag is detected was 
> > rather interesting. If you want to know the technical 
> > details, read on...
> > 
> > The code isn't that complicated but it turned out to be not
> > as straightforward as I thought. First, the Java decoders 
> > have a nasty habit of reading 8K of bytes even when asked for 
> > as little as a single character! This is annoying, at best, 
> > because you can't switch decoders once the original 
> > decoder has already consumed more bytes than it should.
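
The read-ahead can be observed with a small experiment. This is a sketch under assumptions: `ReadAheadDemo` is a hypothetical name, and the exact number of bytes pulled from the stream depends on the JVM's `InputStreamReader` implementation, so the code reports the count rather than asserting 8K.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class ReadAheadDemo {
    private ReadAheadDemo() {}

    /**
     * Asks the reader for a single character and returns how many bytes
     * it pulled from the underlying stream to deliver it.
     */
    static int bytesConsumedForOneChar(byte[] data) throws IOException {
        ByteArrayInputStream bytes = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(bytes, StandardCharsets.ISO_8859_1);
        reader.read(); // request exactly one character
        // Whatever is no longer available was swallowed by the decoder.
        return data.length - bytes.available();
    }
}
```

With a single-byte encoding such as ISO-8859-1, one character needs one byte, so any result above 1 is read-ahead that a replacement decoder would never get to see.
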
> > 
> > Then, even if the Java decoders were written to only consume
> > as many bytes as needed to return the requested characters, 
> > there's still a problem caused by buffering. Since I buffer a 
> > block of characters to improve performance, this again 
> > consumes bytes *past* the <meta> tag which will destroy any 
> > chance of changing the decoder mid-stream.
> > 
> > So to solve this problem, I wrote a "playback" input stream 
> > which buffers all of the bytes read on the underlying input 
> > stream. If the scanner detects a <meta> tag that changes the 
> > encoding, then the stream is played back again. And if the 
> > <body> tag is found (or a tag whose parent should be the 
> > <body> tag), then the buffer is cleared. So at worst, just 
> > the beginning of the document is buffered, which isn't 
> > too bad.
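
One way such a "playback" stream could look, as a minimal sketch rather than the actual NekoHTML class. The name `PlaybackStream` and its `playback()` and `clear()` methods are assumptions made for illustration; `skip()` and `available()` are left unadapted.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: records every byte read so the stream can be replayed.
final class PlaybackStream extends FilterInputStream {
    private ByteArrayOutputStream recorded = new ByteArrayOutputStream(); // null once cleared
    private byte[] replay; // bytes currently being replayed, or null
    private int replayPos;

    PlaybackStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (replay != null) {
            if (replayPos < replay.length) return replay[replayPos++] & 0xFF;
            replay = null; // replay exhausted, fall through to the real stream
        }
        int b = in.read();
        if (b >= 0 && recorded != null) recorded.write(b);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (replay != null) {
            if (replayPos < replay.length) {
                int n = Math.min(len, replay.length - replayPos);
                System.arraycopy(replay, replayPos, buf, off, n);
                replayPos += n;
                return n;
            }
            replay = null;
        }
        int n = in.read(buf, off, len);
        if (n > 0 && recorded != null) recorded.write(buf, off, n);
        return n;
    }

    /** Rewinds to the first recorded byte, e.g. after a <meta> charset is seen. */
    void playback() {
        if (recorded == null) throw new IllegalStateException("buffer already cleared");
        replay = recorded.toByteArray();
        replayPos = 0;
    }

    /** Stops recording and frees the buffer, e.g. once <body> is reached. */
    void clear() {
        recorded = null;
    }
}
```

After `playback()`, a fresh `InputStreamReader` with the new charset can be wrapped around the same stream; the discarded reader's read-ahead no longer matters because the bytes are served again from the recording.
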
> > 
> > You may notice that if the stream is played back, then the 
> > parser will scan document contents that it has already 
> > seen. This was simple enough to fix, though. When the
> > character encoding is changed, I note how many elements I
> > have already seen. Then, when the stream is re-scanned, I 
> > ignore the events until the number of elements is back to 
> > where I was when I detected the <meta> tag.
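
The element-counting trick can be sketched as a small gate in front of the event handler. `ElementReplayGate`, `onElement()` and `restart()` are hypothetical names, and the sketch tracks only start-element events; a real scanner would have to suppress the other events in between as well.

```java
// Hypothetical sketch: suppresses events that were already delivered
// before the stream was played back.
final class ElementReplayGate {
    private int seen;      // elements seen in the current pass
    private int skipUntil; // elements already delivered in the previous pass

    /** Call when the encoding changes and the stream is played back. */
    void restart() {
        skipUntil = seen;
        seen = 0;
    }

    /**
     * Call for each start-element event. Returns true when the event is
     * new and should be forwarded, false when it was seen before.
     */
    boolean onElement() {
        seen++;
        return seen > skipUntil;
    }
}
```

For example, if `<html>`, `<head>` and `<meta>` were delivered before the charset was detected, then after `restart()` the first three calls to `onElement()` return false and the fourth returns true.
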
> > 
> > So there's got to be an easier way to change the decoder
> > of the stream than to go through all of this trouble,
> > right? Not unless I want to re-write every known character 
> > decoder. So I'm stuck with this kind of a solution. But it 
> > seems to work very well.
> > 
> > -- 
> > Andy Clark * andyc@apache.org
> > 
> > ---------------------------------------------------------------------
> > In case of troubles, e-mail:     webmaster@xml.apache.org
> > To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
> > For additional commands, e-mail: general-help@xml.apache.org
> > 
> > 
> 
