lucene-java-user mailing list archives

From aurora <>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 16:24:33 GMT
For all the parser suggestions, I think there is one important attribute
to consider. Some parsers return data only if the input HTML is sensible.
Others are designed to be as flexible and tolerant as they can be. If the
input is clean and controlled, the former class is sufficient; even some
regular expressions may be sufficient. (I think that's what the original
poster wants.) If you are building a web crawler, you need something
really tolerant.

I once prototyped a nice, fast parser. Later I had to abandon it because
it failed to parse about 15% of documents (it had problems handling nested
quotes like onclick="alert('hi')").
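
To illustrate the kind of nested-quote failure I mean, here is a minimal
sketch. It is not the parser I abandoned; the class name AttributeQuotes
and both patterns are hypothetical and exist only for this example. The
naive pattern lets either quote character close a value, so it cuts the
onclick value short. The quote-aware pattern requires the closing quote to
match the opening one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class AttributeQuotes {
    // Naive: either quote character may open or close the value.
    static final Pattern NAIVE =
        Pattern.compile("(\\w+)=[\"']([^\"']*)[\"']");

    // Quote-aware: a value opened with " must close with ", and likewise for '.
    static final Pattern QUOTE_AWARE =
        Pattern.compile("(\\w+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Collects attribute name/value pairs found in a single tag string.
    static Map<String, String> attributes(Pattern pattern, String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = pattern.matcher(tag);
        while (m.find()) {
            String value = m.group(2);
            // The quote-aware pattern puts single-quoted values in group 3.
            if (value == null && m.groupCount() >= 3) {
                value = m.group(3);
            }
            result.put(m.group(1), value);
        }
        return result;
    }

    public static void main(String[] args) {
        String tag = "<a onclick=\"alert('hi')\">";
        System.out.println(attributes(NAIVE, tag));
        System.out.println(attributes(QUOTE_AWARE, tag));
    }
}
```

Even the quote-aware version only suits clean, controlled input. It still
does not cope with unquoted values or unbalanced quotes, which is where a
tolerant parser is needed.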

> No one has yet mentioned using ParserDelegator and ParserCallback that  
> are part of HTMLEditorKit in Swing.  I have been successfully using  
> these classes to parse out the text of an HTML file.  You just need to  
> extend HTMLEditorKit.ParserCallback and override the various methods  
> that are called when different tags are encountered.
> On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
>> Three HTML parsers (Lucene web application
>> demo, CyberNeko HTML Parser, JTidy) are mentioned in
>> Lucene FAQ
>> 1.3.27. Which is the best? Can it filter tags that are
>> auto-created by the MS Word 'Save As HTML files' function?
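
For reference, the ParserDelegator / ParserCallback approach quoted above
can be sketched roughly as follows. This is only a minimal example under
my own assumptions: the class name SwingTextExtractor and the extract
helper are hypothetical, and it overrides only handleText, whereas a real
indexer would probably override the tag callbacks as well.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that collects the text runs reported by the Swing parser.
class SwingTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    // Parses HTML from a reader and returns the collected text.
    static String extract(Reader html) throws IOException {
        SwingTextExtractor callback = new SwingTextExtractor();
        // Third argument true: ignore any charset directive in the document.
        new ParserDelegator().parse(html, callback, true);
        return callback.text.toString();
    }

    // Convenience overload for in-memory strings.
    static String extract(String html) {
        try {
            return extract(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The resulting string could then be handed to whatever builds the Lucene
document. How well this copes with badly broken markup is something to
test against your own data.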
