lucene-java-user mailing list archives

From Dawid Weiss <>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 10:18:30 GMT

Two suggestions; try experimenting with both:

1) I would try to write a lexical scanner that strips HTML tags, much 
like the regular expression does. Java lexical scanner packages produce 
nice pure Java classes that seldom use any advanced API, so they should 
work on Java 1.1. They are simple state machines with states encoded in 
integers -- this should work like a charm, be fast and small.

2) Write a parser yourself. Once you have the regular expression, it isn't 
that difficult to do... :)
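
To make point 1 more concrete, here is a minimal hand-written sketch of such a state machine. It is only an illustration under my own assumptions: the class name TagStripper, the state names and the strip method are hypothetical, and a generated scanner would look different. It uses nothing beyond String and StringBuffer, so it should in principle suit an old runtime, though I have not tested it on Java 1.1. It does not handle comments, script or style content, or character entities.

```java
// Hypothetical helper: class, method and state names are illustrative only.
final class TagStripper {
    // States encoded as plain ints.
    private static final int TEXT = 0;        // outside any tag
    private static final int TAG = 1;         // between '<' and '>'
    private static final int TAG_DQUOTE = 2;  // inside "..." within a tag
    private static final int TAG_SQUOTE = 3;  // inside '...' within a tag

    private TagStripper() {}

    // Returns the input with everything between '<' and '>' dropped.
    static String strip(String html) {
        StringBuffer out = new StringBuffer(html.length());
        int state = TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') state = TAG;
                    else out.append(c);
                    break;
                case TAG:
                    if (c == '>') state = TEXT;
                    else if (c == '"') state = TAG_DQUOTE;
                    else if (c == '\'') state = TAG_SQUOTE;
                    break;
                case TAG_DQUOTE:
                    // A '>' inside a quoted attribute value must not end the tag.
                    if (c == '"') state = TAG;
                    break;
                case TAG_SQUOTE:
                    if (c == '\'') state = TAG;
                    break;
            }
        }
        return out.toString();
    }
}
```

Calling TagStripper.strip on a page's contents gives back the text with the tags dropped. The two quote states are there so that a '>' inside an attribute value is not mistaken for the end of the tag.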


Karl Koch wrote:
> I apologise in advance if some of what I write here has been said before.
> The last three answers to my question have suggested pattern matching
> solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
> is something I cannot use since I work with Java 1.1 on a PDA.
> I am wondering if somebody knows of a simple piece of source code with low
> requirements that runs under this tight specification.
> Thank you all,
> Karl
>>No one has yet mentioned using ParserDelegator and ParserCallback that 
>>are part of HTMLEditorKit in Swing.  I have been successfully using 
>>these classes to parse out the text of an HTML file.  You just need to 
>>extend HTMLEditorKit.ParserCallback and override the various methods 
>>that are called when different tags are encountered.
>>On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
>>>Three HTML parsers (Lucene web application
>>>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>>>Lucene FAQ
>>>1.3.27. Which is the best? Can it filter tags that are
>>>auto-created by the MS Word 'Save As HTML files' function?
>>Bill Tschumy
>>Otherwise -- Austin, TX