lucene-java-user mailing list archives

From Tatu Saloranta <>
Subject Re: commercial websites powered by Lucene?
Date Thu, 26 Jun 2003 04:54:52 GMT
On Wednesday 25 June 2003 09:47, Ulrich Mayring wrote:
> John Takacs wrote:
> > I'd love to try Lucene with the above, but the Lucene install fails
> > because of JavaCC issues.  Surprised more people haven't encountered this
> > problem, as the install instructions are out of date.
> Well, what do you need JavaCC for? Isn't it just the technology for
> building the supplied HTML-Parser? There are much better HTML parsers
> out there, which you can use.

On a related note: has anyone done performance measurements of the various
HTML parsers used for indexing?

I have written a couple of XML/HTML parsers that were optimized for speed
(and/or for leniency, so they can handle or fix non-valid documents), and was
wondering if they might be useful to other people for indexing. One is in
general close to optimal when the document contents are already fully in
memory, as when fetching from a DB; the other uses very little memory while
being only slightly slower. However, using those instead of more standard
parsers would only make sense if the speed improvements are significant.
To determine that, it would be good to have baseline measurements, and/or to
know what the current best candidates are from a performance perspective.
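
A rough way to get such a baseline is to time each candidate over the same set
of in-memory documents. The sketch below is only an illustration: the class
name ParserBaseline, the method docsPerSecond and the idea of wrapping each
parser as a Function from markup to text are all assumptions, not anything
that ships with Lucene, and a single wall-clock run like this is noisy.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical harness: wall-clock throughput of a markup-to-text function. */
public final class ParserBaseline {
    private ParserBaseline() {}

    /**
     * Runs the extractor over all docs once as warm-up, then 'rounds' more
     * times under the clock, and returns documents processed per second.
     */
    public static double docsPerSecond(Function<String, String> extractor,
                                       List<String> docs, int rounds) {
        long sink = 0; // accumulate output lengths so the work is not optimized away
        for (String doc : docs) {
            sink += extractor.apply(doc).length(); // untimed warm-up pass
        }
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String doc : docs) {
                sink += extractor.apply(doc).length();
            }
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // practically never true; keeps 'sink' live
        }
        return (double) rounds * docs.size() * 1e9 / elapsedNanos;
    }
}
```

Each parser under comparison would be passed in as the extractor with the same
document list, so that only the parsing step differs between runs.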

The thing is that creating a parser that only cares about textual content (and
perhaps in some cases about the surrounding element, but not about attributes,
structure, DTD/Schema, validity, etc.) is fairly easy. Since indexing is
often the most CPU-intensive part of a search engine, it may make sense to
optimize this part heavily, up to and including using specialized parsers.
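
To illustrate how little such a text-only parser needs to do, here is a
minimal sketch in Java. It is not one of the parsers mentioned in this
message; the class name TextOnlyExtractor and the method extractText are
hypothetical. It assumes the document is already in memory, and it
deliberately ignores entities, comments, and script/style content.

```java
/** Hypothetical single-pass extractor: keeps character data, drops all tags. */
public final class TextOnlyExtractor {
    private TextOnlyExtractor() {}

    /**
     * Returns the text outside of '<' ... '>' pairs. Runs of whitespace and
     * every tag boundary collapse to a single space, so adjacent elements do
     * not fuse into one token. An unclosed '<' swallows the rest of the input.
     */
    public static String extractText(CharSequence markup) {
        StringBuilder out = new StringBuilder(markup.length());
        boolean inTag = false;        // currently between '<' and '>'
        boolean pendingSpace = false; // a separator is owed before the next char
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                    pendingSpace = true; // treat the tag as a token break
                }
                continue; // attributes and element names are ignored entirely
            }
            if (c == '<') {
                inTag = true;
                continue;
            }
            if (Character.isWhitespace(c)) {
                pendingSpace = true;
                continue;
            }
            if (pendingSpace && out.length() > 0) {
                out.append(' ');
            }
            pendingSpace = false;
            out.append(c);
        }
        return out.toString();
    }
}
```

For example, extractText("<p>Hello <b>world</b></p>") yields "Hello world",
which could then be handed to the indexer as the document body. Treating every
tag as a break is a simplification: inline markup inside a word would split
that word into two tokens.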

-+ Tatu +-
