lucene-java-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Performance of hit highlighting and finding term positions for a specific document
Date Wed, 31 Mar 2004 17:16:03 GMT
Kevin A. Burton wrote:
> I'm playing with this package:
> 
> http://home.clara.net/markharwood/lucene/highlight.htm
> 
> Trying to do hit highlighting.  This implementation uses another 
> Analyzer to find the positions for the result terms.
> This seems that it's very inefficient

Does it just seem inefficient, or is it actually too inefficient in 
practice?  Folks have benchmarked this, and for documents shorter than 
10k characters or so, re-tokenizing is fast enough.  But it can be slow 
if most of your documents are longer than that.

Several solutions have been proposed.  The simplest is to not scan past 
the first 10k or so for snippets unless nothing relevant is found in the 
first 10k.  I don't think Mark's highlighter yet does this, but I might 
be mistaken.
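
A minimal sketch of that cutoff idea, in plain Java rather than against 
Mark's highlighter or any Lucene API.  The class name, the method name, 
the letters-and-digits tokenization and the caller-supplied limit are all 
assumptions made for illustration.  It collects the character offsets of 
matching terms and stops at the limit once it has at least one hit; if the 
region before the limit has no hit, it keeps scanning and stops after the 
first hit it finds further on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or of any existing highlighter.
public class BoundedSnippetScanner {

    /**
     * Returns the start offsets of tokens that match one of the query terms.
     * Terms are expected in lower case. Scanning stops at the first token
     * that starts at or beyond 'limit', but only if a hit was already found.
     */
    static List<Integer> hitOffsets(String text, Set<String> terms, int limit) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip separators (anything that is not a letter or digit).
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Past the cutoff with something to show: stop re-tokenizing.
            if (start >= limit && !hits.isEmpty()) {
                break;
            }
            // Consume one token.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String token = text.substring(start, i).toLowerCase(Locale.ROOT);
                if (terms.contains(token)) {
                    hits.add(start);
                }
            }
        }
        return hits;
    }
}
```

With a limit of 10000 this re-tokenizes only the head of a long document 
in the common case, and pays for a longer scan only when the head 
contains no query term.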

> since lucene already knows the 
> frequency and position of given terms in the index.

Lucene indexes record that a term is the nth term, not that it occurs at 
the nth character in the text.  The latter is needed for highlighting, 
but storing this would make indexes much larger and slower to update.
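
To make the distinction concrete, here is a small plain-Java sketch (not 
Lucene code; the class, its fields and the tokenization rule are 
assumptions for illustration).  Each token carries both its term 
position, which is the kind of information the index records, and its 
start and end character offsets, which is what a highlighter needs and 
what re-tokenizing recovers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not a Lucene class.
public class PositionVsOffset {

    static final class Tok {
        final String term;
        final int position; // n: this is the nth term (0-based)
        final int start;    // character offset where the term begins
        final int end;      // character offset just past the term

        Tok(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on anything that is not a letter or digit, tracking both counters. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        int position = 0;
        while (i < n) {
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(text.substring(start, i), position++, start, i));
            }
        }
        return out;
    }
}
```

For the text "Hit  highlighting in Lucene" (note the double space), the 
term "highlighting" has position 1 but spans characters 5 to 17.  The 
position alone cannot tell you where to put the highlight markup.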

Doug
