lucene-java-user mailing list archives

From "Jochen Frey" <lucenel...@quontis.com>
Subject RE: Performance of hit highlighting and finding term positions for a specific document
Date Wed, 31 Mar 2004 21:51:49 GMT
> Several solutions have been proposed.  The simplest is to not scan past
> the first 10k or so for snippets unless nothing relevant is found in the
> first 10k.  I don't think Mark's highlighter yet does this, but I might
> be mistaken.
> 
> > since lucene already knows the
> > frequency and position of given terms in the index.
> 
> Lucene indexes record that a term is the nth term, not that it occurs at
> the nth character in the text.  The latter is needed for highlighting,
> but storing this would make indexes much larger and slower to update.
> 

None of the solutions I know about (other than re-parsing) works for us,
because our highlighting must be confined to exactly one sentence. Even
though we are desperate for something smarter, we would not want to lose the
benefits of super small and fast indexes.
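
For illustration, the re-parsing route with a one-sentence constraint could
look roughly like the sketch below. It is only a sketch under assumptions:
it uses the JDK's java.text.BreakIterator for sentence boundaries, matches
the term by case-insensitive substring rather than through the analyzer used
at index time, and the class and method names are made up for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: re-parses the stored text of one
// hit and returns the single sentence that contains the query term.
final class SentenceSnippet {

    /** Returns the first sentence containing term (case-insensitive), or null if none does. */
    static String firstSentenceContaining(String text, String term) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        String needle = term.toLowerCase(Locale.ENGLISH);

        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end);
            // Assumption: plain substring match; a real highlighter would
            // compare analyzed tokens instead.
            if (sentence.toLowerCase(Locale.ENGLISH).contains(needle)) {
                return sentence.trim();
            }
        }
        return null;
    }
}
```

The cost is the one discussed in this thread: every hit's text is scanned
again at query time, which is slow for large documents.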

We have considered developing (but currently don't have the time) a package
that would store token locations outside of the Lucene core, and hacking
Lucene to expose token-ids.
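
A minimal sketch of what such a side store might look like, purely as an
assumption about the idea and not an existing package: it keeps, per
document, the character start and end of each token, indexed by token
position. The class name, the in-memory map, and the letter-or-digit
tokenizer are all hypothetical; in practice the tokenizer would have to
match the analyzer used at index time so the positions line up, and the part
that gets token positions out of Lucene is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical side store kept outside the Lucene index.
final class TokenOffsetStore {

    // docId -> flat array [start0, end0, start1, end1, ...], indexed by token position
    private final Map<Integer, int[]> offsetsByDoc = new HashMap<>();

    /** Tokenizes text on runs of letters/digits and records each token's character span. */
    void addDocument(int docId, String text) {
        List<Integer> flat = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip separators.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) i++;
            int start = i;
            // Consume one token.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) i++;
            if (i > start) {
                flat.add(start);
                flat.add(i); // end offset is exclusive
            }
        }
        int[] offsets = new int[flat.size()];
        for (int k = 0; k < offsets.length; k++) offsets[k] = flat.get(k);
        offsetsByDoc.put(docId, offsets);
    }

    /** Returns {start, endExclusive} for the nth token of a document, or null if unknown. */
    int[] lookup(int docId, int tokenPosition) {
        int[] offsets = offsetsByDoc.get(docId);
        if (offsets == null || tokenPosition < 0 || 2 * tokenPosition + 1 >= offsets.length) {
            return null;
        }
        return new int[] { offsets[2 * tokenPosition], offsets[2 * tokenPosition + 1] };
    }
}
```

With something like this, a term known to be the nth token of a document
could be mapped to its character span without re-parsing the text, at the
price of extra storage kept next to the index.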

Sorry, no real solutions here; I guess this post is a +1 for keeping indexes
small and fast, and a +1 for this being a real problem without a perfect
solution (yet).

Jochen