lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: Excerpt pondering
Date Thu, 03 Oct 2002 17:05:23 GMT
Tom Dunstan wrote:
> I'd like some feedback on an idea that I have to extend lucene to hold the
> extra information that it needs to stop me having to reparse the entire body
> text again to generate excerpts. 
> 
> Basically, to work out which sections of the text have the terms that
> generate the hit most frequently, I need the position of the terms in the
> document. This info, AFAICS, is already stored, but isn't accessible to
> someone from a Hits object. It would be nice to make it available somehow.

That's not impossible, but would require a substantial re-working of the 
search code, and would probably make search slower.  Also, I'm not sure 
how useful it would really be.

> Also, to be able to work out where those terms were in the original
> document, I'd like to store, and be able to retrieve, the start and end
> offset in the original field, for each term. This info is currently attached
> to the Term object, but AFAICS is not stored. Whether the best place to do
> that would be an extension to the existing segments, or in a separate
> segment file, I'm not sure. I haven't really spent enough time looking at
> the mechanics of the files yet.

This would greatly increase the size of the index, and would be 
difficult to make efficiently randomly accessible.

However, the primary rationale for not including this in the index is 
that typically you only display ten or so documents.  Re-tokenizing ten 
documents should only take a fraction of a second, and thus can be 
efficiently done at search time: there's no need to store the exact 
positions in the text.
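
As a rough illustration of re-tokenizing at search time, here is a minimal sketch in plain Java. It does not use any Lucene classes: the regex tokenizer is only a stand-in for whatever analyzer was used at index time, and the names ExcerptSketch, findHits and bestExcerpt are hypothetical. It scans the stored text of one displayed document, records the character offsets of the query terms, and returns the fixed-width window that contains the most of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExcerptSketch {
    // Assumption: a "token" is a run of letters or digits. A real
    // application would reuse the analyzer that built the index.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Returns {start, end} character offsets of every query-term occurrence. */
    static List<int[]> findHits(String text, Set<String> queryTerms) {
        List<int[]> hits = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // queryTerms are expected to be lowercase already.
            if (queryTerms.contains(m.group().toLowerCase(Locale.ROOT))) {
                hits.add(new int[] {m.start(), m.end()});
            }
        }
        return hits;
    }

    /** Returns the window of at most `window` chars holding the most hits. */
    static String bestExcerpt(String text, Set<String> queryTerms, int window) {
        List<int[]> hits = findHits(text, queryTerms);
        if (hits.isEmpty()) {
            // No term found: fall back to the start of the text.
            return text.substring(0, Math.min(window, text.length()));
        }
        int bestStart = hits.get(0)[0];
        int bestCount = 0;
        int j = 0;
        // Try each hit as the left edge of the window and count the
        // hits that fit entirely inside it (two-pointer sweep).
        for (int i = 0; i < hits.size(); i++) {
            int start = hits.get(i)[0];
            if (j < i) {
                j = i;
            }
            while (j < hits.size() && hits.get(j)[1] <= start + window) {
                j++;
            }
            int count = j - i;
            if (count > bestCount) {
                bestCount = count;
                bestStart = start;
            }
        }
        return text.substring(bestStart, Math.min(bestStart + window, text.length()));
    }

    public static void main(String[] args) {
        String body = "Lucene is a search library. "
                + "Excerpts show the search terms near each search hit.";
        System.out.println(bestExcerpt(body, Set.of("search", "hit"), 30));
    }
}
```

Running this over the ten or so documents on a results page costs one linear pass over each document's text, which is the trade-off described above: a little work at display time instead of storing offsets for every term in the index.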

Doug

