lucene-dev mailing list archives

From Tom Dunstan <tom.duns...@intecgroup.com.au>
Subject Excerpt pondering
Date Thu, 03 Oct 2002 04:41:22 GMT
Hi all

A couple of days ago I posted a message to the user list asking about some
of the API hooks that appear to be intended for excerpt generation, and I
haven't heard anything back yet.

I'd like some feedback on an idea I have for extending Lucene to hold the
extra information it needs, so that I don't have to reparse the entire body
text to generate excerpts.

Basically, to work out which sections of the text contain the terms that
generated the hit most frequently, I need the positions of those terms in
the document. This info, AFAICS, is already stored, but isn't accessible
from a Hits object. It would be nice to make it available somehow.
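
To make the idea concrete, here is a rough sketch of what I'd do with the
positions once I can get at them. It does not use any Lucene API; the class
and method names (ExcerptWindow, densestWindowStart) are made up for
illustration, and it assumes the positions of the matching terms have already
been collected into an int array somehow.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: picks the densest run of hits.
final class ExcerptWindow {

    /**
     * Returns the token position at which the window of the given size
     * containing the most term hits begins, or -1 if there are no hits.
     * A window starting at position p covers positions p .. p+windowSize-1.
     */
    static int densestWindowStart(int[] positions, int windowSize) {
        int[] sorted = positions.clone();
        Arrays.sort(sorted);

        int bestStart = -1;
        int bestCount = 0;
        int lo = 0;
        for (int hi = 0; hi < sorted.length; hi++) {
            // Shrink from the left until both ends fit inside one window.
            while (sorted[hi] - sorted[lo] >= windowSize) {
                lo++;
            }
            int count = hi - lo + 1;
            if (count > bestCount) {
                bestCount = count;
                bestStart = sorted[lo];
            }
        }
        return bestStart;
    }
}
```

The returned position would then be the natural place to start the excerpt,
provided it can be mapped back to a location in the original text.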

Also, to be able to work out where those terms were in the original
document, I'd like to store, and be able to retrieve, the start and end
offsets in the original field for each term. This info is currently attached
to the Token object, but AFAICS is not stored. Whether the best place to do
that would be an extension to the existing segments, or a separate
segment file, I'm not sure. I haven't really spent enough time looking at
the mechanics of the files yet.
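
As a sketch of the kind of information I mean, here is a toy in-memory
version. It is only an illustration: OffsetIndex and Span are hypothetical
names, the tokenizer is a naive split on non-alphanumeric characters, and
nothing here reflects how Lucene would actually lay this out on disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy index: remembers where each term occurred in the field.
final class OffsetIndex {

    /** Start (inclusive) and end (exclusive) character offsets of one occurrence. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    private final Map<String, List<Span>> spans = new HashMap<>();

    /** Tokenizes the text naively and records the offsets of every term. */
    void add(String text) {
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip separators, then consume one run of letters or digits.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase();
                spans.computeIfAbsent(term, k -> new ArrayList<>())
                     .add(new Span(start, i));
            }
        }
    }

    /** Returns the recorded offsets for a (lowercased) term, possibly empty. */
    List<Span> offsetsOf(String term) {
        return spans.getOrDefault(term, new ArrayList<>());
    }

    /** Cuts an excerpt with up to `context` characters on either side of the span. */
    static String excerpt(String text, Span span, int context) {
        int from = Math.max(0, span.start - context);
        int to = Math.min(text.length(), span.end + context);
        return text.substring(from, to);
    }
}
```

With offsets like these available at search time, the excerpt could be cut
straight out of the stored field text without running the analyzer again.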

I'd really appreciate it if someone who understands how things work
underneath could say "That sounds great, but try it like this" or "Don't do
anything, we're currently implementing something similar" or even "You
idiot, look at http://xyz/ to do that".

Thanks

Tom

--

Tom Dunstan

Mobile  0417 895 244
_______

Intec Consulting Group
* PO Box 7012 Hutt Street * Level 1, 1 Hutt Street * Adelaide 5000
* Tel  +61 8 8359 2332 * Fax  +61 8 8359 2264
Email: tom.dunstan@intecgroup.com.au
Website: www.intecgroup.com.au
 
