lucene-dev mailing list archives

From Michael McCandless
Subject Re: Getting tokens from search results. Simple concept
Date Sat, 07 Mar 2009 13:47:09 GMT

Mike Klaas wrote:

> On 5-Mar-09, at 2:42 PM, Chris Hostetter wrote:
>> : What I would LOVE is if I could do it in a standard Lucene search like I
>> : mentioned earlier.
>> : Hit.doc[0].getHitTokenList() :confused:
>> : Something like this...
>> The Query/Scorer APIs don't provide any mechanism for information like
>> that to be conveyed back up the call chain -- mainly because it's more
>> heavyweight than most people need.
>> If you have custom Query/Scorer implementations, you can keep track of
>> whatever state you want when executing a Query -- in fact the SpanQuery
>> family of queries do keep track of exactly the type of info you seem to
>> want, and after executing a query, you can ask it for the "Spans" of any
>> matching document -- the downside is a loss in performance of query
>> execution (because it takes time/memory to keep track of all the matches)
> Even then, if I'm not mistaken, spans track token _positions_, not
> _offsets_ in the original string.

That's correct.

> A reverse text index like lucene is fast precisely because it  
> doesn't have to keep track of this information.

One option is to stuff the offsets into payloads, and then make a
custom Query that decodes the offsets from the payload and stores
them away when collecting hits.
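
As a rough sketch of the encoding half of that idea, the start and end
character offsets of a token could be packed into a fixed 8-byte payload
and unpacked again at search time. OffsetPayload and its methods are
hypothetical names used for illustration, not part of Lucene, and the
sketch uses only the JDK; attaching the bytes to a token and reading
them back inside a custom Query is left out.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not a Lucene class): packs a token's start and end
// character offsets into a fixed-size byte array suitable for a payload.
final class OffsetPayload {
    // Two 4-byte ints: start offset followed by end offset.
    static final int SIZE = 2 * Integer.BYTES;

    // Encode the offsets as 8 big-endian bytes.
    static byte[] encode(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("invalid offsets");
        }
        return ByteBuffer.allocate(SIZE)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    // Decode a payload produced by encode(); returns {startOffset, endOffset}.
    static int[] decode(byte[] payload) {
        if (payload.length != SIZE) {
            throw new IllegalArgumentException("unexpected payload length");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }
}
```

A fixed-width layout keeps decoding trivial; a variable-length encoding
would make the payloads smaller at the cost of a little more code.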

> I think the best alternative might be to use termvectors, which are  
> essentially a cache of the analyzed tokens for a document.

Another way to think of term vectors is as a single-document inverted
index that you can retrieve in its entirety.  That is, it maps terms to
their occurrences (count, positions, offsets) within the document.
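
To make that picture concrete, here is a minimal JDK-only sketch of such
a single-document inverted index. It is not Lucene's term vector API:
SingleDocIndex and Occurrence are hypothetical names, and the tokenizer
(lowercased runs of letters and digits) is an assumption standing in for
a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration (not a Lucene class): maps each term of one
// document to its occurrences, each with a position and character offsets.
final class SingleDocIndex {
    // One occurrence of a term within the document.
    static final class Occurrence {
        final int position;     // token position (0-based)
        final int startOffset;  // first character of the token
        final int endOffset;    // one past the last character

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Tokenize on runs of letters/digits, lowercase each token, and record
    // where it occurs. The count of a term is the size of its list.
    static Map<String, List<Occurrence>> build(String text) {
        Map<String, List<Occurrence>> index = new TreeMap<>();
        int position = 0;
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip separators.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) i++;
            int start = i;
            // Consume one token.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) i++;
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                index.computeIfAbsent(term, k -> new ArrayList<>())
                     .add(new Occurrence(position++, start, i));
            }
        }
        return index;
    }
}
```

Looking up a term in the returned map gives every place it occurs in the
document, so the original text of a match can be recovered with
text.substring(startOffset, endOffset).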

I agree, term vectors should work for this.

I don't really understand, though, why the highlighter package doesn't  
work here -- it also just re-analyzes the text, when it can't find  
term vectors.

