lucene-dev mailing list archives

From Grant Ingersoll
Subject Best Practices for getting Strings from a position range
Date Mon, 16 Jul 2007 02:18:29 GMT
Do we have a best practice for going from, say, a SpanQuery's doc/
position information to the actual range of positions of content in
the Document?  Is it just to reanalyze the Document using the
appropriate Analyzer and start recording once you hit the positions
you are interested in?  It seems like Term Vectors _could_ help, but
even my new Mapper approach patch (LUCENE-868) doesn't really help,
because they are stored in a term-centric manner.  I guess what I am
after is a position-centric approach.  That is, given a Document, get
a term vector (note, not a TermFreqVector) back that is ordered by
position (thus, there may be duplicate entries for a given term that
occurs in multiple positions).
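
To make the term-centric vs. position-centric distinction concrete, here
is a rough sketch in plain Java (no Lucene classes).  The Map of term to
positions is only a stand-in for what a term vector with positions would
expose, and PositionCentricView, invert() and range() are hypothetical
names, not existing API.  Treating the end of the range as exclusive is
also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: inverts a term-centric view
// (term -> positions) into a position-centric one (position -> terms).
class PositionCentricView {

    // termPositions stands in for what a term vector with positions exposes.
    static TreeMap<Integer, List<String>> invert(Map<String, int[]> termPositions) {
        TreeMap<Integer, List<String>> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : termPositions.entrySet()) {
            for (int pos : entry.getValue()) {
                // A term that occurs at several positions appears once per position.
                byPosition.computeIfAbsent(pos, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return byPosition;
    }

    // Terms at positions in [start, end), in position order (end is exclusive).
    static List<String> range(TreeMap<Integer, List<String>> byPosition, int start, int end) {
        List<String> out = new ArrayList<>();
        for (List<String> terms : byPosition.subMap(start, end).values()) {
            out.addAll(terms);
        }
        return out;
    }
}
```

The cost is visible here: the whole vector has to be walked and
re-sorted before the range can be read, which is the work a
position-ordered store would avoid.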

I feel like I am missing something obvious.  I would suspect the  
highlighter needs to do this, but it seems to take the reanalyze  
approach as well (I admit, though, that I have little experience with  
the highlighter.)
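
For reference, the reanalyze approach could look roughly like the sketch
below.  It is plain Java: a whitespace split stands in for the real
Analyzer, every token is assumed to advance the position by one, and
ReanalyzeRange.extract() is a hypothetical name.  It counts positions
while re-tokenizing and uses the character offsets of the first and last
matching tokens to cut the original String.

```java
// Hypothetical helper: a whitespace split stands in for the real Analyzer.
class ReanalyzeRange {

    // Returns the original text covered by token positions [start, end),
    // or "" if no token falls in that range.
    static String extract(String text, int start, int end) {
        int position = -1;   // position of the most recently read token
        int from = -1;       // char offset where the first matching token starts
        int to = -1;         // char offset just past the last matching token
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip whitespace between tokens.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int tokenStart = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            position++;
            if (position >= end) {
                break;       // past the range, stop analyzing
            }
            if (position >= start) {
                if (from < 0) {
                    from = tokenStart;
                }
                to = i;
            }
        }
        return from < 0 ? "" : text.substring(from, to);
    }
}
```

Note that the text has to be tokenized from the beginning up to the end
of the range on every call, even when only a few positions are wanted.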

I am wondering if it would be useful to have an alternative Term
Vector storage mechanism that was position-centric.  Because we
couldn't take advantage of the lexicographic compression, it would
take up more disk space, but it would be a lot faster for these kinds
of things.  With this kind of approach, you could easily index into
an array based on the result of Spans.start(), etc.  Of course, you
would have to have a data structure that handled the multiple terms
per position option, but I don't think that would be too hard.
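
As a rough illustration of such a data structure, here is a sketch in
plain Java: one slot per position, each slot holding every term at that
position.  PositionSlots and its methods are hypothetical names, not an
existing or proposed Lucene class.  It assumes tokens arrive with a
position increment, where 0 stacks a term on the previous position and
values above 1 leave empty slots.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structure: one slot per position, each slot holding
// every term that occupies that position.
class PositionSlots {
    private final List<List<String>> slots = new ArrayList<>();

    // An increment of 0 stacks the term on the previous position (for
    // example a synonym), 1 moves to the next position, and larger values
    // leave empty slots in between (for example a removed stop word).
    void add(String term, int positionIncrement) {
        if (positionIncrement < 0) {
            throw new IllegalArgumentException("positionIncrement must be >= 0");
        }
        int pos = Math.max(0, slots.size() - 1 + positionIncrement);
        while (slots.size() <= pos) {
            slots.add(new ArrayList<>());
        }
        slots.get(pos).add(term);
    }

    // Slots for positions [start, end), clamped to what was recorded.
    List<List<String>> slice(int start, int end) {
        int to = Math.min(end, slots.size());
        int from = Math.min(Math.max(start, 0), to);
        return new ArrayList<>(slots.subList(from, to));
    }
}
```

With the span's start and end as the slice bounds, the lookup is a
direct index into the list rather than a scan over every term.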

Just thinking out loud...

