lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Term numbering and range filtering
Date Sun, 09 Nov 2008 20:23:56 GMT

Conceivably, TermInfosReader could track the sequence number of each  
term.

A seek/skipTo would know which sequence number it just jumped too,  
because the index is regular (every 128 terms by default), and then  
each next() call could increment that.  Then retrieving this number  
would be as costly as calling eg IndexReader.docFreq(Term) is now.

But I'm not sure how a multi-segment index  would work, ie how would  
MultiSegmentReader compute this for its terms?  Or maybe you'd just do  
this per-segment?

Mike

Tim Sturge wrote:

> Hi,
>
> I’m wondering if there is any easy technique to number the terms in  
> an index
> (By number I mean map a sequence of terms to a contiguous range of  
> integers
> and map terms to these numbers efficiently)
>
> Looking at the Term class and the .tis/.tii index format it appears  
> that the
> terms are stored in an ordered and prefix-compressed format, but  
> while there
> are pointers from a term to the .frq and .prx files, neither is really
> suitable as a sequence number.
>
> The reason I have this question is that I am writing a multi-filter  
> for
> single term fields. My index contains many fields for which each  
> document
> contains a single term (e.g. date, zipcode, country) and I need to  
> perform
> range queries or set matches over these fields, many of which are very
> inclusive (they match >10% of the total documents)
>
> A cached RangeFilter works well when there are a small number of  
> potential
> options (e.g. for countries) but when there are many options  
> (consider a
> date range or a set of zipcodes) there are too many potential  
> choices to
> cache each possibility and it is too inefficient to build a filter  
> on the
> fly for each query (as you have to visit 10% of documents to build the
> filter despite the query itself matching 0.1%)
>
> Therefore I was considering building a int[reader.maxDocs()] array  
> for each
> field and putting into it the term number for each document. This  
> relies on
> the fact that each document contains only a single term for this  
> field, but
> with it I should be able to quickly construct a “multi-filter” (that  
> is,
> something that iterates the array and checks that the term is in the  
> range
> or set).
>
> Right now it looks like I can do some very ugly surgery and perhaps  
> use the
> offset to the prx file even though it is not contiguous. But I’m  
> hoping
> there is a better technique that I’m just not seeing right now.
>
> Thanks,
>
> Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message