lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Term numbering and range filtering
Date Tue, 11 Nov 2008 11:11:27 GMT

It seems like for many of your examples (age, zip code, country),
simply computing & storing the mapping yourself (your first option
below) would actually be viable?

Also: I think in fact you never need to merge the term numbering for
many segments during searching?  Ie, the search runs one IndexReader
at a time, so your term numbering only needs to run in the context of
a single IndeReader?

If merging-for-searching is not necessary then I think this ("get term
number" for a single segment) is a straightforward extension to
TermInfosReader and then you can accomplish forward (Term -> number)
mappping via TermInfosReader without holding an all-in-ram map (it
will cost you a seek to look up the number of a term though, but, seek
will soonish be much lower cost, with SSDs, so...).

Also, if the reverse mapping is ever needed (I think it may not be?)
TermInfosReader could also be extended to seek by term number.

And... the way FieldCache API works now is actually good for your use
case, because it enums the Terms in order, and then for each Term
enums all docs having that Term.  You could extend that to increment
the term counter as its loading.

Whereas LUCENE-1231 (column stride fields) wouldn't initially make
this ("translate term text to term number while loading") possible,
though maybe we should in fact make that an option (which is I think
the same as your 2nd option below).

So I think the separable idea of using term numbers, both in the index
format when storing column-stride fields, and then in RAM for fast
RAM-based range filtering/scoring, makes alot of sense.

It will get tricky when you need a different Collator than simple Java
String comparisons, but LUCENE-1435 is already working towards letting
you use a different Collator per-field.


Tim Sturge wrote:

> Yes, that is a significant issue. What I'm coming to realize is that  
> either
> I will end up with something like
> class MultiFilter {
>   String field;
>   private int[] termInDoc;
>   Map<Term,int> termToInt;
>   ...
> }
> which can be entirely built on the current lucene APIs but has  
> significantly
> more overhead (the termToInt mapping in particular and the need to  
> construct
> the mapping and array on startup)
> Or I can go deep into the guts and add a data file per-segment with  
> a format
> something like
> int version
> int numFields
> (int fieldNum, long offset) ^ numFields
> (int termForDoc) ^ (maxDocs * numFields)
> and add something to FieldInfo  like  boolean storeMultiFilter;
> and FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to  
> add
> an int termNum to the .tis file as well.
> This is clearly a lot more work than the first solution, but it is a  
> lot
> nicer to deal with as well. Is this interesting to anyone other than  
> me?
> Tim
> On 11/9/08 12:23 PM, "Michael McCandless"  
> <> wrote:
>> Conceivably, TermInfosReader could track the sequence number of each
>> term.
>> A seek/skipTo would know which sequence number it just jumped too,
>> because the index is regular (every 128 terms by default), and then
>> each next() call could increment that.  Then retrieving this number
>> would be as costly as calling eg IndexReader.docFreq(Term) is now.
>> But I'm not sure how a multi-segment index  would work, ie how would
>> MultiSegmentReader compute this for its terms?  Or maybe you'd just  
>> do
>> this per-segment?
>> Mike
>> Tim Sturge wrote:
>>> Hi,
>>> I’m wondering if there is any easy technique to number the terms in
>>> an index
>>> (By number I mean map a sequence of terms to a contiguous range of
>>> integers
>>> and map terms to these numbers efficiently)
>>> Looking at the Term class and the .tis/.tii index format it appears
>>> that the
>>> terms are stored in an ordered and prefix-compressed format, but
>>> while there
>>> are pointers from a term to the .frq and .prx files, neither is  
>>> really
>>> suitable as a sequence number.
>>> The reason I have this question is that I am writing a multi-filter
>>> for
>>> single term fields. My index contains many fields for which each
>>> document
>>> contains a single term (e.g. date, zipcode, country) and I need to
>>> perform
>>> range queries or set matches over these fields, many of which are  
>>> very
>>> inclusive (they match >10% of the total documents)
>>> A cached RangeFilter works well when there are a small number of
>>> potential
>>> options (e.g. for countries) but when there are many options
>>> (consider a
>>> date range or a set of zipcodes) there are too many potential
>>> choices to
>>> cache each possibility and it is too inefficient to build a filter
>>> on the
>>> fly for each query (as you have to visit 10% of documents to build  
>>> the
>>> filter despite the query itself matching 0.1%)
>>> Therefore I was considering building a int[reader.maxDocs()] array
>>> for each
>>> field and putting into it the term number for each document. This
>>> relies on
>>> the fact that each document contains only a single term for this
>>> field, but
>>> with it I should be able to quickly construct a “multi-filter” (that
>>> is,
>>> something that iterates the array and checks that the term is in the
>>> range
>>> or set).
>>> Right now it looks like I can do some very ugly surgery and perhaps
>>> use the
>>> offset to the prx file even though it is not contiguous. But I’m
>>> hoping
>>> there is a better technique that I’m just not seeing right now.
>>> Thanks,
>>> Tim
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message