lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Sturge <tstu...@hi5.com>
Subject Term numbering and range filtering
Date Fri, 07 Nov 2008 20:26:22 GMT
Hi,

I¹m wondering if there is any easy technique to number the terms in an index
(By number I mean map a sequence of terms to a contiguous range of integers
and map terms to these numbers efficiently)

Looking at the Term class and the .tis/.tii index format it appears that the
terms are stored in an ordered and prefix-compressed format, but while there
are pointers from a term to the .frq and .prx files, neither is really
suitable as a sequence number.

The reason I have this question is that I am writing a multi-filter for
single term fields. My index contains many fields for which each document
contains a single term (e.g. date, zipcode, country) and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents)

A cached RangeFilter works well when there are a small number of potential
options (e.g. for countries) but when there are many options (consider a
date range or a set of zipcodes) there are too many potential choices to
cache each possibility and it is too inefficient to build a filter on the
fly for each query (as you have to visit 10% of documents to build the
filter despite the query itself matching 0.1%)

Therefore I was considering building a int[reader.maxDocs()] array for each
field and putting into it the term number for each document. This relies on
the fact that each document contains only a single term for this field, but
with it I should be able to quickly construct a ³multi-filter² (that is,
something that iterates the array and checks that the term is in the range
or set).

Right now it looks like I can do some very ugly surgery and perhaps use the
offset to the prx file even though it is not contiguous. But I¹m hoping
there is a better technique that I¹m just not seeing right now.

Thanks,

Tim

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message