Date: Fri, 07 Nov 2008 12:26:22 -0800
Subject: Term numbering and range filtering
From: Tim Sturge <tsturge@hi5.com>
To: <java-user@lucene.apache.org>
Message-ID: <C539E46E.1195%tsturge@hi5.com>
Mime-version: 1.0
Content-type: multipart/alternative;
	boundary="B_3308905583_123885350"

--B_3308905583_123885350
Content-type: text/plain;
	charset="ISO-8859-1"
Content-transfer-encoding: quoted-printable

Hi,

I'm wondering if there is an easy technique to number the terms in an index.
(By "number" I mean map a sequence of terms to a contiguous range of integers
and map terms to these numbers efficiently.)

Looking at the Term class and the .tis/.tii index format, it appears that the
terms are stored in an ordered and prefix-compressed format, but while there
are pointers from a term to the .frq and .prx files, neither is really
suitable as a sequence number.
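
To make "numbering" concrete: since the terms are kept in sorted order, a
term's number could simply be its rank in that order. The sketch below is
plain Java with no Lucene dependency. The class TermNumbering and its methods
are hypothetical names, and the input array only stands in for whatever
enumerating one field's terms would produce.

```java
import java.util.Arrays;

/** Hypothetical sketch: numbers the terms of one field by rank in sorted order. */
final class TermNumbering {
    private final String[] sortedTerms; // distinct terms, ascending

    TermNumbering(String[] terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy);
        // Drop duplicates in place so the numbers form a contiguous range.
        int n = 0;
        for (int i = 0; i < copy.length; i++) {
            if (n == 0 || !copy[i].equals(copy[n - 1])) {
                copy[n++] = copy[i];
            }
        }
        sortedTerms = Arrays.copyOf(copy, n);
    }

    /** Number of distinct terms; valid term numbers are 0 .. size() - 1. */
    int size() {
        return sortedTerms.length;
    }

    /** Term number of an exact term, or -1 if the term is not present. */
    int numberOf(String term) {
        int pos = Arrays.binarySearch(sortedTerms, term);
        return pos >= 0 ? pos : -1;
    }

    /** Smallest term number whose term is >= the given text (size() if none). */
    int ceiling(String text) {
        int pos = Arrays.binarySearch(sortedTerms, text);
        return pos >= 0 ? pos : -(pos + 1);
    }

    /** Term for a given number. */
    String termOf(int number) {
        return sortedTerms[number];
    }
}
```

With numbers assigned this way, a range of terms becomes a range of integers,
and range endpoints that are not themselves terms can still be resolved with
ceiling().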

The reason I have this question is that I am writing a multi-filter for
single-term fields. My index contains many fields for which each document
contains a single term (e.g. date, zipcode, country), and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents).

A cached RangeFilter works well when there are a small number of potential
options (e.g. for countries), but when there are many options (consider a
date range or a set of zipcodes) there are too many potential choices to
cache each possibility, and it is too inefficient to build a filter on the
fly for each query (you have to visit 10% of the documents to build the
filter even though the query itself matches only 0.1%).

Therefore I was considering building an int[reader.maxDoc()] array for each
field and putting into it the term number for each document. This relies on
the fact that each document contains only a single term for this field, but
with it I should be able to quickly construct a "multi-filter" (that is,
something that iterates the array and checks that the term is in the range
or set).
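
Roughly what I have in mind, as a minimal sketch only: plain Java with no
Lucene dependency, where TermNumberFilter, range() and anyOf() are
hypothetical names. It assumes the per-document array has already been filled
with term numbers (with -1 for documents that have no term in the field) and
it returns a java.util.BitSet of matching document ids.

```java
import java.util.BitSet;

/** Hypothetical sketch of the "multi-filter": one term number per document. */
final class TermNumberFilter {
    // Index = document id, value = term number for the field, -1 if none.
    private final int[] termNumberByDoc;

    TermNumberFilter(int[] termNumberByDoc) {
        this.termNumberByDoc = termNumberByDoc;
    }

    /** Documents whose term number lies in [lower, upper], both inclusive. */
    BitSet range(int lower, int upper) {
        BitSet bits = new BitSet(termNumberByDoc.length);
        for (int doc = 0; doc < termNumberByDoc.length; doc++) {
            int n = termNumberByDoc[doc];
            if (n >= 0 && n >= lower && n <= upper) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Documents whose term number is set in 'allowed' (indexed by term number). */
    BitSet anyOf(BitSet allowed) {
        BitSet bits = new BitSet(termNumberByDoc.length);
        for (int doc = 0; doc < termNumberByDoc.length; doc++) {
            int n = termNumberByDoc[doc];
            if (n >= 0 && allowed.get(n)) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

Each call is one linear pass over the array with no term dictionary or
postings access, which is the property I am after. The cost is one int per
document per field held in memory.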

Right now it looks like I can do some very ugly surgery and perhaps use the
offset to the prx file even though it is not contiguous. But I'm hoping
there is a better technique that I'm just not seeing right now.

Thanks,

Tim

--B_3308905583_123885350--