lucene-java-user mailing list archives

From Tim Sturge <>
Subject Re: Term numbering and range filtering
Date Mon, 10 Nov 2008 22:26:00 GMT
I think we've gone around in a loop here. It is exactly because cached filters
are inadequate for this case that I'm considering this approach.

Here's the section from my first email that is most illuminating:
The reason I have this question is that I am writing a multi-filter for
single term fields. My index contains many fields for which each document
contains a single term (e.g. date, zipcode, country) and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents).

A cached RangeFilter works well when there are only a few possible options
(e.g. for countries), but when there are many (consider a date range or a set
of zip codes) there are too many potential choices to cache each possibility,
and it is too inefficient to build a filter on the fly for each query: you
have to visit 10% of the documents to build the filter even though the query
itself matches only 0.1%.

Therefore I was considering building an int[reader.maxDoc()] array for each
field and storing in it the term number for each document. This relies on
the fact that each document contains only a single term for this field, but
with it I should be able to quickly construct a "multi-filter" (that is,
something that iterates over the array and checks whether the term is in the
range or set).
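The idea above can be sketched in plain Java, independently of the Lucene
API (the class and method names here, `OrdinalRangeFilter` and `rangeHits`,
are illustrative, and the sketch assumes the range endpoints actually occur
as terms): assign each distinct term of the single-term field an ordinal in
sorted order, store one ordinal per document, and answer a range query with
a cheap per-document ordinal comparison instead of visiting postings.

```java
import java.util.*;

public class OrdinalRangeFilter {
    // Returns ids of documents whose (single) term for the field falls
    // in [loTerm, hiTerm]. values[doc] holds the doc's term.
    public static List<Integer> rangeHits(String[] values,
                                          String loTerm, String hiTerm) {
        // Sorted term dictionary: ordinal order matches term order.
        String[] terms = Arrays.stream(values).distinct().sorted()
                               .toArray(String[]::new);

        // ords[doc] = ordinal of the document's single term.
        int[] ords = new int[values.length];
        for (int doc = 0; doc < values.length; doc++) {
            ords[doc] = Arrays.binarySearch(terms, values[doc]);
        }

        // The range query becomes an ordinal comparison per document.
        int lo = Arrays.binarySearch(terms, loTerm);
        int hi = Arrays.binarySearch(terms, hiTerm);
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < ords.length; doc++) {
            if (ords[doc] >= lo && ords[doc] <= hi) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] dates = {"2008-01", "2008-03", "2008-02", "2008-03",
                          "2008-01", "2008-04", "2008-02", "2008-04"};
        System.out.println(rangeHits(dates, "2008-02", "2008-03"));
        // prints [1, 2, 3, 6]
    }
}
```

The per-query scan touches one int per document, so it costs the same
whether the range matches 0.1% or 50% of documents, which is the point of
the proposal.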

Does this help explain my rationale? The reason I'm posting here is that I
imagine there are lots of people with this issue. In particular date ranges
seem to be something that lots of people use but Lucene implements fairly


On 11/10/08 1:58 PM, "Paul Elschot" <> wrote:

> Op Monday 10 November 2008 22:21:20 schreef Tim Sturge:
>> Hmmm -- I hadn't thought about that so I took a quick look at the
>> term vector support.
>> What I'm really looking for is a compact but performant
>> representation of a set of filters over the same single-term field.
>> Using term vectors would mean an algorithm similar to:
>>
>> String myfield;
>> String myterm;
>> TermFreqVector tv;
>> for (int i = 0; i < reader.maxDoc(); i++) {
>>     tv = reader.getTermFreqVector(i, myfield);
>>     if (tv != null && tv.indexOf(myterm) != -1) {
>>         // include this doc...
>>     }
>> }
>>
>> The key thing I am looking to achieve here is performance comparable
>> to filters. I suspect getTermFreqVector() is not efficient enough but
>> I'll give it a try.
> Better to use a TermDocs on myterm for this; have a look at the code of
> RangeFilter.
> Filters are normally created from a slower query by setting a bit in an
> OpenBitSet at "include this doc". Then they are reused for their speed.
> Filter caching could help. In case memory becomes a problem
> and the filters are sparse enough, try using SortedVIntList
> as the underlying data structure in the cache. (Sparse enough means
> fewer than 1 in 8 of all docs available in the index reader.)
> See also LUCENE-1296 for caching another data structure than the
> one used to collect the filtered docs.
> Regards,
> Paul Elschot
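
Paul's description above (iterate matching docs, set a bit per doc, then
reuse the bit set) can be sketched in plain Java. This is an assumption-laden
stand-in, not the Lucene code: `java.util.BitSet` replaces OpenBitSet, and
the `postings` map replaces what a TermDocs enumeration would yield; the
names `BitSetFilterSketch` and `buildFilter` are invented for illustration.

```java
import java.util.*;

public class BitSetFilterSketch {
    // Build a reusable filter: one bit per document, set for each doc
    // that contains any of the wanted terms, as the TermDocs iteration
    // in RangeFilter would do at "include this doc".
    public static BitSet buildFilter(int maxDoc,
                                     Map<String, int[]> postings,
                                     Set<String> wantedTerms) {
        BitSet bits = new BitSet(maxDoc);
        for (String term : wantedTerms) {
            int[] docs = postings.getOrDefault(term, new int[0]);
            for (int doc : docs) bits.set(doc); // include this doc
        }
        return bits;
    }

    public static void main(String[] args) {
        // Toy postings for a "country" field over 6 documents.
        Map<String, int[]> postings = new HashMap<>();
        postings.put("us", new int[]{0, 2, 5});
        postings.put("uk", new int[]{1, 5});
        postings.put("fr", new int[]{3});
        BitSet filter = buildFilter(6, postings, Set.of("us", "uk"));
        System.out.println(filter); // prints {0, 1, 2, 5}
    }
}
```

The cost of building the filter is proportional to the postings visited,
which is why it pays off only when the same filter is cached and reused;
a sparse result could instead be stored as a sorted list of doc ids, which
is the trade-off behind the SortedVIntList suggestion.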

To unsubscribe, e-mail:
For additional commands, e-mail:
