lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Sturge <tstu...@hi5.com>
Subject Re: Term numbering and range filtering
Date Mon, 10 Nov 2008 22:26:00 GMT
I think we've gone around in a loop here. It's exactly due to the inadequacy
of cached filters that I'm considering what I'm doing.

Here's the section from my first email that is most illuminating:
"
The reason I have this question is that I am writing a multi-filter for
single term fields. My index contains many fields for which each document
contains a single term (e.g. date, zipcode, country) and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents)

A cached RangeFilter works well when there are a small number of potential
options (e.g. for countries) but when there are many options (consider a
date range or a set of zipcodes) there are too many potential choices to
cache each possibility and it is too inefficient to build a filter on the
fly for each query (as you have to visit 10% of documents to build the
filter despite the query itself matching 0.1%)

Therefore I was considering building a int[reader.maxDocs()] array for each
field and putting into it the term number for each document. This relies on
the fact that each document contains only a single term for this field, but
with it I should be able to quickly construct a ³multi-filter² (that is,
something that iterates the array and checks that the term is in the range
or set).
"

Does this help explain my rationale? The reason I'm posting here is that I
imagine there are lots of people with this issue. In particular date ranges
seem to be something that lots of people use but Lucene implements fairly
poorly.

Tim

On 11/10/08 1:58 PM, "Paul Elschot" <paul.elschot@xs4all.nl> wrote:

> Op Monday 10 November 2008 22:21:20 schreef Tim Sturge:
>> Hmmm -- I hadn't thought about that so I took a quick look at the
>> term vector support.
>> 
>> What I'm really looking for is a compact but performant
>> representation of a set of filters on the same (one term field).
>> Using term vectors would mean an algorithm similar to:
>> 
>> String myfield;
>> String myterm;
>> TermVector tv;
>> for (int i = 0 ;  i < maxDoc ; i++) {
>>     tv = reader.getTermFreqVector(i,country)
>>     if (tv.indexOf(myterm) != -1) {
>>           // include this doc...
>>         }
>> }
>> 
>> The key thing I am looking to achieve here is performance comparable
>> to filters. I suspect getTermFremVector() is not efficient enough but
>> I'll give it a try.
>> 
> 
> Better use a TermDocs on myterm for this, have a look at the code of
> RangeFilter.
> 
> Filters are normally created from a slower query by setting a bit in an
> OpenBitSet at "include this doc". Then they are reused for their speed.
> 
> Filter caching could help. In case memory becomes a problem
> and the filters are sparse enough, try and use SortedVIntList
> as the underlying data structure in the cache. (Sparse enough means
> less than 1 in 8 of all docs available the index reader.)
> See also LUCENE-1296 for caching another data structure than the
> one used to collect the filtered docs.
> 
> Regards,
> Paul Elschot
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message