lucene-dev mailing list archives

From markharw00d <>
Subject Re: [jira] Created: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads
Date Tue, 13 Feb 2007 08:42:21 GMT
Chris Hostetter wrote:
> i haven't read the patch, but based on the jira description i don't think
> this is attempting to reuse cached info across updates -- i think it's
> just trying to address the issue of eliminating redundant TermEnum/TermDoc
> iterating for similar ranges.
Correct. I have conveniently ignored the issue of index updates and 
concerned myself with how to best handle similar (but not exactly the 
same) queries.
The logic is this:
* Certain types of field (dates/location coordinates) are commonly 
accessed as a range rather than as individual terms
* Reading these ranges requires reading a lot of info off disk
* To avoid reading too much off disk we cache sets of doc ids (typically in a bitset)

The options for caching these sets of doc ids are:
1) Caching sets representing a particular query - cache hits are likely 
to be low because users request different ranges of values, and a request 
would need exactly the same start/end range for such a scheme to work.
2) Caching sets representing uniform ranges of values (e.g. years/months) 
- this requires knowledge of the term value format, and some ranges may 
be very sparse (e.g. months or years with no content), which is 
inefficient when using bitsets for caching.
3) Caching sets representing non-uniform value ranges depending on index 
content distribution

Option 3 is the approach I took. Initialization involves ripping 
through a TermEnum, counting the total number of docs the terms 
represent. I then create a collection of ascending term ranges (e.g. 
20010101 to 20020327, 20020328 to 20020415), each of which represents a 
roughly equal number of doc references and therefore guarantees a 
reasonably non-sparse set for caching purposes.
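
Roughly, that initialization pass looks like the sketch below (written 
against the Lucene 2.x TermEnum API; the class names and the 
docsPerPartition parameter are illustrative, not the actual patch code):

import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

class RangePartitioner {

    /** One contiguous run of terms covering a roughly equal share of doc references. */
    static class Partition {
        String lowerTerm;   // inclusive
        String upperTerm;   // inclusive
        int hitCount;       // bumped whenever a query wholly contains this partition
        BitSet cachedBits;  // only populated for the most popular partitions
    }

    /** Rip through the field's terms, cutting a new partition every docsPerPartition docs. */
    static List<Partition> buildPartitions(IndexReader reader, String field,
                                           int docsPerPartition) throws IOException {
        List<Partition> partitions = new ArrayList<Partition>();
        TermEnum te = reader.terms(new Term(field, ""));
        try {
            Partition current = null;
            int docsInCurrent = 0;
            while (te.term() != null && field.equals(te.term().field())) {
                if (current == null) {
                    current = new Partition();
                    current.lowerTerm = te.term().text();
                }
                current.upperTerm = te.term().text();
                docsInCurrent += te.docFreq();
                if (docsInCurrent >= docsPerPartition) {
                    partitions.add(current);   // boundary reached - close this partition
                    current = null;
                    docsInCurrent = 0;
                }
                if (!te.next()) {
                    break;
                }
            }
            if (current != null) {
                partitions.add(current);       // trailing, smaller partition
            }
        } finally {
            te.close();
        }
        return partitions;
    }
}
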
At query time I create range filters dynamically by comparing the query 
range against these partitions to see which ranges are wholly contained 
- these are deemed cache hits. Each range records the number of "cache 
hits", and bitsets are cached only for the top "n" ranges that prove 
popular. Unpopular ranges continue to hit the disk (via 
TermEnum/TermDocs) and do not cache this info. Popular ranges that have 
a cached bitset "OR" it into the final filter's bitset.
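
Continuing that sketch, the query-time assembly might look something like 
the following (again illustrative names rather than the patch itself); 
wholly contained partitions count as cache hits and contribute their 
cached bitset if they have one, while everything else falls back to 
TermEnum/TermDocs:

import java.io.IOException;
import java.util.BitSet;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

class RangeBitsBuilder {

    /** Build the bitset for the inclusive term range [lower..upper] using the partitions above. */
    static BitSet bitsForRange(IndexReader reader, String field, String lower, String upper,
                               List<RangePartitioner.Partition> partitions) throws IOException {
        BitSet result = new BitSet(reader.maxDoc());
        for (RangePartitioner.Partition p : partitions) {
            // Skip partitions entirely outside the requested range.
            if (p.upperTerm.compareTo(lower) < 0 || p.lowerTerm.compareTo(upper) > 0) {
                continue;
            }
            boolean whollyContained =
                    p.lowerTerm.compareTo(lower) >= 0 && p.upperTerm.compareTo(upper) <= 0;
            if (whollyContained) {
                p.hitCount++;                    // popularity statistic
                if (p.cachedBits != null) {
                    result.or(p.cachedBits);     // cache hit - no disk reads needed
                    continue;
                }
            }
            // Partially overlapping, or not (yet) popular enough to be cached:
            // read the overlapping slice of terms off disk.
            String from = p.lowerTerm.compareTo(lower) >= 0 ? p.lowerTerm : lower;
            String to = p.upperTerm.compareTo(upper) <= 0 ? p.upperTerm : upper;
            orTermRange(reader, field, from, to, result);
        }
        return result;
    }

    /** OR every doc containing a term in the inclusive range [from..to] into bits. */
    static void orTermRange(IndexReader reader, String field, String from, String to,
                            BitSet bits) throws IOException {
        TermEnum te = reader.terms(new Term(field, from));
        TermDocs td = reader.termDocs();
        try {
            while (te.term() != null
                    && field.equals(te.term().field())
                    && te.term().text().compareTo(to) <= 0) {
                td.seek(te.term());
                while (td.next()) {
                    bits.set(td.doc());
                }
                if (!te.next()) {
                    break;
                }
            }
        } finally {
            td.close();
            te.close();
        }
    }
}
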

Performance seems to vary depending on the number of ranges you 
configure, the number of bitsets you allow to be cached and the query 
ranges in use.


The approach I took, therefore, is to cache by popularity of value range 
rather than by exact query.

> : Given a sample date range:
> : 01/01/2001 to 01/01/2007
> : has a bit set associated.
> :
> : If I did a query for
> : 12/31/2000 to 01/01/2007
> :
> : the current filter code would cause a new filter to be created (which
> : is what I assume you are attempting to solve). Under your solution I
> i don't think it's an attempt at reducing the number of bitsets, just at
> reducing the amount of time spent computing those bitsets -- with the
> filter on "01/01/2001 to 01/01/2007" already in the cache, the filter on
> "12/31/2000 to 01/01/2007" only needs to look at the terms between
> 12/31/2000 and 01/01/2001 - and can OR the result with the existing
> bitset.
> -Hoss
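
To restate Hoss's point in terms of the orTermRange sketch above (assuming 
day-granularity yyyyMMdd terms in a hypothetical "date" field): with the 
bitset for 20010101..20070101 already cached, the wider query only has to 
read the one extra day off disk and OR it in:

// cachedBits holds the docs for 20010101..20070101, computed earlier
BitSet bits = (BitSet) cachedBits.clone();
// Read just the uncovered sliver (20001231) - far cheaper than re-reading
// the whole six-year range via TermEnum/TermDocs.
RangeBitsBuilder.orTermRange(reader, "date", "20001231", "20001231", bits);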

