lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Sturge <tstu...@hi5.com>
Subject Re: Term numbering and range filtering
Date Tue, 18 Nov 2008 23:43:56 GMT
I've finished a query time implementation of a column stride filter, which
implements DocIdSetIterator. This just builds the filter at process start
and uses it for each subsequent query. The index itself is unchanged.

The results are very impressive. Here are the results on a 45M document
index:

Firstly without an age constraint as a baseline:

Query "+name:tim" 
startup: 0 
Hits: 15089
first query: 1004
100 queries: 132 (1.32 msec per query)

Now with a cached filter. This is ideal from a speed standpoint but there
are too many possible start/end combinations to cache all the filters.

Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on cached RangeFilter)
startup: 3
Hits: 11156
first query: 1830
100 queries: 287 (2.87 msec per query)

Now with an uncached filter. This is awful.

Query "+name:tim age:[18 TO 35]" (uncached ConstantScoreRangeQuery)
startup: 3
Hits: 11156
first query: 1665
100 queries: 51862 (yes, 518 msec per query, 200x slower)

A RangeQuery is slightly better but still bad (and has a different result
set)

Query "+name:tim age:[18 TO 35]" (uncached RangeQuery)
startup: 0
Hits: 10147
first query: 1517
100 queries: 27157 (271 msec is 100x slower than the filter)

Now with the prebuilt column stride filter:

Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on prebuilt column
stride filter)
startup: 2811
Hits: 11156
first query: 1395
100 queries: 441 (back down to 4.41msec per query)

This is less than 2x slower than the dedicated bitset and more than 50x
faster than the range boolean query.

Mike, Paul, I'm happy to contribute this (ugly but working) code if there is
interest. Let me know and I'll open a JIRA issue for it.

Tim


On 11/11/08 1:27 PM, "Michael McCandless" <lucene@mikemccandless.com> wrote:

> 
> Paul Elschot wrote:
> 
>> Op Tuesday 11 November 2008 21:55:45 schreef Michael McCandless:
>>> Also, one nice optimization we could do with the "term number column-
>>> stride array" is do bit packing (borrowing from the PFOR code)
>>> dynamically.
>>> 
>>> Ie since we know there are X unique terms in this segment, when
>>> populating the array that maps docID to term number we could use
>>> exactly the right number of bits.  Enumerated fields with not many
>>> unique values (eg, country, state) would take relatively little RAM.
>>> With LUCENE-1231, where the fields are stored column stride on disk,
>>> we could do this packing during index such that loading at search
>>> time is very fast.
>> 
>> Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410.
>> I think what you're referring to is PDICT, which has frame exceptions
>> for values that occur infrequently.
> 
> Yes let's move the discussion to Jira.
> 
> Actually I was referring to simple bit-packing.
> 
> For encoding array of compact enum terms (eg city, state, color, zip)
> I'm guessing the exceptions logic won't buy us much and would hurt
> seeking needed for column-stride fields.  But we should certainly test
> it.
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message