lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (JIRA)" <>
Subject [jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
Date Wed, 03 Dec 2008 13:27:44 GMT


Uwe Schindler commented on LUCENE-1470:

I have some first "benchmarks" on the three trie implementations. They were done using three
indexes with same content, but encoded in 8bit, 4bit and 2bit, containing real-world data
in 575,000 documents of the PANGAEA repository (see as mentioned before). The
same range queries were executed after some warmup and time was measured until TopDocs returned
the first 10 documents.

The indexes each contain 13 numeric, tree encoded fields (doubles and Dates). Index size (including
the "normal" fields) was:
- 8bit: 4.8 GiB
- 4bit: 5.1 GiB
- 2bit: 5.7 GiB

So the addition 13*8*575,000 extra trie terms (with longer) size took 300 MB going from 8
to 4bit. Going from 4 to 2bit, the additional 13*16*575,000 extra terms took 600 MB (additional
to the step from 8 to 4 bit, but its linear to the number of additional terms).

The performance impact was neglectible (or better the standard deviation was bigger than the
performance plus), so I think still, 8bit is a good idea and index size is the smallest.

My idea, why this is so: Using fewer bits increases number of terms in index (I do not know
how this decreases performance), needs more TermEnum seeks (for each trie precision, 2 seeking
operations are needed). These term enum seeks are faster for 8bit trie varaint, because the
2 chars prefix can be used extensively (first prefix=precision, second char = first byte of
numeric value). With 4bit and 2bit you get only 4bit or 2bits in addition to precision marker.
So seeks in term enum are slower.

In comparison a ConstantScoreRangeQuery on the full precision trie-coded values took about
10-20 times longer. Here the numbers:
For PANGAEA, all queries are half open (we have the problem, that data sets have ittself bounding
boxes - minlatitude, maxlatitude, minlongitude, maxlongitude) and the user searches for bounding
boxes, hits are datasets that *intersect*  the search bounding box. So for a complete lat/lon
constraint you have 4 half open ranges (with the current Trie implementation, this is not
a problem, because two ANDed filters just withit half of the number of terms needed for a
full range. The only performance impact is ANDing the two OpenBitSets). So 4 such half-open
ranges ANDed together take the following times on the given index after some warmup, inkl
fetching the first 10 documents:

ConstantScoreRange: 1500-4000 ms
TrieRangeQuery 8bit: 40ms-100ms
TrieRangeQuery 4bit: 30ms-80ms
TrieRangeQuery 2bit: 20ms-80ms

The old RangeQuery was not tested, but that hits the MaxClauseCount constraint, and if this
is raised to Integer.MAX_VALUE, it tooks approx 1 minute to finish). The numbers are pure
rnage queries, if you additionally add some search terms, time goes up a little bit. Used
was only TrieRangeQuery (so rewrite to constant score), no docs were filtered in the classical
way: termqueries + filtering of results.

More benchmark results follow, when I finish the benchmarker. But these are real world examples.
The search engine machine was a Sun X4600 machine with 16 opteron cores, 64bit Java 1.5, Sun
Java System Web Server, -Xmx 8192M (our current configuration). On lower cost machnies like
desktop pcs, the ConstantScoreRangeQuery takes much more time, but TrieRangeQuery is approx
the same time, as disk io seeks is more the limiting factor. Processor speed and number of
processors is not the limiting factor (if only one query is run in parallel).

> Add TrieRangeQuery to contrib
> -----------------------------
>                 Key: LUCENE-1470
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Uwe Schindler
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch,
> According to the thread in java-dev (
and, I want to include my fast
numerical range query implementation into lucene contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in index in a special format.
> The idea behind this is to store the longs in different precision in index
> and partition the query range in such a way, that the outer boundaries are
> search using terms from the highest precision, but the center of the search
> Range with lower precision. The implementation stores the longs in 8
> different precisions (using a class called TrieUtils). It also has support
> for Doubles, using the IEEE 754 floating-point "double format" bit layout
> with some bit mappings to make them binary sortable. The approach is used in
> rather big indexes, query times are even on low performance desktop
> computers <<100 ms (!) for very big ranges on indexes with 500000 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known Trie structures (but it is not identical
> to real tries, but algorithms are related to it).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message