lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] RangeQuery and multi-value fields
Date Thu, 23 Jun 2011 01:51:18 GMT
> On Tue, Jun 21, 2011 at 12:42:43AM -0500, Peter Karman wrote:
> > I want to override the behavior of the RangeQuery class to support my pseudo
> > multi-value fields, which I achieve by concatenating values with the \x03 byte.

OK, there's another option which has suddenly become more attractive. :)  My
Eventful colleague Dan Markham has submitted a trie implementation that can be
used for generating numeric ranges:

    https://issues.apache.org/jira/browse/LUCY-159

It is to some degree based on the algorithm used by Lucene's NumericRangeQuery:

    http://s.apache.org/QOx

We can potentially use these two sources as initial implementations to build a
module that does what you need it to.

Say that you have the following documents:

    { id => 'a', nums => "199", trie => "1xx 19x 199" }
    { id => 'b', nums => "209", trie => "2xx 20x 209" }
    { id => 'c', nums => "211", trie => "2xx 21x 211" }

The "trie" field contains prefixes derived from each value within the "nums"
field.
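
To make the derivation concrete, here is a rough sketch of how such a "trie"
field could be generated.  It is not taken from LUCY-159 or from Lucy itself;
the function names, the fixed width of 3 digits, and the zero-padding of short
values are all assumptions made for illustration.

```python
def trie_terms(value, width=3):
    """Return base-10 prefix terms for one non-negative integer.

    Assumption: values are zero-padded to a fixed width so that every
    term has the same length, e.g. 199 -> ["1xx", "19x", "199"].
    """
    digits = str(value).zfill(width)
    if value < 0 or len(digits) != width:
        raise ValueError("value does not fit in %d digits" % width)
    # Keep the first n digits and replace the rest with the 'x' wildcard.
    return [digits[:n] + "x" * (width - n) for n in range(1, width + 1)]


def trie_field(nums, width=3):
    """Build the whitespace-separated trie field from a nums field."""
    return " ".join(
        term for num in nums.split() for term in trie_terms(int(num), width)
    )


if __name__ == "__main__":
    for nums in ("199", "209", "211"):
        print(nums, "=>", trie_field(nums))
```

The output for the three sample documents should line up with the "trie"
values shown above.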

The presence of the trie field allows us to convert searches for ranges into
searches for collections of terms.

    100 .. 199  =>  1xx
    100 .. 119  =>  10x OR 11x
    200 .. 300  =>  2xx OR 300
    410 .. 422  =>  41x OR 420 OR 421 OR 422
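
One way to produce those term lists is to walk the prefix tree from the top
and emit a prefix whenever its whole block falls inside the range.  The sketch
below is my own illustration in base 10 with 3-digit values, not the algorithm
from LUCY-159 or Lucene; `range_to_terms` and its parameters are hypothetical
names.

```python
def range_to_terms(lo, hi, width=3):
    """Convert an inclusive integer range into base-10 prefix terms."""
    if width < 1 or not 0 <= lo <= hi < 10 ** width:
        raise ValueError("range must satisfy 0 <= lo <= hi < 10**width")
    terms = []

    def cover(start, k):
        # This block spans [start, end] and has k wildcard digits.
        end = start + 10 ** k - 1
        if end < lo or start > hi:
            return  # block lies entirely outside the range
        # The all-wildcard prefix ("xxx") is never indexed, so the top
        # block is always split even when the range covers everything.
        if lo <= start and end <= hi and k < width:
            digits = str(start).zfill(width)
            terms.append(digits[: width - k] + "x" * k)
            return
        # Partially covered: descend into the ten sub-blocks.
        for i in range(10):
            cover(start + i * 10 ** (k - 1), k - 1)

    cover(0, width)
    return terms


if __name__ == "__main__":
    for lo, hi in ((100, 199), (100, 119), (200, 300), (410, 422)):
        print(lo, "..", hi, "=>", " OR ".join(range_to_terms(lo, hi)))
```

Because blocks are visited in ascending order, the terms come out in the same
order as in the table above.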

In the example above, the "trie" numbers use base 10 when determining their
prefixes.  In NumericRangeQuery, there is a "precisionStep" measured in bits
which is used to determine the prefixes.  (The materials Dan submitted use a
hard-coded base 3, which happened to be best for our esoteric purposes.)
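
For comparison, here is a simplified sketch of what a bit-based precision step
means: instead of dropping decimal digits, each coarser term drops
`precision_step` low-order bits.  This only illustrates the idea.  Lucene's
real encoding differs in its details (for example in how terms are serialized
and how signed values are handled), and the function name and defaults below
are assumptions.

```python
def shifted_terms(value, precision_step=4, bits=32):
    """Return (shift, prefix) pairs for an unsigned integer.

    Each pair identifies the value with its lowest `shift` bits removed,
    which plays the same role as the base-10 prefixes "19x" and "1xx".
    """
    if precision_step < 1 or not 0 <= value < (1 << bits):
        raise ValueError("value must fit in an unsigned %d-bit integer" % bits)
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]


if __name__ == "__main__":
    # 199 is 0xC7; with a 16-bit value and 4-bit steps there are four terms.
    print(shifted_terms(199, precision_step=4, bits=16))
```

A smaller precision step yields more terms per value at index time but fewer
OR'd terms per range at search time, which is the tradeoff being tuned.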

The samples above only have one number in the "nums" field, but you can have
as many as you want:
    
    { id => 'd', nums => "205 351", trie => "2xx 20x 205 3xx 35x 351" }

A range query for 200..300 (encoded as "2xx OR 300") would match that
document, as would a range query for 351..353 (encoded as
"351 OR 352 OR 353").
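
The matching itself is nothing more than an OR over terms.  As a toy
illustration, using an in-memory dict in place of a real index (this is not
the Lucy API; `search` is a made-up helper), the four sample documents behave
like this:

```python
# Trie fields copied from the sample documents in this message.
docs = {
    "a": "1xx 19x 199",
    "b": "2xx 20x 209",
    "c": "2xx 21x 211",
    "d": "2xx 20x 205 3xx 35x 351",
}


def search(docs, query_terms):
    """Return ids of documents whose trie field contains any query term."""
    wanted = set(query_terms)
    return sorted(
        doc_id for doc_id, trie in docs.items() if wanted & set(trie.split())
    )


if __name__ == "__main__":
    print(search(docs, ["2xx", "300"]))         # range 200..300
    print(search(docs, ["351", "352", "353"]))  # range 351..353
```

Document 'd' shows up in both result sets because either of its two values is
enough to satisfy a range.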

Range queries built this way max out at a workable number of OR'd terms, as
opposed to the "dumb" alternative of expressing a range like "400..500" as 101
OR'd terms ("400 401 402 [...] 500").  With base-10 prefixes, that same range
collapses to "4xx OR 500".  The NumericRangeQuery documentation explains the
tuning tradeoffs.

Marvin Humphrey

