lucene-java-user mailing list archives

From Don Gilbert <gilbe...@bio.indiana.edu>
Subject Another way to handle large numeric range queries
Date Wed, 09 Jun 2004 02:55:48 GMT

I ran into this problem using the current Lucene implementation
of RangeQuery applied to genome data (searching a chromosome
range from 1..20MB).  We wanted to use Lucene queries like

  +organism:fruitfly +chromosome:X +location:[1000000 5000000]
  
to find all the genome features (1000s to 100,000s) that are
listed in some megabase range of a genome.  This failed
quickly, even with small ranges, using the basic Lucene RangeQuery.
My solution was to record each document that falls in the
query range in a BitSet:

class NumRangeQuery extends Query
  public NumRangeQuery(Term first, Term last, boolean inc);

-- full numeric (integer) range query, can handle large ranges.
-- builds a BitSet of the documents within range once, and feeds it back to
the Searcher through score(HitCollector c, int end) as often as it is called.
-- query semantics are same as for RangeQuery
-- implicit assumptions are
 -- first, last Term have integer values, as does indexed field
 -- indexed field is recoded for alphanumeric sorting;
   e.g.  2 -> 0000000002, 10 -> 0000000010, -3 ->  -0000000003
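
To illustrate the two assumptions above, here is a minimal, self-contained
sketch (not the actual NumRangeQuery source). It pads integers to 10 digits so
that string order matches numeric order, then walks the sorted terms between
the two bounds once and sets one bit per matching document. The class name,
the helper names, and the TreeMap standing in for the index's sorted term list
are all hypothetical, and the sketch covers non-negative values only, since
sign-prefixed strings need extra care to sort numerically.

```java
import java.util.BitSet;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical stand-in for the idea behind NumRangeQuery; it uses no Lucene classes.
class NumRangeSketch {

    // Left-pad a non-negative integer to 10 digits, e.g. 2 -> 0000000002.
    static String encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch covers non-negative values only");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    // Assumed toy index: encoded location term -> ids of documents containing it.
    static TreeMap<String, int[]> sampleIndex() {
        TreeMap<String, int[]> terms = new TreeMap<>();
        terms.put(encode(500000), new int[] {0});
        terms.put(encode(1000000), new int[] {1, 2});
        terms.put(encode(3000000), new int[] {3});
        terms.put(encode(5000000), new int[] {4});
        terms.put(encode(7000000), new int[] {5});
        return terms;
    }

    // Visit only the terms between the bounds and set one bit per document.
    // The cost grows with the number of terms in range, but no query clauses are built.
    static BitSet docsInRange(TreeMap<String, int[]> terms, int first, int last,
                              boolean inclusive, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] docs : terms.subMap(encode(first), inclusive,
                                       encode(last), inclusive).values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(encode(2));   // prints 0000000002
        System.out.println(encode(10));  // prints 0000000010
        // location:[1000000 5000000] over the toy index
        System.out.println(docsInRange(sampleIndex(), 1000000, 5000000, true, 6));
    }
}
```

Because the bit set is built once, it can be replayed to a collector any number
of times without revisiting the terms.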

You can find this as part of the 'LuceGene' package for searching
genome and bioinformatics databases at http://www.gmod.org/lucegene/
with the Lucene-related source code in CVS here:

http://cvs.sourceforge.net/viewcvs.py/gmod/lucegene/src/org/eugenes/index/
NumRangeQuery.java -- range searches of integer fields.
LGQueryParser.java -- extension of QueryParser for NumRangeQuery (& other)
BioDataAnalyzer.java -- NumberField formats field for indexing 

-- Don Gilbert
> Date: Tue, 18 May 2004 13:35:55 -0700
> From: Andy Goodell <goodell@gmail.com>
> Subject: How to handle range queries over large ranges and avoid Too Many Boolean cla
> 
> In our application we had a similar problem with non-date ranges until
> we realized that it wasn't so much that we were searching for the
> values in the range as restricting the search to that range, and then
> we used an extension to the org.apache.lucene.search.Filter class, and
> our implementation got much simpler and faster.
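
The distinction Andy draws (restricting a search to a range rather than
searching for every value in it) can be sketched without any Lucene classes.
This is only an illustration under assumptions: the class and method names are
hypothetical, and the per-document value array stands in for whatever a real
Filter subclass would read from the index to fill its bit set.

```java
import java.util.BitSet;

// Hypothetical illustration of filter-style restriction; it uses no Lucene classes.
class RangeFilterSketch {

    // One bit per document whose value lies in [lo, hi], computed once.
    static BitSet allowedDocs(int[] valueByDoc, int lo, int hi) {
        BitSet allowed = new BitSet(valueByDoc.length);
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            if (valueByDoc[doc] >= lo && valueByDoc[doc] <= hi) {
                allowed.set(doc);
            }
        }
        return allowed;
    }

    // Keep only the query hits that the filter allows; the range itself
    // contributes no clauses to the query.
    static BitSet restrict(BitSet queryHits, BitSet allowed) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(allowed);
        return result;
    }

    public static void main(String[] args) {
        int[] locations = {500000, 1000000, 2500000, 5000000, 7000000};
        BitSet allowed = allowedDocs(locations, 1000000, 5000000);

        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(2);
        hits.set(4);

        System.out.println(restrict(hits, allowed)); // prints {2}
    }
}
```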

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd@indiana.edu--http://marmot.bio.indiana.edu/


