lucene-java-user mailing list archives

From Tatu Saloranta
Subject Re: Number range search through Query subclass
Date Sun, 16 Feb 2003 03:46:55 GMT
On Friday 14 February 2003 02:58, Volker Luedeling wrote:
> Hi,
> I am writing an application that constructs Lucene searches from XML
> queries. Each item from the XML is represented by a Query of the
> corresponding type. I have a problem when I try to search for number
> ranges, since RangeQuery compares strings, not numbers, so 15 < 155 < 20.
> What I need is a subclass of Query that evaluates numbers correctly. I have
> tried subclassing RangeQuery, MultiTermQuery or Query directly, but each
> time I have run into problems with inheritance and access rights to various
> methods or inner classes. 
> Does anyone know of a solution to this problem? If there is none, the only
> way I can think of would be indexing numbers as something like "#15#". But
> it's not a very elegant solution when all I need is a slight variation of
> one existing class. 
> Thanks for any help you can offer,

Actually, the problem is not (just) the query; it's the
tokenizer/analyzer/indexer as well. For a range query to work, tokens have to
sort correctly in lexical (~= alphabetic) order. I don't think using #s as
markers would work, as they do not make the tokens sort properly (plus, most
analyzers would just remove those characters).

The usual way to do this is to use a suitable numeric format for the indexed
data. For dates, a format like YYYY-MM-DD works OK (i.e. date tokens sorted
alphabetically are also in date order). For other numbers (like timestamps),
what is usually done is padding, so that the numbers in your case would be
"015", "155" and "020" (instead of a leading 0, any other character that sorts
before '1' would do). So you need to know the biggest number you'd need to
index and use the appropriate amount of zero padding.
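
As a rough illustration of the padding idea, here is a minimal sketch in plain
Java (no Lucene calls). The class name PaddedNumbers, the method pad() and the
width of 3 are made up for this example; it assumes non-negative numbers and
that you apply the same padding both when indexing and when building the range
bounds.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
final class PaddedNumbers {
    private PaddedNumbers() {}

    /** Left-pads a non-negative number with zeros to a fixed width. */
    static String pad(long value, int width) {
        if (value < 0) {
            // Assumption: negative numbers would need a different encoding.
            throw new IllegalArgumentException("negative values not supported");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in width " + width);
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // Unpadded strings sort lexically, so "155" lands between "15" and "20".
        String[] raw = {"20", "155", "15"};
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw));    // [15, 155, 20]

        // With fixed-width padding, lexical order matches numeric order.
        String[] padded = {pad(20, 3), pad(155, 3), pad(15, 3)};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded)); // [015, 020, 155]
    }
}
```

The width has to be chosen from the largest value you expect; pad() above
throws rather than silently emitting a longer token that would sort wrongly.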

Now, if you store these numbers as single values in a separate index, padding
is easy to do. If you are trying to pick up arbitrary numeric data embedded in
otherwise plain text content, things are a bit more complicated.

Hope this helps,

-+ Tatu +-
