lucene-java-user mailing list archives

From "Armbrust, Daniel C." <Armbrust.Dan...@mayo.edu>
Subject RE: Numeric Support
Date Fri, 26 Jul 2002 16:09:13 GMT
My current use cases for numbers are mainly dates (in the format YYYYMMDD), but also some
numbers (only up to 3 digits long, however), for which we have used the zero-padding method.
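
A minimal sketch of the zero-padding approach mentioned above, assuming non-negative integers
and a fixed field width. The class PaddedNumbers and its pad method are hypothetical names
used for illustration; they are not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: zero-pads non-negative integers
// so that string (lexicographic) order matches numeric order.
final class PaddedNumbers {
    private PaddedNumbers() {}

    /** Pads a non-negative value with leading zeros to exactly `width` digits. */
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need an offset first");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0'); // leading zeros keep every term the same length
        }
        return sb.append(digits).toString();
    }
}
```

Without padding, "7" sorts after "42" as a string; once both are padded to the same width,
string order and numeric order agree. Dates written as YYYYMMDD are already fixed width, so
they sort correctly without further padding.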


Apologies if any of my assumptions about how the filtering mechanisms work are wrong; I definitely
don't have a lot of knowledge of the inner workings of Lucene.

I don't know what a "good" numbers implementation is, but the way that I do it now, with filters
on the bit set after the results come back, just feels like a hack.  Even if bit sets are very
fast, it doesn't seem right to iterate over nearly the entire set of terms to filter them when I
ask for results with a number 000050 < x < 050000.  It seems like the terms outside that range
shouldn't be put into the term enumeration in the first place, rather than having to be filtered
out afterwards.
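
To illustrate the point about not visiting out-of-range terms: if the padded terms are held in
sorted order, a range lookup can seek directly to the lower bound and stop at the upper bound.
The sketch below uses a plain java.util.TreeSet as a stand-in for a sorted term list. It makes
no claim about how Lucene's own term enumeration or filters work, and TermRangeSketch is a
hypothetical name.

```java
// Hypothetical illustration, not Lucene code: a TreeSet stands in for a
// sorted list of zero-padded terms.
final class TermRangeSketch {
    private TermRangeSketch() {}

    /**
     * Returns the terms strictly between lower and upper (both exclusive).
     * subSet returns a view bounded by the two keys, so iterating it never
     * touches terms outside the requested range.
     */
    static java.util.SortedSet<String> between(
            java.util.TreeSet<String> terms, String lower, String upper) {
        return terms.subSet(lower, false, upper, false);
    }
}
```

For the query 000050 < x < 050000, between(terms, "000050", "050000") would return only the
terms inside that window.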

It doesn't seem to scale very well, though I have no tests or data to back this up.  Admittedly,
it has worked for us thus far.

I'm concerned, however, that if we start to put in more data (especially non-integer data) by
doing something like multiplying by 10,000 (or whatever the decimal shift needs to be; it
gets even more hackish if I have to add an offset to all the values to make the negative values
positive) and then padding out to X digits, and we start chaining together multiple filters on
multiple different number fields, our performance is going to degrade very significantly.
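
The scale, shift, and pad scheme described above can be sketched as follows. The scale factor
(10,000), the offset, and the width are assumptions chosen for illustration, and ScaledNumbers
is a hypothetical class, not an existing Lucene API.

```java
// Hypothetical helper, not part of Lucene. All constants are assumed values.
final class ScaledNumbers {
    static final long SCALE = 10_000L;          // keeps 4 decimal places
    static final long OFFSET = 1_000_000_000L;  // added so negatives become non-negative
    static final int WIDTH = 10;                // digits needed for values up to 2 * OFFSET

    private ScaledNumbers() {}

    /** Encodes a value in (-100000, 100000) as a fixed-width, sortable term. */
    static String encode(double value) {
        if (Double.isNaN(value) || Math.abs(value) >= OFFSET / (double) SCALE) {
            throw new IllegalArgumentException("value out of supported range: " + value);
        }
        // Shift the decimal point, round to a whole number, then shift above zero.
        long shifted = Math.round(value * SCALE) + OFFSET;
        StringBuilder sb = new StringBuilder(Long.toString(shifted));
        while (sb.length() < WIDTH) {
            sb.insert(0, '0'); // left-pad to the fixed width
        }
        return sb.toString();
    }

    /** Reverses encode, up to the precision kept by SCALE. */
    static double decode(String term) {
        return (Long.parseLong(term) - OFFSET) / (double) SCALE;
    }
}
```

With these constants, -1.5 encodes to a term that sorts before the term for 2.25, so a
string range comparison gives the same answer as a numeric one. The cost is that the range and
precision must be fixed before indexing.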


Since I am working with an index that is currently about 12 GB, I have to look very closely
at this.  I'm sure that blueprints for a real number field would have to involve positive
and negative decimal values, with full support for returning range matches on them.  That
would better fit my needs, but from your (the Lucene designers') perspective, I'm sure there
are more features, requirements, etc. that should be taken into account that I can't think of
right now.

While I'm typing, I should add a thank you to all of you designers/contributors.  Lucene really
is a great program, and the amount of support on the list highly impresses me.  I don't know
how high scalability ranked among the design goals when Lucene's blueprints were
laid out, but we have over 12 million documents (about 80 GB worth of XML) that we have indexed,
and have not even had to go to any great lengths to make the performance good.  It was good
right out of the box.

Dan



-----Original Message-----
From: Peter Carlson [mailto:carlson@bookandhammer.com]
Sent: Friday, July 26, 2002 9:38 AM
To: Lucene Users List
Subject: Numeric Support


What would be the criteria for Numeric support?

Currently we are looking to work around the lack of numeric support by
creating a NumberField (exactly analogous to DateField).

The idea right now is that it would pad the numbers to some fixed width, and
handle only integers at first.

If you mean behind-the-scenes numeric support, where you can pass in a number
to a document field, or be able to know that a given token should be a
number and be mixed with text in the same field, I understand why that would
be substantial.

However, just being able to search for a number by converting it to a
standard text format, and sort by that field correctly, I think that is on
the way, although slowly. This idea also supports range searches since the
number in text form will be alphanumerically ordered like numbers. Any math
will be difficult though.

This will require any number in the query string to be converted to
the standard number format using a static method on a class like NumberField.
It will also require a separate field that contains only number-formatted terms.

I hope that helps. 

Does this meet any of your criteria for number support?

--Peter




