lucene-dev mailing list archives

From Damian Gajda <dga...@caltha.pl>
Subject Re: the future of DateField
Date Tue, 17 Aug 2004 17:34:07 GMT
Hello,

> as we all know, DateField currently leads to problems with range queries 
> and prefix queries as it saves dates with millisecond precision. I suggest 
> the following changes:
> 
> -Deprecate everything currently in DateField
> 
> -Add these methods to DateField:
> 
>   public static String dateToString(Date date, DateResolution resolution)
>   public static String timeToString(long time, DateResolution resolution)
> 
> These will return strings in format "YYYYMM" if DateResolution is set to 
> DateResolution.MONTH, "YYYYMMDDHHMM" if DateResolution is set to 
> DateResolution.MINUTE etc. DateResolution is a typesafe enumeration. 
> There's no default resolution, i.e. we force users to think about which 
> resolution they need. The format is slightly longer than the radix one 
> that's currently used, but it makes debugging easier. If a Date object is 
> used, dates before 1970 can be indexed (which is not the case now).
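The two proposed methods could look roughly like the sketch below. It is only an illustration of the quoted proposal: the Resolution enum stands in for the proposed DateResolution, the class name DateStrings is made up, and fixing the time zone to GMT is an assumption the proposal does not state.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateStrings {
    /** Hypothetical stand-in for the proposed DateResolution enumeration. */
    enum Resolution {
        YEAR("yyyy"), MONTH("yyyyMM"), DAY("yyyyMMdd"),
        HOUR("yyyyMMddHH"), MINUTE("yyyyMMddHHmm");

        final String pattern;

        Resolution(String pattern) {
            this.pattern = pattern;
        }
    }

    /** Formats a Date as a sortable string, truncated to the given resolution. */
    static String dateToString(Date date, Resolution resolution) {
        // Assumption: terms are always written in GMT so they compare consistently.
        SimpleDateFormat fmt = new SimpleDateFormat(resolution.pattern, Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(date);
    }

    /** Same as dateToString, but takes milliseconds since 1970-01-01 GMT. */
    static String timeToString(long time, Resolution resolution) {
        return dateToString(new Date(time), resolution);
    }
}
```

Because there is no overload without a resolution, the caller has to pick one, and negative millisecond values (dates before 1970) format like any other date.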

Recently I have been having a real fight with RangeQueries on date
fields. The problem I encountered comes from RangeQuery expansion.
The indexes I am searching are not very big, but they contain a lot of
documents with date fields. Because of that I hit the wall with the
BooleanQuery clause limit (which can be raised, but only at the risk of
creating a VERY BIG query). I thought that lowering the precision of date
fields would help me - I was wrong. The number of different dates my users
generate is almost the same as the number of documents they produce.
Since the dates are separated by hours, lowering the precision does
not lower the number of generated terms.
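To make the effect concrete, here is a small sketch that counts distinct terms at different precisions. The data is made up (one document every three hours), and the class and method names are hypothetical; it only illustrates the argument above and touches no Lucene API.

```java
import java.util.HashSet;
import java.util.Set;

final class TermCount {
    static final long HOUR_MS = 60L * 60L * 1000L;
    static final long DAY_MS = 24L * HOUR_MS;

    /** Number of distinct terms when every timestamp is truncated to bucketMs. */
    static int distinctTerms(long[] timestamps, long bucketMs) {
        Set<Long> terms = new HashSet<>();
        for (long t : timestamps) {
            terms.add(Math.floorDiv(t, bucketMs));
        }
        return terms.size();
    }

    /** Made-up data: one document every three hours, plus a few seconds of jitter. */
    static long[] sampleTimestamps(int docs) {
        long[] ts = new long[docs];
        for (int i = 0; i < docs; i++) {
            ts[i] = i * 3L * HOUR_MS + (i % 60) * 1000L;
        }
        return ts;
    }
}
```

With 2000 such documents, millisecond precision and hour precision both give 2000 distinct terms; only day precision brings the count down (to 250). A range query over the whole period therefore still expands to far more clauses than BooleanQuery allows by default (1024, as far as I know).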

The only real solution would be a different approach to indexing dates,
but I think that is not a simple one. Probably the best would be something
based on prefixes (not the way Lucene handles them right now, i.e. by
query expansion, but in the index structures).
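One possible reading of "prefixes in the index structures", and it is only a guess at a design, is to index every date at several resolutions at once, so that "everything in August 2004" becomes a lookup of the single term 200408 instead of an expansion over every distinct date. The class name and the chosen set of resolutions below are assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

final class PrefixTerms {
    // Assumed resolutions: year, month, day, hour.
    private static final String[] PATTERNS = {"yyyy", "yyyyMM", "yyyyMMdd", "yyyyMMddHH"};

    /** Returns one term per resolution for the given time (milliseconds, GMT). */
    static List<String> termsFor(long timeMs) {
        List<String> terms = new ArrayList<>();
        for (String pattern : PATTERNS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
            fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
            terms.add(fmt.format(new Date(timeMs)));
        }
        return terms;
    }
}
```

The price is several terms per document instead of one, and a range that does not line up with a year, month or day boundary still has to be split into a handful of such terms.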

Of course, if someone would like to implement different handling for
"prefixed (prefixable) terms" (dunno what to call it), I am willing to
help, but only if time allows :(

Daniel's suggestion to store dates in a human-readable format is nice
when it comes to querying the indexes by hand, and it also makes prefix
queries possible (give me stuff that has dates in 2003, i.e. 2003*).
But on the other hand it has some disadvantages. One is sorting.

My users need to sort search results by date, and because the system is
under heavy load, I needed to decrease the processing time and memory
used by the sorting code. That is why I had to move from dates represented
the way Daniel suggests to decimal integer numbers. This creates very
ugly looking "date" strings but needs only 4 bytes per term while
sorting. That IS a memory advantage. Trying to fit a date represented as
200401011456 into an int value does not work, since the largest int is
2147483647 (that is at least 2 positions too small).
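The arithmetic can be checked with the sketch below. The encoding shown (whole minutes since 1970-01-01 GMT) is just one assumed example of a date that fits in an int, and the class and method names are hypothetical.

```java
final class CompactDate {
    static final long MINUTE_MS = 60_000L;

    /** True if the value can be stored in a 32-bit signed int. */
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    /** Assumed encoding: whole minutes since 1970-01-01T00:00 GMT. */
    static int toEpochMinutes(long timeMs) {
        // toIntExact throws ArithmeticException instead of silently overflowing.
        return Math.toIntExact(Math.floorDiv(timeMs, MINUTE_MS));
    }

    /** Inverse of toEpochMinutes, with seconds and milliseconds lost. */
    static long toMillis(int epochMinutes) {
        return epochMinutes * MINUTE_MS;
    }
}
```

Here fitsInInt(200401011456L) is false, while the same minute-level instant counted as minutes since 1970 is a number of about 18 million, so an int covers roughly four thousand years either side of 1970 at this precision.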

Maybe the solution to the date sorting problem would be an extension
to the field info file. The field info would store the type of the values
a field holds. Just the basic types, so that discovering the type
while sorting does not rely on blind luck.

To sum up: in order to create decent date support for Lucene, one
needs to take the following problems into account:
- precision (maybe some people will need smaller indexes and can give
precision up)
- sorting (this is probably a must for many of us)
- date term readability (who wouldn't like to use some of QueryParser's
features :))
- usage of prefix queries (give me August 2004 :) please)
- usage of range queries (what the hell happened last week)
- methods of expansion (maybe execution) of prefix and range queries
involving date fields (this is not so funny ;))

Just my 2cents :)

Regards,
-- 
Damian Gajda



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

