lucene-dev mailing list archives

From Peter Carlson <carl...@bookandhammer.com>
Subject Re: Bug? QueryParser may not correctly interpret RangeQuery text
Date Wed, 05 Jun 2002 07:01:46 GMT
I guess from my perspective we are at

field:[<goop>-><goop>]

The delimiter is not yet defined, but the options currently discussed are
-
->
;
:
|
>

The problem with - and : is that they may be part of a date format.

The action taken by the QueryParser would depend on the type of field we
were using (if that were an easy change). For Date fields, it would convert
the <goop> to a Date using the SimpleDateFormat and try to guess the format
(I think it will handle the ISO 8601 formats).
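
A minimal sketch of what "try to guess the format" could look like, assuming we simply try a short list of SimpleDateFormat patterns in order. The class name DateGuesser, the pattern list, and the choice of UTC are all assumptions for illustration; none of this exists in Lucene.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper, not part of Lucene: tries a fixed list of patterns in order.
final class DateGuesser {
    // Most specific pattern first, so a full timestamp is tried before a bare date.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd",
    };

    /** Returns the parsed date, or null if no pattern consumes the whole input. */
    static Date guess(String text) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern);
            fmt.setLenient(false);                        // reject values such as month 13
            fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: input is in UTC
            ParsePosition pos = new ParsePosition(0);
            Date d = fmt.parse(text, pos);                // returns null on failure
            if (d != null && pos.getIndex() == text.length()) {
                return d;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(guess("2002-06-05T07:01:46"));
        System.out.println(guess("goop"));
    }
}
```

The order of the patterns matters, and anything not in the list is rejected, so this is really "try the formats we know about" rather than true guessing.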

OR

If adding a type to a field is difficult, then the next option is to just
support a date range and assume the data is a date.

OR

If adding a type to a field is difficult and we don't want to just support a
Date format, then we would create a specific format like
YYYY/MM/DDTHH:MM:SS
for dates, and just a set of digits for numbers.
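
As a rough sketch of this third option, assuming the data type is decided purely from the shape of the text: the class GoopKind and its Kind enum are hypothetical names, and the two regular expressions are just my reading of the format above.

```java
import java.util.regex.Pattern;

// Hypothetical classifier, not part of Lucene: decides the type from the text alone.
final class GoopKind {
    enum Kind { DATE, NUMBER, TEXT }

    // Shape check only: four digits, then two-digit fields, as in YYYY/MM/DDTHH:MM:SS.
    private static final Pattern DATE =
        Pattern.compile("\\d{4}/\\d{2}/\\d{2}T\\d{2}:\\d{2}:\\d{2}");
    // "Just a set of digits" for numbers.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static Kind classify(String goop) {
        if (DATE.matcher(goop).matches()) return Kind.DATE;
        if (NUMBER.matcher(goop).matches()) return Kind.NUMBER;
        return Kind.TEXT; // anything else is left as plain text
    }

    public static void main(String[] args) {
        System.out.println(classify("2002/06/05T07:01:46"));
        System.out.println(classify("20020605"));
        System.out.println(classify("2002-06-05"));
    }
}
```

Note that this only checks the shape, so something like 2002/99/99T99:99:99 would still be classified as a date and would have to be rejected later when it is actually converted.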


Does that sound about right? If so, what are people's preferences?


My preferences are:
Solve with option 3 now, but determine how to solve with option 1.

Delimiter preference would be ">". It seems intuitive to me.
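
To illustrate why ">" is convenient, here is a small sketch that splits field:[<goop>><goop>] by hand, assuming the first colon ends the field name and the first ">" inside the brackets is the delimiter. RangeText is a hypothetical class, not the QueryParser grammar, and the [ ] vs { } handling for inclusive vs. exclusive is taken from the earlier discussion.

```java
// Hypothetical holder for the pieces of a range expression; not part of Lucene.
final class RangeText {
    final String field;
    final String lower;
    final String upper;
    final boolean inclusive;

    private RangeText(String field, String lower, String upper, boolean inclusive) {
        this.field = field;
        this.lower = lower;
        this.upper = upper;
        this.inclusive = inclusive;
    }

    static RangeText parse(String text) {
        int colon = text.indexOf(':'); // the first colon ends the field name
        if (colon < 1 || colon + 2 >= text.length()) {
            throw new IllegalArgumentException("expected field:[lower>upper]: " + text);
        }
        char open = text.charAt(colon + 1);
        char close = text.charAt(text.length() - 1);
        boolean inclusive;
        if (open == '[' && close == ']') {
            inclusive = true;
        } else if (open == '{' && close == '}') {
            inclusive = false;
        } else {
            throw new IllegalArgumentException("expected [ ] or { }: " + text);
        }
        String body = text.substring(colon + 2, text.length() - 1);
        int gt = body.indexOf('>'); // '-' and ':' inside the goop are left alone
        if (gt < 1 || gt == body.length() - 1) {
            throw new IllegalArgumentException("expected lower>upper: " + text);
        }
        return new RangeText(text.substring(0, colon),
                             body.substring(0, gt), body.substring(gt + 1), inclusive);
    }

    public static void main(String[] args) {
        RangeText r = parse("date:[2002-06-01T00:00:00>2002-06-05T07:01:46]");
        System.out.println(r.field + " from " + r.lower + " to " + r.upper);
    }
}
```

Because neither "-" nor ":" is treated specially inside the brackets, ISO 8601 style values pass through untouched, which is the main argument for ">" over those two.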

--Peter



On 6/4/02 10:17 PM, "Otis Gospodnetic" <otis_gospodnetic@yahoo.com> wrote:

> Hello,
> 
> Just curious what the status of this issue is, as the discussion seems
> to have stopped.
> 
> --- "Eric D. Friedman" <eric@conveysoftware.com> wrote:
>> Instead of reinventing the wheel for representing dates, how about
>> using an existing standard?  ISO 8601 defines a simple lexical
>> representation for dates, times (with optional millisecond precision),
>> and timezones that is easy to implement.  This is what's used in the
>> XML Schema "dateTime" datatype.
>> 
>> A summary of the ISO 8601 notation is available here:
>> http://www.cl.cam.ac.uk/~mgk25/iso-time.html
>> 
>> The documentation for the XML Schema dateTime datatype is here:
>> http://www.w3.org/TR/xmlschema-2/#dateTime
> 
> I agree, that is why I immediately suggested YYYY-MM-DD.  I dislike
> U.S.-centric or Europe-centric approaches when there is a standard
> format.
> 
>> I whipped up a JavaCC parser to handle this lexical representation
>> (see attachment).
>> 
>> Note that for this to be useful in QueryParser, it's going to need its
>> own lexical state.  This makes sense anyway, since it would be a
>> mistake to have the query syntax infer magical properties about strings
>> that appear to be dates.  Better is to have a keyword in the query
>> syntax that introduces a date value:  something like date(<VALUE>)
>> would work.  So would to_date(<VALUE>) for those who know SQL. I would
>> have suggested date:<VALUE> but I think that already means something in
>> the QueryParser's lexical specification. (I don't actually use
>> QueryParser because the patches I've submitted previously haven't made
>> it in yet, and until they do, QP is fatally crippled for my purposes).
> 
> I'll try to look for your patches in the archives (if you have the URL
> handy please send it to me), so that I can put it on the TODO list, if
> it makes sense to do so.
> As for the above comments about the parser, I'm afraid I'm still a
> JavaCC neophyte. I don't dislike the date(<VALUE>) approach.  If users can
> grasp field:value they shouldn't have a problem with field:date(value),
> I think.
> 
> Otis
> 
> 
>> On Sun, 2 Jun 2002, Peter Carlson wrote:
>> 
>>> I like this idea of [GOOP:GOOP] as it gives the most flexibility. However,
>>> this requires the field to have a known characteristic like a date field,
>>> number field or text field, correct? If you just use the static Field.Date
>>> this would require adding a new attribute to the Field class? I like this idea
>>> but I don't know the difficulty / backward compatibility issues.
>>> 
>>> If the extra field attribute is too difficult, then I suggest we use the
>>> nnnn-nn-nn format method so we can use the pattern to determine the data
>>> type.
>>> 
>>> For number fields, should this support only integers, or decimal numbers
>>> too?
>>> 
>>> I don't think we should use the : character, because we probably want to
>>> support time formats in the date format. Something like 03/01/2001 at
>>> 00:01:00. Maybe something like ">" or "|" or even "->" ?
>>> 
>>> Also, inclusive vs. exclusive should be accounted for with the [ vs {
>>> characters.  I think this might already be done, but just wanted to throw it
>>> out there.
>>> 
>>> --Peter
>>> 
>>> 
>>> On 6/2/02 2:13 AM, "Brian Goetz" <brian@quiotix.com> wrote:
>>> 
>>>>>> How about:
>>>>>> 
>>>>>>  DATE = nnnn-nn-nn
>>>>>>  NUMBER = n*
>>>>>>  RANGE = [ DATE : DATE ] | [ NUMBER : NUMBER ]
>>>>>> 
>>>>>> An alternate, less parse-oriented approach would be this:
>>>>>>   RANGE = [ GOOP : GOOP ]
>>>>>> where
>>>>>>   GOOP = any string of letters/numbers not containing : or ].
>>>>> 
>>>>> I'd go for the first one as it's more explicit.  However, perhaps the
>>>>> second approach is more extensible?
>>>> 
>>>> When I first did the query parser, I defined terms by inclusion
>>>> (stating valid characters) instead of exclusion (excluding non-term
>>>> characters.)  Turns out I missed quite a few in the first go around,
>>>> which taught me the lesson (again) that sometimes trying to be too
>>>> specific is a rat's nest.  What about dates like 02-Mai-2002 (not a
>>>> typo, French for May)?  Letting DateFormat figure it out has some
>>>> merit.
>>>> 
>>>>> DateField(Date) and NumberField(int) sounds right, but wouldn't Field
>>>>> class make more sense?
>>>> 
>>>> I had in mind static methods of Field, just like Field.Text --
>>>> Field.Date, Field.Number.   Sorry if that wasn't clear.  This seems
>>>> an easy addition.
>>>> 
>>>> --
>>>> To unsubscribe, e-mail:
>> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
>>>> For additional commands, e-mail:
>> <mailto:lucene-dev-help@jakarta.apache.org>
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> PARSER_BEGIN(ISO8601Parser)
>> 
>> import java.io.*;
>> import java.util.*;
>> import java.text.*;
>> 
>> public class ISO8601Parser {
>> 
>>   static DateFormat fmt;
>> 
>>   public static void main(String args[]) throws ParseException {
>>     String date;
>> 
>>     //date = "1999-05-31T13:20:00Z";
>>     //date = "1999-05-31T13:20:00-00:01";
>>     date = "1999-05-31T13:20:00.999-08:00";
>> 
>>     TimeZone utc = TimeZone.getTimeZone("UTC");
>>     fmt = DateFormat.getDateTimeInstance();
>>     fmt.setTimeZone(utc);
>> 
>>     ISO8601Parser parser = new ISO8601Parser(new StringReader(date));
>>     Date d = parser.date();
>>     System.out.println(fmt.format(d));
>>   }
>> }
>> 
>> PARSER_END(ISO8601Parser)
>> 
>> TOKEN :
>> {
>>   <#DIGIT: ["0"-"9"]>
>> | <TWOD: <DIGIT><DIGIT>>         // two digits used for day, month, hours, minutes, seconds
>> | <MILLIS: <TWOD><DIGIT>>        // millisecond precision is 000..999
>> | <YEAR: <TWOD><TWOD>(<DIGIT>)*> // at least 4 digits, but possibly more
>> | <DASH: "-">                    // delimiter for CCYY-MM-DD; doubles as minus sign for signed ints
>> | <COLON: ":">                   // delimiter for hh:mm:ss
>> | <DOT: ".">                     // delimiter for ss.mmm (milliseconds)
>> | <T: "T" >                      // delimiter between date and time
>> | <Z: "Z" >                      // UTC timezone
>> | <PLUS: "+">                    // indicates positive offset from UTC
>> }
>> 
>> /**
>>  * Input to this production is a series of tokens matching the following specification:
>>  * CCYY-MM-DD -- a date with no time specification<br>
>>  * CCYY-MM-DDThh:mm:ss -- a timestamp implicitly in the UTC timezone<br>
>>  * CCYY-MM-DDThh:mm:ssZ -- a timestamp explicitly in the UTC timezone<br>
>>  * CCYY-MM-DDThh:mm:ss-08:00 -- a timestamp with a negative 8 hour offset from UTC<br>
>>  * CCYY-MM-DDThh:mm:ss.mmm -- a timestamp with millisecond precision<br>
>>  * -CCYY-MM-DD -- a date whose year is before the common era (BCE)<br>
>>  * NNCCYY-MM-DD -- a date whose year is > 9999<br>
>>  *
>>  * <p> Note that years greater than 9999 are allowed, but that 0000 is not a valid year.
>>  * Negative numbers are allowed when representing years BCE.
>>  * </p>
>>  *
>>  * <p>Milliseconds are optional in the seconds field.  The timezone indicator is optional.
>>  * </p>
>>  *
>>  *@return a java.util.Date instance in the UTC timezone, with millisecond precision.
>>  */
>> Date date() :
>> {
>>   int CCYY = 0, MM = 0, DD = 0, hh = 0, mm = 0, ss = 0, millis = 0;
>>   int deltahh = 0, deltamm = 0;
>>   boolean deltaPlus = true;
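>>   Calendar c = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
>>   c.clear(); // start from cleared fields so a date-only input does not inherit the current time of day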
>> }
>> {
>>   CCYY = year() <DASH>
>>   MM = twod() <DASH>
>>   DD = twod()
>>   {
>>     MM--; // months are 0 based
>>     c.set(c.YEAR, CCYY);
>>     c.set(c.MONTH, MM);
>>     c.set(c.DAY_OF_MONTH, DD);
>>   }
>>   (
>>     <T>
>>     hh = twod() <COLON>
>>     mm = twod() <COLON>
>>     ss = twod()
>>     {
>>       c.set(c.HOUR_OF_DAY, hh);
>>       c.set(c.MINUTE, mm);
>>       c.set(c.SECOND, ss);
>>     }
>>     (
>>       <DOT>
>>       millis = millis()
>>       {
>>         c.set(c.MILLISECOND, millis);
>>       }
>>     )?
>>     (
>>       <Z> // we're already in UTC, so no adjustment needed
>>       |
>>       (
>>         (
>>           <PLUS> // somewhere ahead of UTC (east of Greenwich)
>>           |
>>           <DASH> // behind UTC (west of Greenwich)
>>           {
>>             deltaPlus = false;
>>           }
>>         )
>>         deltahh = twod() <COLON>
>>         deltamm = twod()
>>         {
>>           if (! deltaPlus) {
>>             deltahh = -deltahh;
>>             deltamm = -deltamm;
>>           }
>>           // millisecond offset
>>           int offsetFromUTC = ((deltahh * 60) + deltamm) * 60 * 1000;
>>           c.set(c.ZONE_OFFSET, offsetFromUTC);
>>         }
>>       )
>>     )?
>>   )?
>>   {
>>     return c.getTime();
>>   }
>> }
>> 
>> int millis() :
>> {
>>   Token t;
>> }
>> {
>>   t = <MILLIS> {
>>     return Integer.parseInt(t.image);
>>   }
>> }
>> 
>> int twod() :
>> {
>>   Token t;
>> }
>> {
>>   t = <TWOD> {
>>     return Integer.parseInt(t.image);
>>   }
>> }
>> 
>> int year() :
>> {
>>   Token t;
>>   boolean positive = true;
>> }
>> {
>>   (
>>     <DASH>
>>     {
>>       positive = false;
>>     }
>>   )?
>>   t = <YEAR> {
>>     int year = Integer.parseInt(t.image);
>>     if (year == 0) {
>>       throw new IllegalArgumentException("0000 is not a legal year");
>>     }
>>     return positive ? year : -year;
>>   }
>> }
> 
> 
> 
> 

