lucene-dev mailing list archives

From Halácsy Péter <halacsy.pe...@axelero.com>
Subject RE: Re(2): Re: [Lucene-dev] Katakana characters in queries (a bug?)
Date Wed, 31 Oct 2001 08:23:57 GMT
Hello,


> -----Original Message-----
> From: Doug Cutting [mailto:DCutting@grandcentral.com]
> >
> > I think IDENTIFIER_CHAR doesn't need to be the first char, so my
> > proposal is:
> > <TERM:   ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "*", "?",
> >            "~", "{", "}", "[", "]" ] )+ >
> 
> That looks like the right approach to me.
> 
> > On the other hand, the IDENTIFIER, ALPHA_CHAR, and ALPHANUM_CHAR
> > tokens are defined but never used.
> 
> So let's remove them!
I also removed _NEWLINE, _QCHAR and _RESTOFLINE. They weren't used.

> > These changes yield the following token definitions in QueryParser.jj:
> 
> <*> TOKEN : {
>   <#_NUM_CHAR:   ["0"-"9"] >
> | <#_TERM_CHAR: ~["\"", " ", "\t", "(", ")", ":", "&", "|",
>                   "^", "*", "?", "~", "{", "}", "[", "]" ] >
> | <#_NEWLINE:    ( "\r\n" | "\r" | "\n" ) >
> | <#_WHITESPACE: ( " " | "\t" ) >
> | <#_QCHAR:      ( "\\" (<_NEWLINE> | ~["a"-"z", "A"-"Z", "0"-"9"] ) ) >
> | <#_RESTOFLINE: (~["\r", "\n"])* >
> }
> 
> <DEFAULT> TOKEN : {
>   <AND:       ("AND" | "&&") >
> | <OR:        ("OR" | "||") >
> | <NOT:       ("NOT" | "!") >
> | <PLUS:      "+" >
> | <MINUS:     "-" >
> | <LPAREN:    "(" >
> | <RPAREN:    ")" >
> | <COLON:     ":" >
> | <CARAT:     "^" >
> | <STAR:      "*" >
> | <QUOTED:     "\"" (~["\""])+ "\"">
> | <NUMBER:    (["+","-"])? (<_NUM_CHAR>)+ "." (<_NUM_CHAR>)+ >
> | <TERM:      (<_TERM_CHAR>)+ >
> | <FUZZY:     "~" >
> | <WILDTERM:  <_TERM_CHAR>
>               ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "~",
>                   "{", "}", "[", "]" ] )+ <_TERM_CHAR>>
> | <RANGEIN:   "[" (~["]"])+ "]">
> | <RANGEEX:   "{" (~["}"])+ "}">
> }
> 
> <DEFAULT> SKIP : {
>   <<_WHITESPACE>>
> }
> 
> Can folks try these and tell me if it solves the problem?
> 

I tried them, but they didn't solve all of the problems, because the
generated parser can't handle characters _outside_ ISO-LATIN1. Some
accented characters are defined in ISO-LATIN1, but two Hungarian
characters, for example, exist only in ISO-LATIN2. (I don't know where
the Katakana characters fall.)

The QueryParser.jj file in CVS uses ASCII_CharStream (Latin-1):
"ASCII_CharStream generated when neither of the two options -
UNICODE_INPUT or JAVA_UNICODE_ESCAPE is set. 
This class treats the input as a stream of 1-byte (ISO-LATIN1)
characters. Note that this class can also be used to parse binary files.
It just reads a byte and returns it as a 16 bit quantity to the lexical
analyzer. So any character returned by this class will be in the range
'\u0000'-'\u00ff'. " (source:
http://www.webgain.com/products/java_cc/charstream.html)

I prefer Unicode, since the common use of QueryParser is through its
String constructor, and that string is fed to a StringReader (which can
return non-ASCII characters). That's why I added the UNICODE_INPUT=true
option.
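As a sketch of that change (UNICODE_INPUT is a real JavaCC option; the rest of the grammar file is abbreviated here, not the actual QueryParser.jj contents), the options block at the top of the .jj file would gain the flag:

```javacc
// Abbreviated sketch; only the UNICODE_INPUT line is the proposed change.
options {
  UNICODE_INPUT = true;  // generate a UCode_CharStream instead of ASCII_CharStream
}

PARSER_BEGIN(QueryParser)
// ... parser class body unchanged ...
PARSER_END(QueryParser)
```

With this option set, JavaCC generates a character stream class that passes full 16-bit characters to the lexer instead of truncating input to the '\u0000'-'\u00ff' range.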

I tested with some Unicode-specific characters (for example character
337, i.e. U+0151) and got good results.
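A minimal standalone check of the point above (plain JDK, no Lucene involved; the class name is mine): a Java String fed through a StringReader yields characters above '\u00ff', which an ASCII_CharStream-based parser would never see correctly.

```java
import java.io.Reader;
import java.io.StringReader;

public class StringReaderUnicodeDemo {
    public static void main(String[] args) throws Exception {
        // "\u0151" is character 337 (Hungarian o with double acute),
        // defined in ISO-LATIN2 but outside ISO-LATIN1 ('\u0000'-'\u00ff').
        Reader in = new StringReader("k\u0151");
        System.out.println(in.read()); // 107: 'k', inside Latin-1
        System.out.println(in.read()); // 337: outside Latin-1, needs UNICODE_INPUT
    }
}
```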

I've attached the modified .jj file; perhaps it can help.

> Ideally we should add some cases for this to the junit tests, but I
> can't get junit to work at all right now...  Have the junit tests ever
> run correctly from ant since the move to Jakarta?  Can someone more
> familiar with junit have a look at this?
> 
> Doug
> 

peter

