lucene-dev mailing list archives

From Halácsy Péter <halacsy.pe...@axelero.com>
Subject RE: Re(2): Re: [Lucene-dev] Katakana characters in queries (a bug?)
Date Sat, 27 Oct 2001 09:10:50 GMT
Hello,
I think the token definition list has a problem that causes the
ParseException whenever a term starts with a non-English character.
Joanne's solution helps for the three extra letters she added (in
upper and lower case) but does not help for any other character.

A TERM is defined as:
<TERM:      <_IDENTIFIER_CHAR>
              ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "*",
                  "?", "~", "{", "}", "[", "]" ] )* >

That means a term must begin with an IDENTIFIER_CHAR, which may be
followed by any number of other characters.

I don't think the first character needs to be an IDENTIFIER_CHAR, so my
proposal is:
<TERM:   ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "*", "?",
             "~", "{", "}", "[", "]" ] )+ >
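
To make the difference between the two definitions concrete, here is a rough sketch that emulates them with java.util.regex instead of JavaCC. It only checks whether a whole string could be a single TERM token; it is not the generated token manager. The class and method names are made up for illustration, and the unmodified _IDENTIFIER_CHAR set (ASCII letters, digits and "_") is an assumption inferred from Joanne's modified grammar quoted below.

```java
import java.util.regex.Pattern;

public class TermTokenSketch {
    // Character class shared by both definitions: anything except the
    // characters excluded in the TERM rule quoted above.
    private static final String REST = "[^\" \\t():\\&|\\^*?~{}\\[\\]]";

    // Current definition: one _IDENTIFIER_CHAR, then zero or more REST chars.
    // Assumption: unmodified _IDENTIFIER_CHAR is ASCII letters, digits and "_".
    private static final Pattern ORIGINAL =
            Pattern.compile("[a-zA-Z0-9_]" + REST + "*");

    // Proposed definition: one or more REST chars, no special first char.
    private static final Pattern PROPOSED = Pattern.compile(REST + "+");

    static boolean matchesOriginal(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    static boolean matchesProposed(String s) {
        return PROPOSED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "\u00f6l" starts with o-umlaut; the third sample is four Katakana letters.
        String[] samples = { "lucene", "\u00f6l", "\u30ab\u30bf\u30ab\u30ca", "+lucene" };
        for (String s : samples) {
            System.out.println(s + " original=" + matchesOriginal(s)
                    + " proposed=" + matchesProposed(s));
        }
    }
}
```

One thing this sketch suggests checking before applying the proposal: "+", "-" and "!" are not in the excluded set, so the proposed pattern also accepts a string like "+lucene". Since JavaCC prefers the longest match, that input might then be read as a single TERM rather than PLUS followed by TERM.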

On the other hand, the IDENTIFIER, ALPHA_CHAR and ALPHANUM_CHAR tokens
are defined but are not used.

peter

ps: I don't understand the definition of WILD_TERM. It states that a
wild term must end with an IDENTIFIER_CHAR, so it cannot end with *. Is
that the right definition?

> -----Original Message-----
> From: joanne.sproston@teamware.co.uk 
> [mailto:joanne.sproston@teamware.co.uk]
> Sent: Friday, October 26, 2001 6:42 PM
> To: Doug Cutting; 'Brian Goetz'; lucene-dev@jakarta.apache.org
> Subject: Re(2): Re: [Lucene-dev] Katakana characters in queries (a bug?)
> 
> 
> I'm not sure if this is the answer you are looking for, but I overcame
> a similar problem for Finnish characters by modifying the
> queryparser.jj file to contain the following lines:
> 
> /* ***************** */
> /* Token Definitions */
> /* ***************** */
> 
> <*> TOKEN : {
>   <#_ALPHA_CHAR: ["a"-"z", "A"-"Z", "ä", "ö", "Ä", "Ö", "å", "Å"] >
> | <#_NUM_CHAR:   ["0"-"9"] >
> | <#_ALPHANUM_CHAR: [ "a"-"z", "A"-"Z", "0"-"9",
>                       "ä", "ö", "Ä", "Ö", "å", "Å" ] >
> | <#_IDENTIFIER_CHAR: [ "a"-"z", "A"-"Z", "0"-"9", "_",
>                         "ä", "ö", "Ä", "Ö", "å", "Å" ] >
> | <#_IDENTIFIER: <_ALPHA_CHAR> (<_IDENTIFIER_CHAR>)* >
> | <#_NEWLINE:    ( "\r\n" | "\r" | "\n" ) >
> | <#_WHITESPACE: ( " " | "\t" ) >
> | <#_QCHAR:      ( "\\" (<_NEWLINE> | ~["a"-"z", "A"-"Z", "0"-"9",
>                                        "ä", "ö", "Ä", "Ö", "å", "Å"] ) ) >
> | <#_RESTOFLINE: (~["\r", "\n"])* >
> }
> 
> <DEFAULT> TOKEN : {
>   <AND:       ("AND" | "&&" | "and") >
> | <OR:        ("OR" | "||" | "or") >
> | <NOT:       ("NOT" | "!" | "not") >
> | <PLUS:      "+" >
> | <MINUS:     "-" >
> | <LPAREN:    "(" >
> | <RPAREN:    ")" >
> | <COLON:     ":" >
> | <CARAT:     "^" >
> | <STAR:      "*" >
> | <QUOTED:     "\"" (~["\""])+ "\"">
> | <NUMBER:    (<_NUM_CHAR>)+ "." (<_NUM_CHAR>)+ >
> | <TERM:      <_IDENTIFIER_CHAR>
>               ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "*" ] )* >
> }
> 
> <DEFAULT> SKIP : {
>   <<_WHITESPACE>>
> }
> 
> <DEFAULT> TOKEN : {
> <ALL:       (~[]) >
> }
> 
> 
> 
> 
> 
> Doug Cutting  (22/10/2001  16:39):
> >Brian,
> >
> >Do you know what's going on here?  I have not yet had time to look at
> >this.  If you don't have time, and no one else volunteers, then I will
> >look into it.  I would like to fix this for the 1.2 final release, if
> >the change required is not major.
> >
> >Doug
> >
> >> -----Original Message-----
> >> From: Ralf.Zimmermann@cit.de [mailto:Ralf.Zimmermann@cit.de]
> >> Sent: Monday, October 22, 2001 8:56 AM
> >> To: lucene-dev@jakarta.apache.org
> >> Subject: Re: Re: [Lucene-dev] Katakana characters in queries (a bug?)
> >>
> >>
> >>
> >>
> >> Hi,
> >>
> >> Yes, I can confirm this bug. I have the same problem
> >> with query terms starting with German umlauts like 'ä', 'ö'
> >> and 'ü':
> >>
> >> Exception occurred during event dispatching:
> >> org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 1.
> >> Encountered: "\u00f6" (246), after : ""
> >>      at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source)
> >>      at org.apache.lucene.queryParser.QueryParser.jj_ntk(Unknown Source)
> >>      at org.apache.lucene.queryParser.QueryParser.Modifiers(Unknown Source)
> >>      at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
> >>      at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
> >>      at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
> >>      ...
> >>
> >> The problem occurs in Lucene 1.2 RC1 and RC2.
> >>
> >> Regards,
> >> Ralf Zimmermann
> >>
> >>
> 
> --------------------------------------------------------------
> ----------
> 
> Joanne Sproston | Software Engineer
> Teamware Group
> joanne.sproston@teamware.co.uk
> phone: +44 (0)1782 794879  fax: +44 (0)1782  776667
> 
> intra / extra / Internet solutions at www.teamware.com
> 
