lucene-dev mailing list archives

From "Eric Isakson" <Eric.Isak...@sas.com>
Subject RE: problem with non latin characters in the query
Date Mon, 04 Nov 2002 19:15:52 GMT
Olivier,

I'm no expert on this by any means, but I poked around in the sources this morning trying
to understand where this problem may be occurring as I'm trying to get familiar with any internationalization
problems I'm going to run into with Lucene. This message rambles on a bit, but follows my
train of thought as I looked at this problem.

It doesn't look to me like the analyzer will have anything to do with it. The problem occurs
somewhere inside org.apache.lucene.queryParser.QueryParser.jj lexical analysis so I looked
there to get a better understanding of how that works. 

To process your query, QueryParser reads it through a StringReader. So, my first question
to you is: did the query make it from your UTF-16 query file into the Java String with the
right transcoding (for instance, using an InputStreamReader constructed with the encoding set
to UTF-16, or some String constructor where you supplied a byte[] and charset)?
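
To make the transcoding question concrete, here is a minimal sketch of both routes mentioned
above: an InputStreamReader with an explicit charset, and a String constructor taking a byte[]
and a charset. The class name QueryDecoding, the helper readUtf16 and the "title" field are
made up for illustration, and StandardCharsets needs a newer JDK than the one current at the
time of this thread (on older JDKs the charset name "UTF-16" would be passed as a String
instead). The last line shows what decoding with the wrong charset does to the same bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class QueryDecoding {

    // Decode a stream of UTF-16 bytes into a Java String using an
    // InputStreamReader that is told the encoding explicitly.
    static String readUtf16(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a UTF-16 query file (hypothetical query).
        byte[] utf16 = "title:\u0418".getBytes(StandardCharsets.UTF_16);

        String viaReader = readUtf16(new ByteArrayInputStream(utf16));
        String viaConstructor = new String(utf16, StandardCharsets.UTF_16);

        // Decoding the same bytes with the wrong charset yields one char per byte.
        String wrong = new String(utf16, StandardCharsets.ISO_8859_1);

        System.out.println(viaReader.equals(viaConstructor)); // true
        System.out.println(viaReader.length());               // 7
        System.out.println(wrong.length());                   // 16
    }
}
```

If the string handed to QueryParser.parse looks like the "wrong" one here, the problem is
upstream of Lucene and the grammar is not the place to look.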

Assuming that part was handled properly, we next need to look at the query parser's grammar
for problems. Perhaps the character you wish to use is not part of the grammar for a token.
Since the start characters give the same error, we should look here for those too:

<*> TOKEN : {
  <#_NUM_CHAR:   ["0"-"9"] >
| <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
                          "[", "]", "\"", "{", "}", "~", "*", "?" ] >
| <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                           "[", "]", "\"", "{", "}", "~", "*", "?" ]
                       | <_ESCAPED_CHAR> ) >
| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
| <#_WHITESPACE: ( " " | "\t" ) >
}

<DEFAULT, RangeIn, RangeEx> SKIP : {
  <<_WHITESPACE>>
}

<DEFAULT> TOKEN : {
  <AND:       ("AND" | "&&") >
| <OR:        ("OR" | "||") >
| <NOT:       ("NOT" | "!") >
| <PLUS:      "+" >
| <MINUS:     "-" >
| <LPAREN:    "(" >
| <RPAREN:    ")" >
| <COLON:     ":" >
| <CARAT:     "^" > : Boost
| <QUOTED:     "\"" (~["\""])+ "\"">
| <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
| <FUZZY:     "~" >
| <SLOP:      "~" (<_NUM_CHAR>)+ >
| <PREFIXTERM:  <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
| <WILDTERM:  <_TERM_START_CHAR>
              (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
| <RANGEIN_START: "[" > : RangeIn
| <RANGEEX_START: "{" > : RangeEx
}

... there is a bit more to this, see the org.apache.lucene.queryParser.QueryParser.jj file
for the rest of the details ...

So, TERM is a _TERM_START_CHAR optionally followed by a series of _TERM_CHAR,
_TERM_CHAR is either _TERM_START_CHAR or _ESCAPED_CHAR,
and _TERM_START_CHAR is the complement of several significant query characters or an _ESCAPED_CHAR.
Hmm...since _TERM_START_CHAR includes _ESCAPED_CHAR, why do we need a separate definition
of _TERM_START_CHAR and _TERM_CHAR?
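
As a rough illustration of what the complement in _TERM_START_CHAR means on paper, here is a
small sketch that models only the unescaped branch of that production as a plain Java
predicate. The class TermStartSketch and the method isTermStartChar are invented for this
example; this is not the code JavaCC generates, and it says nothing about how the generated
token manager actually treats characters outside ASCII.

```java
public class TermStartSketch {

    // The characters listed inside ~[ ... ] in the _TERM_START_CHAR production.
    private static final String EXCLUDED = " \t+-!():^[]\"{}~*?";

    // Unescaped branch of _TERM_START_CHAR: any character NOT in the list above.
    static boolean isTermStartChar(char c) {
        return EXCLUDED.indexOf(c) < 0;
    }

    public static void main(String[] args) {
        System.out.println(isTermStartChar('a'));      // true
        System.out.println(isTermStartChar('\u00e9')); // true  (e with acute accent)
        System.out.println(isTermStartChar('\u0418')); // true  (Cyrillic capital I)
        System.out.println(isTermStartChar('+'));      // false
    }
}
```

Read this way, the grammar itself does not exclude accented or Cyrillic characters from
starting a term, which is why the reader and the generated token manager are the next things
to check.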

Assuming your string was read with the right character set, this leaves me wondering if the
complement operator in the JavaCC grammar did the right thing or maybe it is still something
with the reader. Note also that the QUOTED production occurs before TERM but also uses the
complement operator, so you may be able to work around the term start problem you were having
if you quote your terms. Just guessing here, I'm new to JavaCC.

The QueryParser.jj file sets a few javacc options and the javacc task in build.xml has the
opportunity to set others but doesn't:

from QueryParser.jj:
options {
  STATIC=false;
  JAVA_UNICODE_ESCAPE=true;
  USER_CHAR_STREAM=true;
}

from build.xml:
    <javacc
      target="${src.dir}/org/apache/lucene/queryParser/QueryParser.jj"
      javacchome="${javacc.zip.dir}"
      outputdirectory="${build.src}/org/apache/lucene/queryParser"
    />

The javacc task has an optional parameter unicodeinput (http://jakarta.apache.org/ant/manual/OptionalTasks/javacc.html)
that I got curious about, so I went and read the doc on the javacc options. Note that this is from
the webgain site and is not the version used by Lucene, though I'd expect these options to
have the same behavior. http://www.webgain.com/products/java_cc/javaccgrm.html#prod2 states:

STATIC: This is a boolean option whose default value is true. If true, all methods and class
variables are specified as static in the generated parser and token manager. This allows only
one parser object to be present, but it improves the performance of the parser. To perform
multiple parses during one run of your Java program, you will have to call the ReInit() method
to reinitialize your parser if it is static. If the parser is non-static, you may use the
"new" operator to construct as many parsers as you wish. These can all be used simultaneously
from different threads. 

...

JAVA_UNICODE_ESCAPE: This is a boolean option whose default value is false. When set to true,
the generated parser uses an input stream object that processes Java Unicode escapes (\u...)
before sending characters to the token manager. By default, Java Unicode escapes are not processed.

This option is ignored if either of options USER_TOKEN_MANAGER, USER_CHAR_STREAM is set to
true. 

UNICODE_INPUT: This is a boolean option whose default value is false. When set to true, the
generated parser uses an input stream object that reads Unicode files. By default, ASCII
files are assumed. 
This option is ignored if either of options USER_TOKEN_MANAGER, USER_CHAR_STREAM is set to
true. 

...

USER_CHAR_STREAM: This is a boolean option whose default value is false. The default action
is to generate a character stream reader as specified by the options JAVA_UNICODE_ESCAPE and
UNICODE_INPUT. The generated token manager receives characters from this stream reader. If
this option is set to true, then the token manager is generated to read characters from any
character stream reader of type "CharStream.java". This file is generated into the generated
parser directory. 
This option is ignored if USER_TOKEN_MANAGER is set to true. 

So, the JAVA_UNICODE_ESCAPE option that is set is ignored, assuming the javacc 2.0 and 2.1
behavior is the same. (On a side note, this option seems to me to imply that a Java Unicode
escape sequence in the input would be read as a single character by the token manager. Is that
what it means, and is that what we really want? It seems to me that UNICODE_INPUT is what we
should be setting to true.) We also have USER_CHAR_STREAM set to true, which I think is where the
FastCharStream that is implemented in Lucene comes into play...

  public Query parse(String query) throws ParseException, TokenMgrError {
    ReInit(new FastCharStream(new StringReader(query)));
    return Query(field);
  }

Nothing in there seems to point at a potential problem. So I started looking at the generated
token manager. Here is where I got very lost...hard to follow the generated code since I'm
not familiar with how JavaCC works in general :(

It uses a bunch of switch statements across several methods on the current character in order
to parse out the tokens. I wasn't really able to follow it closely, and figured I'd stop here
and wait on your response to the first part about transcoding. If you had done that, I was
wondering if anyone else might shed some light on this. Just wondering if the UNICODE_INPUT
option might make JavaCC's complement for _TERM_START_CHAR match the characters outside of
ASCII in the token manager (it might be doing this already, I'm just not savvy enough to realize
it).
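
One detail in the error from the original message (quoted below) may be worth keeping in mind
while reading the generated code: the input character was \u0418, but the token manager
reports "\u0018" (24). Those two values differ only in the high byte. The sketch below is
purely a hypothesis about how such a value could arise, not something I found in the Lucene
or JavaCC sources; HighByteSketch and narrowToByte are made-up names. It just shows that
keeping only the low 8 bits of the char turns 0x0418 into 0x18.

```java
public class HighByteSketch {

    // Keep only the low 8 bits of a UTF-16 code unit, as code that assumes
    // 8-bit input might do. Hypothetical; not taken from Lucene or JavaCC.
    static char narrowToByte(char c) {
        return (char) (c & 0xFF);
    }

    public static void main(String[] args) {
        char cyrillicI = '\u0418';
        char narrowed = narrowToByte(cyrillicI);
        // prints "U+0418 -> U+0018 (24)"
        System.out.printf("U+%04X -> U+%04X (%d)%n",
                (int) cyrillicI, (int) narrowed, (int) narrowed);
    }
}
```

If something along the path from the query file to the token manager behaves like this, it
would explain the number in the message, but I have not confirmed where (or whether) that
happens.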

Anyway, that's the end of my rambling for now...even if I'm off the mark, hope it was useful
to hear.

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com


-----Original Message-----
From: PERRIN GOURON Olivier [mailto:olivier.perrin@xml-ais.com]
Sent: Monday, November 04, 2002 9:34 AM
To: 'lucene-dev@jakarta.apache.org'
Subject: problem with non latin characters in the query



Hello,

I am using Lucene to index UTF-16 and UTF-8 files. Those files are
trans-encoded to the right format so that they can be indexed with Lucene.
The index is searched with queries read from a UTF-16 file.
Everything works fine as long as my query file contains Latin characters (even
specific French chars such as éàèêoe...)
Problems occur when the UTF-16 query file contains non-Latin characters. I
have tried Russian characters, such as И, which is \u0418, but Lucene sends
me this error:

	Exception in thread "main" org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 8.  Encountered: "\u0018" (24), after : ""
        at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_scan_token(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_3_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_2_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Clause(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.CherchLeTex.main(CherchLeTex.java:51)

It seems that the queryParser doesn't use the right code for И... I have tried
Greek, and it does the same.
Is it due to the analyser? I don't think so, since I changed my
StandardAnalyser for the FrenchAnalyser, and it still behaves the same.
Is it due to the query parser?...

Another problem that gives exactly the same error message occurs when a
word in my query starts with a local character (éàèêoe...). This is weird,
since local characters do not trigger errors when they are in the middle of
the word.

Have you ever met this problem? I would appreciate your help and advice.

Thanks for your consideration

Olivier Perrin-Gouron
AIS Berger-Levrault


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

