lucene-dev mailing list archives

From PERRIN GOURON Olivier <olivier.per...@xml-ais.com>
Subject RE: problem with non latin characters in the query
Date Tue, 05 Nov 2002 09:53:39 GMT

Hello,

Thanks, Eric, for your help. I really appreciate it. I followed the steps
approximately, but I am now stuck.
The transcoding is done through two methods I added to "FileDocument.java".
They determine which charset is to be used:
UTF-16 is detected by the 0xFE 0xFF byte-order mark;
UTF-8 is detected with a heuristic method on the first thousand bytes,
checking that byte pairs belong to specific ranges;
ISO-8859-1 is the default charset.

------------------------------------------------

// requires java.io.BufferedInputStream, FileInputStream, IOException
public static String getEncoding(FileInputStream inTest) {

	try {
		BufferedInputStream buff = new BufferedInputStream(inTest);
		buff.mark(10);
		int byte1 = buff.read();
		int byte2 = buff.read();
		buff.reset();

		// detect UTF-16 by its byte-order mark (0xFE 0xFF)
		if ((byte1 == 0xFE) && (byte2 == 0xFF)) { return "UTF-16"; }

		// detect UTF-8 with the heuristic below; note that isUTF8
		// consumes bytes from the stream, so the file must be
		// reopened before it is read for indexing
		else if (isUTF8(buff, 1000, true)) { return "UTF-8"; }

		// default
		else { return "ISO-8859-1"; }
	}
	catch (IOException e) {
		System.out.println(" GETENCODING > exception of type: "
			+ e.getClass() + " saying: " + e.getMessage());
	}
	return null;
}
		
public static boolean isUTF8(BufferedInputStream buff, int count, boolean allAscii) {

	try {
		int byte1 = buff.read();

		// end of stream: decide from what has been seen so far
		if (byte1 == -1) { return !allAscii; }

		// stop at the 1000th iteration: multi-byte sequences were
		// seen along the way, so assume UTF-8
		if ((count == 0) && !allAscii) { return true; }
		// if all the characters were plain ASCII, do not claim UTF-8
		else if ((count == 0) && allAscii) { return false; }

		else {
			// 0xC0-0xFD is the lead byte of a multi-byte sequence...
			if ((byte1 >= 0xC0) && (byte1 <= 0xFD)) {
				allAscii = false;
				int byte2 = buff.read();
				// ...which must be followed by a continuation
				// byte in the 0x80-0xBF range
				if ((byte2 >= 0x80) && (byte2 <= 0xBF)) {
					return isUTF8(buff, count - 1, allAscii);
				}
				else { return false; }
			}
			else { return isUTF8(buff, count - 1, allAscii); }
		}
	}
	catch (IOException e) {
		System.out.println(" ISUTF8 > exception of type: "
			+ e.getClass() + " saying: " + e.getMessage());
	}
	return false;
}
--------------------------------------------------------------

Then I use the string returned by getEncoding to construct an
InputStreamReader with the right charset:
reader = new BufferedReader(new InputStreamReader(in, charsetName));
I'm not completely sure about this method, but it seems to work fine.
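Put together, the flow looks like this (a minimal sketch; f is a
placeholder for the file being indexed):

------------------------------------------------
// first pass: detect the charset (this consumes bytes from the stream)
String charsetName = getEncoding(new FileInputStream(f));

// second pass: reopen the file and decode it with the detected charset
reader = new BufferedReader(
		new InputStreamReader(new FileInputStream(f), charsetName));
------------------------------------------------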

The query is in a UTF-16 file, and I transcode it directly to the internal
format:
...InputStreamReader(in, "UTF-16"))

So I guess the problem comes from somewhere else.
There is another behavior I would like to show you: as I said in my first
email, characters like éèê cause an error when they are at the beginning of
a word, and are indexed/searched fine when they are in the middle of a
word. These characters belong to the "\u00d8"-"\u00f6" range.
In the same way, any 1-byte non-ASCII character (ranges "\u00c0"-"\u00d6",
"\u00d8"-"\u00f6", "\u00f8"-"\u00ff") will trigger the same behavior.
On the other hand, characters like "œ" ("\u0153") do not cause an error
when they appear at the beginning of a word, but are substituted with a
space character in the query operation ("manœuvre" becomes "man uvre"). The
worst part: when I search "man uvre" directly, Lucene finds the matching
documents, which means that the "œ" was also stripped during the index
operation... So if 2-byte characters are removed, how can one search a
truly Unicode (I mean, containing no ASCII characters) set of documents?

I suspect that somewhere inside Lucene a class operates on 1-byte chars
instead of 2-byte chars, but I am not able to say where or how.
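If it helps, the stripping can be seen with a small tokenization test along
these lines (a sketch; the Analyzer method signatures may differ in your
Lucene version):

------------------------------------------------
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenTest {
	public static void main(String[] args) throws Exception {
		// tokenize "man\u0153uvre"; if the analyzer splits on \u0153,
		// the output is "man" and "uvre" instead of one token
		TokenStream ts = new StandardAnalyzer()
				.tokenStream(new StringReader("man\u0153uvre"));
		for (Token t = ts.next(); t != null; t = ts.next()) {
			System.out.println(t.termText());
		}
	}
}
------------------------------------------------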

Thanks for your help and consideration



-----Original Message-----
From: Eric Isakson [mailto:Eric.Isakson@sas.com]
Sent: Monday, November 4, 2002 8:16 PM
To: Lucene Developers List
Subject: RE: problem with non latin characters in the query


Olivier,

I'm no expert on this by any means, but I poked around in the sources this
morning trying to understand where this problem may be occurring, as I'm
trying to get familiar with any internationalization problems I'm going to
run into with Lucene. This message rambles on a bit, but it follows my
train of thought as I looked at this problem.

It doesn't look to me like the analyzer has anything to do with it.
The problem occurs somewhere inside the
org.apache.lucene.queryParser.QueryParser.jj lexical analysis, so I looked
there to get a better understanding of how that works.

To process your query, QueryParser reads your query using a StringReader.
So, my first question to you is: did the query make it from your UTF-16
query file and get transcoded properly to the Java string (for instance
using an InputStreamReader with a constructor that sets the encoding to
UTF-16, or some String constructor where you supplied a byte[] and a
charset)?
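For instance, something along these lines (just a sketch, with a made-up
file name):

------------------------------------------------
import java.io.*;

public class ReadQuery {
	public static void main(String[] args) throws IOException {
		// "query.txt" is a placeholder for your UTF-16 query file
		Reader r = new InputStreamReader(
				new FileInputStream("query.txt"), "UTF-16");
		StringBuffer sb = new StringBuffer();
		for (int c = r.read(); c != -1; c = r.read()) {
			sb.append((char) c);
		}
		r.close();
		String query = sb.toString();
		// equivalently, from raw bytes: new String(bytes, "UTF-16")
		System.out.println(query);
	}
}
------------------------------------------------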

Assuming that part was handled properly, we next need to look at the query
parser's grammar for problems. Perhaps the character you wish to use is not
part of the grammar for a token; and as for the start characters giving the
same error, we should look here too:

<*> TOKEN : {
  <#_NUM_CHAR:   ["0"-"9"] >
| <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
                          "[", "]", "\"", "{", "}", "~", "*", "?" ] >
| <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                           "[", "]", "\"", "{", "}", "~", "*", "?" ]
                       | <_ESCAPED_CHAR> ) >
| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
| <#_WHITESPACE: ( " " | "\t" ) >
}

<DEFAULT, RangeIn, RangeEx> SKIP : {
  <<_WHITESPACE>>
}

<DEFAULT> TOKEN : {
  <AND:       ("AND" | "&&") >
| <OR:        ("OR" | "||") >
| <NOT:       ("NOT" | "!") >
| <PLUS:      "+" >
| <MINUS:     "-" >
| <LPAREN:    "(" >
| <RPAREN:    ")" >
| <COLON:     ":" >
| <CARAT:     "^" > : Boost
| <QUOTED:     "\"" (~["\""])+ "\"">
| <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
| <FUZZY:     "~" >
| <SLOP:      "~" (<_NUM_CHAR>)+ >
| <PREFIXTERM:  <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
| <WILDTERM:  <_TERM_START_CHAR>
              (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
| <RANGEIN_START: "[" > : RangeIn
| <RANGEEX_START: "{" > : RangeEx
}

... there is a bit more to this, see the
org.apache.lucene.queryParser.QueryParser.jj file for the rest of the
details ...

So, TERM is a _TERM_START_CHAR followed optionally by a series of
_TERM_CHARs; _TERM_CHAR is either _TERM_START_CHAR or _ESCAPED_CHAR;
and _TERM_START_CHAR is the complement of several significant query
characters, or an _ESCAPED_CHAR.
Hmm... since _TERM_START_CHAR includes _ESCAPED_CHAR, why do we need
separate definitions of _TERM_START_CHAR and _TERM_CHAR?

Assuming your string was read with the right character set, this leaves me
wondering whether the complement operator in the JavaCC grammar did the
right thing, or maybe it is still something with the reader. Note also that
the QUOTED production occurs before TERM but also uses the complement
operator, so you may be able to work around the term-start problem you were
having if you quote your terms. Just guessing here; I'm new to JavaCC.
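A quick way to test that workaround (an untested guess on my part;
"contents" is just a placeholder field name):

------------------------------------------------
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

public class QuoteTest {
	public static void main(String[] args) throws Exception {
		// if QUOTED matches where TERM fails, this should parse without
		// a TokenMgrError even though the term starts with a
		// non-ASCII character
		System.out.println(QueryParser.parse(
				"\"\u00e9tude\"", "contents", new StandardAnalyzer()));
	}
}
------------------------------------------------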

The QueryParser.jj file sets a few JavaCC options, and the javacc task in
build.xml has the opportunity to set others but doesn't:

from QueryParser.jj:
options {
  STATIC=false;
  JAVA_UNICODE_ESCAPE=true;
  USER_CHAR_STREAM=true;
}

from build.xml:
    <javacc
      target="${src.dir}/org/apache/lucene/queryParser/QueryParser.jj"
      javacchome="${javacc.zip.dir}"
      outputdirectory="${build.src}/org/apache/lucene/queryParser"
    />

The javacc task has an optional parameter, unicodeinput
(http://jakarta.apache.org/ant/manual/OptionalTasks/javacc.html), that I
got curious about, so I went and read the doc on the JavaCC options (note
this is from the WebGain site and is not the version used by Lucene, though
I'd expect these options to have the same behavior).
http://www.webgain.com/products/java_cc/javaccgrm.html#prod2 states:

STATIC: This is a boolean option whose default value is true. If true, all
methods and class variables are specified as static in the generated parser
and token manager. This allows only one parser object to be present, but it
improves the performance of the parser. To perform multiple parses during
one run of your Java program, you will have to call the ReInit() method to
reinitialize your parser if it is static. If the parser is non-static, you
may use the "new" operator to construct as many parsers as you wish. These
can all be used simultaneously from different threads. 

...

JAVA_UNICODE_ESCAPE: This is a boolean option whose default value is false.
When set to true, the generated parser uses an input stream object that
processes Java Unicode escapes (\u...) before sending characters to the
token manager. By default, Java Unicode escapes are not processed. 
This option is ignored if either of options USER_TOKEN_MANAGER,
USER_CHAR_STREAM is set to true. 

UNICODE_INPUT: This is a boolean option whose default value is false. When
set to true, the generated parser uses an input stream object that reads
Unicode files. By default, ASCII files are assumed. 
This option is ignored if either of options USER_TOKEN_MANAGER,
USER_CHAR_STREAM is set to true. 

...

USER_CHAR_STREAM: This is a boolean option whose default value is false. The
default action is to generate a character stream reader as specified by the
options JAVA_UNICODE_ESCAPE and UNICODE_INPUT. The generated token manager
receives characters from this stream reader. If this option is set to true,
then the token manager is generated to read characters from any character
stream reader of type "CharStream.java". This file is generated into the
generated parser directory. 
This option is ignored if USER_TOKEN_MANAGER is set to true. 

So, the JAVA_UNICODE_ESCAPE option that is set is actually ignored,
assuming the JavaCC 2.0 and 2.1 behavior is the same. (On a side note, this
option seems to imply that a Java Unicode escape sequence in the input
would be read as a single character by the token manager. Is that what it
means, and is that what we really want? It seems to me that UNICODE_INPUT
is what we should be setting to true.) And we have USER_CHAR_STREAM set to
true, which I think is where the FastCharStream implemented in Lucene comes
into play:

  public Query parse(String query) throws ParseException, TokenMgrError {
    ReInit(new FastCharStream(new StringReader(query)));
    return Query(field);
  }

Nothing in there seems to point at a potential problem. So I started
looking at the generated token manager. Here is where I got very lost...
it's hard to follow the generated code since I'm not familiar with how
JavaCC works in general :(

It uses a bunch of switch statements across several methods on the current
character in order to parse out the tokens. I wasn't really able to follow
it closely, so I figured I'd stop here and wait on your response to the
first part about transcoding. If you have already done that, I'm wondering
if anyone else might shed some light on this. Just wondering if the
UNICODE_INPUT option might make JavaCC's complement for _TERM_START_CHAR
match the characters outside of ASCII in the token manager (it might be
doing this already; I'm just not savvy enough to realize it).
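If someone wants to experiment with that, the build.xml tweak would
presumably look like this (untested; and per the docs quoted above,
unicodeinput is ignored while USER_CHAR_STREAM is true, so the options in
QueryParser.jj would need adjusting as well):

    <javacc
      target="${src.dir}/org/apache/lucene/queryParser/QueryParser.jj"
      javacchome="${javacc.zip.dir}"
      outputdirectory="${build.src}/org/apache/lucene/queryParser"
      unicodeinput="true"
    />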

Anyway, that's the end of my rambling for now... even if I'm off the mark,
I hope it was useful to hear.

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com


-----Original Message-----
From: PERRIN GOURON Olivier [mailto:olivier.perrin@xml-ais.com]
Sent: Monday, November 04, 2002 9:34 AM
To: 'lucene-dev@jakarta.apache.org'
Subject: problem with non latin characters in the query



Hello,

I am using Lucene to index UTF-16 and UTF-8 files. Those files are
trans-encoded to the right format so that they can be indexed with Lucene.
The index is searched with queries made from a UTF-16 file.
Everything works fine as long as my query file contains Latin characters
(even French-specific chars such as éàèêœ...).
Problems occur when the UTF-16 query file contains non-Latin characters. I
have tried Russian characters, such as И (\u0418), but Lucene sends me this
error:

	Exception in thread "main" org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 8.  Encountered: "\u0018" (24), after : ""
        at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_scan_token(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_3_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_2_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Clause(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.CherchLeTex.main(CherchLeTex.java:51)

It seems that the QueryParser doesn't use the right code for И... I have
tried Greek, and it does the same.
Is it due to the analyzer? I don't think so, since I changed my
StandardAnalyzer for the FrenchAnalyzer and it still behaves the same. Is
it due to the query parser?...

Another problem that gives exactly the same error message occurs when a
word in my query starts with a local character (éàèêœ...). This is weird,
since local characters do not trigger errors when they are in the middle of
a word.
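For what it's worth, the failure reduces to a tiny program (a sketch of
what I observe; "contents" and the choice of analyzer are placeholders):

------------------------------------------------
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

public class Repro {
	public static void main(String[] args) throws Exception {
		// "\u00e9tude" starts with a non-ASCII character and throws the
		// TokenMgrError above; a word with the accent in the middle,
		// e.g. "fr\u00e8re", parses fine
		System.out.println(QueryParser.parse(
				"\u00e9tude", "contents", new StandardAnalyzer()));
	}
}
------------------------------------------------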

Have you ever met this problem? I would appreciate your help and advice.

Thanks for your consideration

Olivier Perrin-Gouron
AIS Berger-Levrault


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
