lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eric Isakson" <Eric.Isak...@sas.com>
Subject RE: Indexing and searching non-latin languages using utf-8
Date Tue, 18 Mar 2003 16:52:17 GMT
Have you verified that your form inputs are getting to your query objects without the String
being mangled due to encoding problems?

I'm getting japanese in UTF-8 and use the technique described at http://w6.metronet.com/~wjm/tomcat/2001/Aug/msg00230.html
to get the data from the browser to Lucene. I build my index using the HTMLParser in the lucene
demos and give them a Reader object that was created from an InputStreamReader that specifies
the HTML file encodings (Shift_jis in my case).

There are a bunch of other issues I'm working on to support Japanese, but I'm getting search
results at this point.

The two places that encodings should come into play for you are parsing your source content
into the Reader or String that you use to create org.apache.lucene.document.Field objects
and getting the user query from their browser to the Query objects.

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com



-----Original Message-----
From: MERCIER ALEXANDRE [mailto:alexandre.mercier@unilog.fr] 
Sent: Tuesday, March 18, 2003 11:36 AM
To: lucene-user@jakarta.apache.org
Subject: Indexing and searching non-latin languages using utf-8


Hi all,

I've a matter with indexing then searching docs written in non-latin languages and encoded
in utf-8 (Russian, by example).

I have a web application, with a simple form to search in the contents of the docs. When I
submit the form, I encode the query term in utf-8 with
encodeURI(String) but I match no doc. I think that is due to a bad indexing but I'm not sure.

Lucene is normally indexing docs in writing Terms in the 'xxx.tis' file, encoding it in utf-8,
I believe. So when it reads the file, it correctly gets russian characters (2 bytes) but when
writing them in the index, they seem different (I've listed the terms in my application console).

If someone has a solution to resolve my problem, all advices are welcome.

Thanks.
Alex


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message