From: "Eric Isakson"
To: "Lucene Users List"
Subject: RE: Indexing and searching non-latin languages using utf-8
Date: Tue, 18 Mar 2003 11:52:17 -0500

Have you verified that your form inputs are getting to your
query objects without the String being mangled due to encoding problems? I'm getting Japanese in UTF-8 and use the technique described at http://w6.metronet.com/~wjm/tomcat/2001/Aug/msg00230.html to get the data from the browser to Lucene. I build my index using the HTMLParser in the Lucene demos and give it a Reader object that was created from an InputStreamReader that specifies the HTML file's encoding (Shift_JIS in my case).

There are a bunch of other issues I'm working on to support Japanese, but I'm getting search results at this point.

The two places that encodings should come into play for you are parsing your source content into the Reader or String that you use to create org.apache.lucene.document.Field objects and getting the user query from their browser to the Query objects.

Eric

--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com

-----Original Message-----
From: MERCIER ALEXANDRE [mailto:alexandre.mercier@unilog.fr]
Sent: Tuesday, March 18, 2003 11:36 AM
To: lucene-user@jakarta.apache.org
Subject: Indexing and searching non-latin languages using utf-8

Hi all,

I have a problem with indexing and then searching docs written in non-Latin languages and encoded in UTF-8 (Russian, for example).

I have a web application with a simple form to search the contents of the docs. When I submit the form, I encode the query term in UTF-8 with encodeURI(String), but I match no docs. I think that is due to bad indexing, but I'm not sure.

I believe Lucene normally indexes docs by writing Terms to the 'xxx.tis' file, encoded in UTF-8. So when it reads the file, it correctly gets the Russian characters (2 bytes each), but when it writes them to the index, they seem different (I've listed the terms in my application console).

If someone has a solution to my problem, any advice is welcome.

Thanks.
Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
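
The two encoding boundaries named in the reply above (source content into a Reader, and the user's query from the browser into a String) can be illustrated with a small sketch. This is only an illustration under assumptions, not the code from the thread or from the linked post: the class EncodingSketch and its methods are hypothetical names, the Lucene calls themselves are left out so that only the JDK is needed, and it uses java.nio.charset.StandardCharsets, which is newer than the JDKs in use when this thread was written. The second helper assumes the servlet container decoded the UTF-8 form bytes as ISO-8859-1, which may or may not be what happens in a given setup.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Lucene or of the original posts.
class EncodingSketch {

    // Boundary 1: wrap the raw bytes of a source document in a Reader that
    // decodes them with the document's real encoding. A Reader like this is
    // what would then be handed to the parser or used to build a Field.
    static Reader openWithEncoding(InputStream in, Charset encoding) {
        return new InputStreamReader(in, encoding);
    }

    // Drain a Reader into a String so the decoded text can be inspected.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Boundary 2 (assumption: the container decoded UTF-8 form bytes as
    // ISO-8859-1): recover the original bytes and decode them as UTF-8.
    static String redecodeAsUtf8(String mangled) {
        byte[] rawBytes = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A Russian word written with escapes so the source file's own
        // encoding does not matter; each letter is 2 bytes in UTF-8.
        String original = "\u043f\u043e\u0438\u0441\u043a";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the right charset gives back the original text.
        String good = readAll(openWithEncoding(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));
        System.out.println("decoded correctly: " + good.equals(original));

        // Decoding with the wrong charset yields one char per byte, so a
        // term built from it will not match what was indexed.
        String mangled = readAll(openWithEncoding(
                new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1));
        System.out.println("mangled differs: " + !mangled.equals(original));

        // Re-decoding the mangled query restores the intended characters.
        System.out.println("repaired: " + redecodeAsUtf8(mangled).equals(original));
    }
}
```

Listing the indexed terms and the query term as sequences of char codes on both sides, rather than printing them to a console that has its own encoding, is one way to check whether the mismatch comes from indexing or from the query path.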