From: "Eric Isakson"
To: "Lucene Users List"
Subject: RE: Does Lucene support UNICODE?
Date: Tue, 8 Jun 2004 13:42:37 -0400

org.apache.lucene.demo.FileDocument.Document(File) is invoked from
IndexFiles and does:

    Reader reader = new BufferedReader(new InputStreamReader(is));

Notice that the InputStreamReader does not specify an encoding, so your
default encoding is being used.

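To illustrate the point about the missing encoding, here is a minimal sketch that opens a file with an explicitly named charset instead of the platform default. It uses only the JDK. The class name EncodingAwareReader and its two methods are hypothetical helpers made up for this example; they are not part of Lucene or its demo, and the charset for each file is assumed to be known by the caller.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Lucene or its demo.
class EncodingAwareReader {

    // Opens the file and decodes its bytes with the named charset
    // (for example "UTF-8" or "UTF-16") rather than the platform default.
    static Reader open(File file, String charsetName) throws IOException {
        Charset charset = Charset.forName(charsetName);
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset));
    }

    // Reads the whole file into a String; useful for checking that the
    // chosen charset really decodes the file as expected.
    static String readAll(File file, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = open(file, charsetName)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

The Reader returned by open could be used wherever the demo currently builds its own reader, so that UTF-8 and UTF-16 files are each decoded with the right charset before their text is indexed.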
You should probably write your own glue to create your index, code that
knows how to read your files in the appropriate encoding.

Also, please don't cross-post to the dev and user lists; this is an
appropriate question for the user list.

Eric

-----Original Message-----
From: Satish Kagathare [mailto:satishk@it.iitb.ac.in]
Sent: Tuesday, June 08, 2004 3:07 AM
To: Lucene Users List; lucene-dev@jakarta.apache.org
Subject: Does Lucene support UNICODE?

Hello,

Does Lucene support Unicode search and indexing of Unicode data
(especially Devanagari Unicode data)?

Does it make any difference between UTF-8 and UTF-16 Unicode docs? I ask
because Java strings use UTF-16.

I tried indexing Devanagari Unicode data (UTF-8 and UTF-16) using
IndexFiles and IndexHTML from the Lucene demo, and then searching for a
query using Tomcat, but it shows only the UTF-8 files and also shows
files which do not contain the query. It also does not show the summary
of the fetched docs in the correct format.

I have also changed the Unicode range in HTMLparser.jj,
StandardTokenizer.jj and QueryParser.jj, and in the analyzer, while
indexing and parsing the query, but the output does not reflect any
changes.

Do I have to write my own analyzer for Devanagari Unicode data, or will
StandardAnalyzer work for any language? Or does it require more changes?

Please mention problems and solutions.

Thanks in advance,
Satish Kagathara,
IIT Bombay.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org