From: "Bruno Tirel"
To: "'Lucene Users List'"
Subject: RE: Problems indexing Japanese with CJKAnalyzer
Date: Thu, 15 Jul 2004 12:15:02 +0200

Hi All,

I am also trying to localize everything for a French application, using UTF-8 encoding. I have already applied what Jon described, and I fully confirm his recommendation for the HTMLParser and HTMLDocument changes, with the Unicode input option and the explicit "UTF-8" encoding.

One case still does not work for me: using metadata from the HTML document, as in the demo3 example. Whether I convert it to "UTF-8" or to "ISO-8859-1", it is still not correctly encoded when I check with Luke. The word "Propriété" shows up either as "Propri?t?" with a square, or as "Propriã©tã©".

My local code page is Cp1252, so it should be treated as ISO-8859-1. I get the same result when I use the local FileEncoding parameter.

All the other fields are correctly encoded into UTF-8, tokenized, and successfully searched through the JSP page.

Has anybody else run into this issue? Any help would be appreciated.

Best regards,
Bruno

-----Original Message-----
From: Jon Schuster [mailto:jons@wrq.com]
Sent: Wednesday, July 14, 2004 22:51
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things working, and here's an update so that other folks might have an easier time in similar situations.

The problem I had was indeed with the encoding, but it was more than just the encoding on the initial creation of the HTMLParser (from the Lucene demo package).

In HTMLDocument, doing this:

    InputStreamReader reader = new InputStreamReader(
        new FileInputStream(f), "SJIS");
    HTMLParser parser = new HTMLParser( reader );

creates the parser and feeds it Unicode from the original Shift-JIS encoded document, but then when the document contents are fetched using this line:

    Field fld = Field.Text("contents", parser.getReader() );

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter using the default encoding, which in my case was Windows 1252 (essentially Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of "UTF8" on both the Reader and Writer got things mostly working. The one missing piece was in the "options" section of the HTMLParser.jj file. The original grammar file generates an input character stream class that treats the input as a stream of 1-byte characters. To have JavaCC generate a stream class that handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to the options section; add an explicit "UTF8" encoding on Reader and Writer creation in getReader(). As far as I can tell, this change works fine for all of the languages I need to handle, which are English, French, German, and Japanese.

HTMLDocument - add an explicit encoding of "SJIS" when creating the Reader used to create the HTMLParser. (For western languages, I use an encoding of "ISO8859_1".)

And of course, use the right language tokenizer.

--Jon

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
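
To illustrate why the explicit encodings discussed in this thread matter, here is a minimal, self-contained sketch. It is not taken from the Lucene demo code: the class name EncodingDemo and the helper decode are hypothetical and only mirror the InputStreamReader-with-charset pattern quoted from HTMLDocument. It decodes the UTF-8 bytes of "Propriété" once with the matching charset and once as ISO-8859-1. The final lowercasing step is only an assumption about how the garbled form reported from Luke might arise (for example through a lowercasing analyzer); the thread does not confirm that.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo class, not part of Lucene or its demo package.
public class EncodingDemo {

    // Decode bytes through an InputStreamReader with an explicit charset,
    // the same pattern as new InputStreamReader(new FileInputStream(f), "SJIS").
    public static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String word = "Propri\u00e9t\u00e9";
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        // Matching charset: the round trip is lossless.
        String good = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(good.equals(word)); // true

        // Mismatched charset: each 2-byte UTF-8 sequence becomes two Latin-1 chars.
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(bad.length()); // 11 instead of 9

        // Assumption: lowercasing the mis-decoded text gives U+00E3 U+00A9 pairs.
        String lowered = bad.toLowerCase(Locale.ROOT);
        System.out.println(lowered.equals("propri\u00e3\u00a9t\u00e3\u00a9")); // true
    }
}
```

The point of the sketch is that every Reader or Writer between the file and the index needs the charset that matches the bytes it sees; a single one that falls back to the platform default (Cp1252 in both reports here) is enough to corrupt accented or double-byte characters.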