Message-ID: <1A6B6A5A3597C340BB63728001DC787958AF13@kodos.na.wrq.com>
From: Jon Schuster
To: "'Lucene Users List'"
Subject: RE: Problems indexing Japanese with CJKAnalyzer
Date: Wed, 14 Jul 2004 13:50:58 -0700

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things working, and here's an update so that other folks might have an easier time in similar situations.
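Before getting into the Lucene specifics, here's a small standalone example (no Lucene involved; the class name and strings are just for illustration) of the underlying issue: bytes written in one encoding get silently garbled if a Reader decodes them with a different one.

    import java.io.UnsupportedEncodingException;

    public class EncodingDemo {
        // Encode a string with one charset, then decode the raw bytes with
        // another. If the charsets don't match, the text comes back garbled.
        static String roundTrip(String s, String writeEnc, String readEnc)
                throws UnsupportedEncodingException {
            byte[] bytes = s.getBytes(writeEnc);  // e.g. the Shift-JIS bytes on disk
            return new String(bytes, readEnc);    // how a Reader would decode them
        }

        public static void main(String[] args) throws Exception {
            String japanese = "\u65e5\u672c\u8a9e"; // "Nihongo"
            // Correct: decode Shift-JIS bytes as Shift-JIS.
            System.out.println(roundTrip(japanese, "SJIS", "SJIS").equals(japanese));
            // prints "true"
            // Wrong: decode the same bytes as Latin-1 -- mojibake, no exception.
            System.out.println(roundTrip(japanese, "SJIS", "ISO8859_1").equals(japanese));
            // prints "false"
        }
    }

The dangerous part is that the wrong decoding fails silently, so the index just fills up with junk. The fix described below is making sure every Reader and Writer in the indexing path agrees on the encoding.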
The problem I had was indeed with the encoding, but it was more than just the encoding used when creating the HTMLParser (from the Lucene demo package). In HTMLDocument, doing this:

    InputStreamReader reader = new InputStreamReader(
        new FileInputStream(f), "SJIS");
    HTMLParser parser = new HTMLParser(reader);

creates the parser and feeds it Unicode decoded from the original Shift-JIS document. But then, when the document contents are fetched using this line:

    Field fld = Field.Text("contents", parser.getReader());

HTMLParser.getReader creates an InputStreamReader and an OutputStreamWriter using the platform default encoding, which in my case was Windows-1252 (essentially Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of "UTF8" on both the Reader and the Writer got things mostly working. The one missing piece was in the "options" section of HTMLParser.jj. The original grammar file generates an input character stream class that treats the input as a stream of 1-byte characters. To have JavaCC generate a stream class that handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to the options section; add an explicit "UTF8" encoding when creating the Reader and Writer in getReader(). As far as I can tell, this change works fine for all of the languages I need to handle, which are English, French, German, and Japanese.

HTMLDocument - add an explicit encoding of "SJIS" when creating the Reader used to create the HTMLParser. (For Western languages, I use the encoding "ISO8859_1".)

And of course, use the right language tokenizer.

--Jon

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org