lucene-java-user mailing list archives

From Jon Schuster <j...@wrq.com>
Subject RE: Problems indexing Japanese with CJKAnalyzer
Date Wed, 14 Jul 2004 20:50:58 GMT
Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things
working, and here's an update so that other folks might have an easier time
in similar situations.

The problem I had was indeed with the encoding, but it was more than just
the encoding on the initial creation of the HTMLParser (from the Lucene demo
package). In HTMLDocument, doing this:

	InputStreamReader reader =
	    new InputStreamReader(new FileInputStream(f), "SJIS");
	HTMLParser parser = new HTMLParser(reader);

creates the parser and feeds it Unicode characters decoded from the original
Shift-JIS document, but then when the document contents are fetched with this
line:

	Field fld = Field.Text("contents", parser.getReader() );

HTMLParser.getReader creates an InputStreamReader and an OutputStreamWriter
using the platform default encoding, which in my case was Windows-1252
(essentially Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of "UTF8" on
both the Reader and Writer got things mostly working. The one missing piece
was in the "options" section of the HTMLParser.jj file. The original grammar
file generates an input character stream class that treats the input as a
stream of 1-byte characters. To have JavaCC generate a stream class that
handles double-byte characters, you need the option UNICODE_INPUT=true.
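
For reference, here's roughly what those two HTMLParser.jj changes look like.
The pipe field names in getReader() are from memory of the demo source, so
treat this as a sketch rather than an exact patch:

	options {
	  // Generate a character stream that reads 16-bit (Unicode)
	  // characters instead of treating the input as 1-byte characters.
	  UNICODE_INPUT = true;
	  // ... remaining options unchanged ...
	}

	// In getReader(), name the encoding explicitly on both ends of the
	// pipe instead of relying on the platform default (field names here
	// are approximate):
	pipeIn = new InputStreamReader(pipeInStream, "UTF8");
	pipeOut = new OutputStreamWriter(pipeOutStream, "UTF8");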

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to options section; add explicit
"UTF8" encoding on Reader and Writer creation in getReader(). As far as I
can tell, these changes work fine for all of the languages I need to handle,
which are English, French, German, and Japanese.

HTMLDocument - add an explicit "SJIS" encoding when creating the Reader used
to create the HTMLParser. (For the western languages, I use "ISO8859_1".)

And of course, use the right tokenizer for the language (CJKAnalyzer for the
Japanese documents).
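
Putting it together, the indexing side ends up looking something like this.
This is just a sketch; the demo's HTMLDocument does more than this, and the
per-document encoding choice (the hypothetical isJapanese flag) is only for
illustration:

	// Decode the source bytes with the right charset so the parser
	// only ever sees Unicode.
	String enc = isJapanese ? "SJIS" : "ISO8859_1";
	InputStreamReader reader =
	    new InputStreamReader(new FileInputStream(f), enc);
	HTMLParser parser = new HTMLParser(reader);

	Document doc = new Document();
	doc.add(Field.Text("contents", parser.getReader()));

	// Index with an analyzer that tokenizes the language sensibly,
	// e.g. CJKAnalyzer for the Japanese documents.
	IndexWriter writer = new IndexWriter("index", new CJKAnalyzer(), true);
	writer.addDocument(doc);
	writer.close();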

--Jon

<earlier responses snipped; see the list archive>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

