lucene-java-user mailing list archives

From Jon Schuster
Subject Problems indexing Japanese with CJKAnalyzer
Date Fri, 02 Jul 2004 20:49:19 GMT

I've gone through all of the past messages regarding the CJKAnalyzer, but I
must still be doing something wrong because my searches don't work.

I'm using the IndexHTML application from the org.apache.lucene.demo package
to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
I've also tried with and without setting the file.encoding system property
to Shift-JIS. When I index the HTML files, which are encoded in Shift-JIS,
without converting them to Unicode, I get assorted "Parse Aborted: Lexical
error..." messages. I've also tried converting the Shift-JIS HTML files to
Unicode by first running them through the native2ascii tool.

When the files are converted via native2ascii, they index without errors,
but the index appears to contain the Unicode escapes as literal strings
such as "u7aef", "u7af6", etc., rather than the Japanese characters
themselves. Searching for an English word produces results that contain
text like "code \u5c5e\u6027".
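
To illustrate the symptom, here is a minimal sketch using only the JDK (it
is not the demo's code, and the class and method names are hypothetical).
It assumes the running JDK provides the Shift_JIS charset. It contrasts
decoding Shift-JIS bytes through a Reader with an explicit charset against
an approximation of what native2ascii writes, which is plain ASCII escape
text rather than Japanese characters:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of Lucene or its demo package.
class ShiftJisDecodeSketch {

    // Decode raw bytes into Java's internal Unicode chars using an explicit
    // charset, instead of relying on the platform default (file.encoding).
    static String decode(byte[] raw, String charsetName) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(raw), Charset.forName(charsetName))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Approximation of native2ascii output: every non-ASCII char becomes a
    // six-character ASCII escape (backslash, 'u', four hex digits).
    static String toAsciiEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two kanji written as source-level escapes; the compiler turns
        // them into real characters before the program runs.
        String original = "code \u5c5e\u6027";
        byte[] sjis = original.getBytes(Charset.forName("Shift_JIS"));

        String decoded = decode(sjis, "Shift_JIS");
        String escaped = toAsciiEscapes(decoded);

        System.out.println("round trip ok: " + decoded.equals(original));
        System.out.println("decoded chars: " + decoded.length());
        System.out.println("escaped text:  " + escaped);
    }
}
```

The escaped form is ordinary ASCII text, so anything that reads it as file
content sees letters and digits such as "u5c5e" instead of a kanji. That
would be consistent with the literal strings showing up in the index,
though whether it is the actual cause here is an assumption.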

Since others have gotten Japanese indexing to work, what's the secret I'm

