lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Grant Ingersoll" <>
Subject Re: Problems indexing Japanese with CJKAnalyzer
Date Tue, 06 Jul 2004 17:57:18 GMT

Java expects your files to be in the encoding of the Native Locale.  In most cases in the
U.S., this will be English.  If you want to read files in that are in a different encoding,
you have to tell Java what your encoding is, in this case, Shift JIS.  See the javadocs for

Thus, you will most likely have to alter the Lucene demo to load your document in a different
way.  If you look at HTMLDocument, my guess is you can replace the construction of the HTMLParser
with something like (somewhere around line 63):

HTMLParser parser = new HTMLParser(new InputStreamReader(new FileInputStream(file), "SJIS"));
//Not sure if "SJIS" is the right moniker for Shift-Jis.

I have not tested this, but I recall do this when I first started to be able to index my UTF-8

I am not exactly sure if SJIS is the abbreviated form for Shift-JIS. It will throw a UnsupportedEncodinException
if it is not right.  In that case, you can look up in the Java documentation to see what is

Btw, do this with your original files, which are in Shift-JIS.

>>> 07/06/04 12:53PM >>>
Hi Jon,

It sounds to me like you have a character encoding problem.  The 
native2ascii tool is designed to produce input for the Java compiler; 
the "\u7aef" notation you're seeing is understood by Java string 
interpreters to mean the corresponding hexadecimal Unicode code point. 
Other Java programs, however, depending on their implementation, may not 
understand this notation.  Alternatively, maybe the notation is 
understood, but the conversion from Shift-JIS to Java Unicode format is 
not being performed properly; if you don't tell native2ascii the source 
encoding, it will assume the "native" encoding for the platform--on 
Windows, depending on which localized version you've got, this is likely 
to be the so-called code page 1252 (ISO-8859-1 with a few 
modifications).  Converting from one character encoding to another with 
incorrect assumptions about the source encoding can only lead to sorrow 
and confusion.

I think you can use the native2ascii tool to do what you want 
(untested), but it will take two passes:

1. Use native2ascii to convert your file(s) to Java Unicode format, but 
tell it the source encoding:

    native2ascii -encoding SJIS inputfile outputfile1

2. Tell it to convert from Java Unicode format to UTF-8:

    native2ascii -reverse -encoding UTF8 outputfile1 finaloutput

Here's a web page with more information on native2ascii:


Hope it helps,
Steve Rowe

Jon Schuster wrote:
> I've gone through all of the past messages regarding the CJKAnalyzer but I
> still must be doing something wrong because my searches don't work.
> I'm using the IndexHTML application from the org.apache.lucene.demo package
> to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
> I've also tried with and without setting the file.encoding to Shift-JIS.
> I've tried indexing the HTML files, which contain Shift-JIS, without
> conversion to Unicode and I get assorted "Parse Aborted: Lexical error..."
> messages. I've also tried converting the Shift-JIS HTML files to Unicode by
> first running them through the native2ascii tool.
> When the files are converted via native2ascii, they index without errors,
> but the index appears to contain the Unicode characters as literal strings
> such as "u7aef", "u7af6", etc. Searching for an English word produces
> results that have text like "code \u5c5e\u6027".
> Since others have gotten Japanese indexing to work, what's the secret I'm
> missing?
> Thanks,
> Jon

To unsubscribe, e-mail: 
For additional commands, e-mail: 

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message