lucene-java-user mailing list archives

From Jon Schuster <>
Subject RE: Problems indexing Japanese with CJKAnalyzer ... Or French with UTF-8 and MetaData
Date Fri, 16 Jul 2004 22:02:03 GMT
If you're specifying the correct encoding in HTMLDocument when you create an
InputStreamReader for the HTML file, and if you're specifying UTF8 as the
encoding for the InputStreamReader and OutputStreamWriter in
HTMLParser.getReader, I don't see how the meta tag data would have a
different encoding than the other content that gets indexed.
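The mismatch is easy to reproduce in isolation. A minimal sketch (a ByteArrayInputStream stands in for the HTML file; "UTF8" is the legacy JDK alias for UTF-8):

```java
import java.io.*;

public class ReaderEncodingDemo {
    public static void main(String[] args) throws IOException {
        // UTF-8 bytes for "Propriété", as they would sit in the HTML file.
        byte[] fileBytes = "Propriété".getBytes("UTF8");

        // Reading with an explicit "UTF8" charset, as HTMLDocument should do.
        BufferedReader utf8Reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), "UTF8"));
        System.out.println(utf8Reader.readLine());   // Propriété

        // Reading the same bytes as Latin-1 mangles the accents.
        BufferedReader latinReader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), "ISO-8859-1"));
        System.out.println(latinReader.readLine());  // PropriÃ©tÃ©
    }
}
```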

The stuff in the JDK docs about Properties being stored as 8859-1 applies
only when properties are saved to or loaded from a stream using the
Properties.store and Properties.load methods. In Lucene, meta tag information
is stored in a Properties structure only while parsing and tokenizing.

Some strategically placed System.out.printlns should let you see if the meta
tag strings are what you think they are.
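For example, a small helper (hypothetical, not part of the demo) that dumps each char as hex makes a mis-decoded string obvious at a glance:

```java
public class CharDump {
    // Render each char of the string as its hex code point.
    static String dump(String s) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded "été" is three chars: e9 74 e9.
        System.out.println(dump("été"));  // e9 74 e9
    }
}
```

If the meta tag value was decoded with the wrong charset, you will see two bytes (c3 a9) where one char (e9) should be.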


-----Original Message-----
From: Bruno Tirel [] 
Sent: Thursday, July 15, 2004 8:07 AM
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer ... Or French with
UTF-8 and MetaData

I don't think I understand your proposal correctly.
As a basis, I am using Demo3 with IndexHTML, HTMLDocument and HTMLParser.
Inside the HTML parser, I am calling getMetaTags (which calls addMetaData)
and which returns a Properties object. My issue comes from this definition:
Properties are stored in ISO-8859-1 encoding, while all my data encodings
inside and outside are "UTF-8".
I have not succeeded in getting UTF-8 values from this Parser.getMetaTags()
through any conversion.
These data are extracted from an HTML page, with UTF-8 encoding declared at
the beginning of the file.
I do not see how to call request.setEncoding("UTF-8"): I need the Parser
to have knowledge of the UTF-8 encoding... and that doesn't appear when
using a Properties object.
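For what it's worth, the ISO-8859-1 restriction applies only to the .properties file format on disk, not to values held in memory. A quick sketch showing that an in-memory Properties value is an ordinary Unicode String:

```java
import java.util.Properties;

public class PropsDemo {
    public static void main(String[] args) {
        Properties p = new Properties();
        // In memory, a Properties value is a plain java.lang.String --
        // full Unicode, no ISO-8859-1 restriction. The 8859-1 escaping
        // only happens inside Properties.store()/Properties.load().
        p.setProperty("title", "Propriété");
        System.out.println(p.getProperty("title"));  // Propriété
    }
}
```

So if a meta tag value comes out mangled, the damage happened when the bytes were decoded into the String, before it ever reached the Properties object.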

Any feedback?

-----Original Message-----
From: Praveen Peddi []
Sent: Thursday, July 15, 2004 3:12 PM
To: Lucene Users List
Subject: Re: Problems indexing Japanese with CJKAnalyzer

If it's a web application, you have to call request.setEncoding("UTF-8")
before reading any parameters. Also make sure the HTML page encoding is
specified as "UTF-8" in the meta tag. Most web app servers decode the
request parameters in the system's default encoding. If you call the above
method, I think it will solve your problem.
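(Note: the Servlet 2.3 method is actually named setCharacterEncoding(), and it must run before the first getParameter() call.) The underlying issue can be sketched with URLDecoder, where the charset argument plays the same role as the request encoding:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class RequestDecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "été" submitted from a UTF-8 page arrives percent-encoded like this:
        String raw = "%C3%A9t%C3%A9";

        // A container defaulting to ISO-8859-1 mangles the value ...
        System.out.println(URLDecoder.decode(raw, "ISO-8859-1")); // Ã©tÃ©
        // ... while an explicit UTF-8 decode recovers the original.
        System.out.println(URLDecoder.decode(raw, "UTF-8"));      // été
    }
}
```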

----- Original Message -----
From: "Bruno Tirel" <>
To: "'Lucene Users List'" <>
Sent: Thursday, July 15, 2004 6:15 AM
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi All,

I am also trying to localize everything for a French application, using
UTF-8 encoding. I have already applied what Jon described. I fully confirm
his recommendation of the HTMLParser and HTMLDocument changes with UNICODE
and "UTF-8" encoding specification.

In my case, one piece is still not functional: using meta-data from an HTML
document, as in the demo3 example. Whether I try converting to "UTF-8" or
to "ISO-8859-1", it is still not correctly encoded when I check with Luke.
The word "Propriété" is seen either as "Propri?t?" with a square, or as
other garbled characters. My local codepage is Cp1252, so it should be
viewed as ISO-8859-1. Same result when I use the local file.encoding
parameter. All the other fields are correctly encoded as UTF-8, tokenized,
and successfully searched through a JSP page.

Is anybody already facing this issue? Any help available?
Best regards,


-----Original Message-----
From: Jon Schuster []
Sent: Wednesday, July 14, 2004 10:51 PM
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things
working, and here's an update so that other folks might have an easier time
in similar situations.

The problem I had was indeed with the encoding, but it was more than just
the encoding on the initial creation of the HTMLParser (from the Lucene demo
package). In HTMLDocument, doing this:

    InputStreamReader reader =
        new InputStreamReader(new FileInputStream(f), "SJIS");
    HTMLParser parser = new HTMLParser(reader);

creates the parser and feeds it Unicode from the original Shift-JIS encoded
document, but then when the document contents are fetched using this line:

Field fld = Field.Text("contents", parser.getReader() );

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter
using the default encoding, which in my case was Windows 1252 (essentially
Latin-1). That was bad.
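That mismatch can be reproduced outside the parser. A minimal sketch, with a plain byte array standing in for the piped streams getReader() actually uses:

```java
import java.io.*;

public class PipeEncodingDemo {
    public static void main(String[] args) throws IOException {
        String text = "Propriété";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();

        // Write with an explicit "UTF8", as the patched getReader() does.
        Writer w = new OutputStreamWriter(buf, "UTF8");
        w.write(text);
        w.close();

        // Reading back with the same charset round-trips losslessly ...
        System.out.println(new String(buf.toByteArray(), "UTF8"));          // Propriété
        // ... but reading with a Windows-1252 platform default shows the bug.
        System.out.println(new String(buf.toByteArray(), "windows-1252"));  // PropriÃ©tÃ©
    }
}
```

Whenever the Writer and Reader disagree on the charset, every non-ASCII character is corrupted in exactly this way.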

In the HTMLParser.jj grammar file, adding an explicit encoding of "UTF8" on
both the Reader and Writer got things mostly working. The one missing piece
was in the "options" section of the HTMLParser.jj file. The original grammar
file generates an input character stream class that treats the input as a
stream of 1-byte characters. To have JavaCC generate a stream class that
handles double-byte characters, you need the option UNICODE_INPUT=true.
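As a sketch, the options section might then look like this (UNICODE_INPUT is the addition described above; any other options shown in the demo's actual grammar file would simply remain as they are):

```java
options {
  // Generate a character stream class that reads 16-bit chars
  // instead of treating the input as 1-byte characters.
  UNICODE_INPUT = true;
}
```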

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to the options section; add an
explicit "UTF8" encoding on the Reader and Writer creation in getReader().
As far as I can tell, this change works fine for all of the languages I
need to handle, which are English, French, German, and Japanese.

HTMLDocument - add explicit encoding of "SJIS" when creating the Reader used
to create the HTMLParser. (For western languages, I use encoding of

And of course, use the right language tokenizer.


<earlier responses snipped; see the list archive>

To unsubscribe, e-mail:
For additional commands, e-mail:
