lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Praveen Peddi" <ppe...@contextmedia.com>
Subject Re: Problems indexing Japanese with CJKAnalyzer
Date Thu, 15 Jul 2004 13:12:24 GMT
If its a web application, you have to cal request.setEncoding("UTF-8")
before reading any parameters. Also make sure html page encoding is
specified as "UTF-8" in the metatag. most web app servers decode the request
paramaters in the system's default encoding algorithm. If u call above
method, I think it will solve ur problem.

Praveen
----- Original Message ----- 
From: "Bruno Tirel" <bruno.tirel@club-internet.fr>
To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
Sent: Thursday, July 15, 2004 6:15 AM
Subject: RE: Problems indexing Japanese with CJKAnalyzer


Hi All,

I am also trying to localize everything for French application, using UTF-8
encoding. I have already applied what Jon described. I fully confirm his
recommandation for HTML Parser and HTML Document changes with UNICODE and
"UTF-8" encoding specification.

In my case, I have still one case not functional : using meta-data from HTML
document, as in demo3 example. Trying to convert to "UTF-8", or
"ISO-8859-1", it is still not correctly encoded when I check with Luke.
A word "Propriété" is seen either as "Propri?t?" with a square, or as
"Propriã©tã©".
My local codepage is Cp1252, so should be viewed as ISO-8859-1. Same result
when I use "local FileEncoding parameter.
All the other fields are correctly encoded into UTF-8, tokenized and
successfully searched through JSP page.

Is anybody already facing this issue? Any help available?
Best regards,

Bruno


-----Message d'origine-----
De : Jon Schuster [mailto:jons@wrq.com]
Envoyé : mercredi 14 juillet 2004 22:51
À : 'Lucene Users List'
Objet : RE: Problems indexing Japanese with CJKAnalyzer

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things
working, and here's an update so that other folks might have an easier time
in similar situations.

The problem I had was indeed with the encoding, but it was more than just
the encoding on the initial creation of the HTMLParser (from the Lucene demo
package). In HTMLDocument, doing this:

InputStreamReader reader = new InputStreamReader( new
FileInputStream(f), "SJIS");
HTMLParser parser = new HTMLParser( reader );

creates the parser and feeds it Unicode from the original Shift-JIS encoding
document, but then when the document contents is fetched using this line:

Field fld = Field.Text("contents", parser.getReader() );

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter
using the default encoding, which in my case was Windows 1252 (essentially
Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of "UTF8" on
both the Reader and Writer got things mostly working. The one missing piece
was in the "options" section of the HTMLParser.jj file. The original grammar
file generates an input character stream class that treats the input as a
stream of 1-byte characters. To have JavaCC generate a stream class that
handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to options section; add explicit
"UTF8" encoding on Reader and Writer creation in getReader(). As far as I
can tell, this changes works fine for all of the languages I need to handle,
which are English, French, German, and Japanese.

HTMLDocument - add explicit encoding of "SJIS" when creating the Reader used
to create the HTMLParser. (For western languages, I use encoding of
"ISO8859_1".)

And of course, use the right language tokenizer.

--Jon

<earlier responses snipped; see the list archive>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message