lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Posting unicode data to lucene not working during searching/retreival!
Date Thu, 21 May 2009 07:20:36 GMT
I forgot:

> byte [] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-
> 8"));
> 
> here textonly is the text extracted from the downloaded page

What is textonly here? A String, if yes, why decode and then again encode
it? The important thing is:
Strings in Java are always invariant to charsets (internally they are
UTF-16). So if you convert a byte array to a string you have to specify a
charset (as you have done in new String code). If you convert a String to a
byte array, you must do the same.

As mentioned in the mail before, the same is true, when converting
InputStreams to Readers and Writers to OutputStreams (this can be done using
the converter).

And: If you get a String from somewhere, that looks bad, you cannot convert
the String to another encoding, it was corrupted during conversion to string
before.

E.g. in a WebAppclcation, use ServletRequest.setEncoding() to specify the
input encoding of the HTTP parameters and so on.

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message