lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Posting unicode data to lucene not working during searching/retreival!
Date Thu, 21 May 2009 12:20:08 GMT
Hi KK,

> right? and remove this conversion that I'm doing later ,
> 
> byte [] utfEncodeByteArray = textOnly.getBytes();
>  String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-
>  8"));
> 
> This will make sure I'm not depending on the platform encoding, right?

In principle, yes. This is because you encode the binary bytes to a
wrong-encoded stream in the platform encoding, then you decode that stream
again and reencode it using UTF-8. This works, as long as you will not loose
chars through this conversion!

> This
> seems to fix my indexing issue. Now regarding searching I dont need to
> mention any charset thing there, I'm using stardard anyalyzer? As I know
> lucene stores the chars as raw unicode so when I present my query in the
> same unicode format lucene will give me proper results. Currently I'm not
> using the encoding for HTTP parameters, I'll use that and let you know.
> Thank you very much.
> 
> KK,
> 
> On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> 
> > I forgot:
> >
> > > byte [] utfEncodeByteArray = textOnly.getBytes();
> > > String utfString = new String(utfEncodeByteArray,
> Charset.forName("UTF-
> > > 8"));
> > >
> > > here textonly is the text extracted from the downloaded page
> >
> > What is textonly here? A String, if yes, why decode and then again
> encode
> > it? The important thing is:
> > Strings in Java are always invariant to charsets (internally they are
> > UTF-16). So if you convert a byte array to a string you have to specify
> a
> > charset (as you have done in new String code). If you convert a String
> to a
> > byte array, you must do the same.
> >
> > As mentioned in the mail before, the same is true, when converting
> > InputStreams to Readers and Writers to OutputStreams (this can be done
> > using
> > the converter).
> >
> > And: If you get a String from somewhere, that looks bad, you cannot
> convert
> > the String to another encoding, it was corrupted during conversion to
> > string
> > before.
> >
> > E.g. in a WebAppclcation, use ServletRequest.setEncoding() to specify
> the
> > input encoding of the HTTP parameters and so on.
> >
> > Uwe
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message