lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <dioxide.softw...@gmail.com>
Subject Re: Posting unicode data to lucene not working during searching/retreival!
Date Thu, 21 May 2009 07:35:29 GMT
Thanks @Uwe.
#To answer your last mails query, textOnly is the output of the method
downloadPage(), complete text thing includeing all html tags etc...
#Instead of doing the encode/decode later, what i should do is when
downloading the page through buffered reader put the charset as utf-8 as you
mentioned in your last mail. so instead of
BufferedReader reader =
                    new BufferedReader(new InputStreamReader(
                    pageUrl.openStream()));

I should do this,
BufferedReader reader =
                    new BufferedReader(new InputStreamReader(
                    pageUrl.openStream(), <mention the charset like
Charset.forName("UTF-8")>));

right? and remove this conversion that I'm doing later ,

byte [] utfEncodeByteArray = textOnly.getBytes();
 String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-
 8"));

This will make sure I'm not depending on the platform encoding, right? This
seems to fix my indexing issue. Now regarding searching I dont need to
mention any charset thing there, I'm using stardard anyalyzer? As I know
lucene stores the chars as raw unicode so when I present my query in the
same unicode format lucene will give me proper results. Currently I'm not
using the encoding for HTTP parameters, I'll use that and let you know.
Thank you very much.

KK,

On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> I forgot:
>
> > byte [] utfEncodeByteArray = textOnly.getBytes();
> > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-
> > 8"));
> >
> > here textonly is the text extracted from the downloaded page
>
> What is textonly here? A String, if yes, why decode and then again encode
> it? The important thing is:
> Strings in Java are always invariant to charsets (internally they are
> UTF-16). So if you convert a byte array to a string you have to specify a
> charset (as you have done in new String code). If you convert a String to a
> byte array, you must do the same.
>
> As mentioned in the mail before, the same is true, when converting
> InputStreams to Readers and Writers to OutputStreams (this can be done
> using
> the converter).
>
> And: If you get a String from somewhere, that looks bad, you cannot convert
> the String to another encoding, it was corrupted during conversion to
> string
> before.
>
> E.g. in a WebAppclcation, use ServletRequest.setEncoding() to specify the
> input encoding of the HTTP parameters and so on.
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message