lucene-java-user mailing list archives

From KK <dioxide.softw...@gmail.com>
Subject Re: Posting unicode data to lucene not working during searching/retreival!
Date Thu, 21 May 2009 10:01:10 GMT
I made all the changes, but there is no improvement. I think the data is
being indexed properly, because I can see the results in Luke, which has
options for viewing results in both UTF-8 encoding and the default string
encoding. I tried both and saw no difference: in both cases I can see the
regional text, but not through the browser. How do I decode the text when
fetching the search results through the searcher?

Thanks
KK
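
For the browser side of the question above, a minimal sketch may help. Strings
returned by the searcher are already Unicode, so there is nothing to decode;
the point where things can go wrong is when the String is encoded to bytes for
the response. The class and method names here (Utf8ResponseSketch, writeHit)
are hypothetical, and a ByteArrayOutputStream stands in for the real response
stream. In a servlet one would presumably also declare the charset, e.g. with
response.setContentType("text/html; charset=UTF-8"), before writing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class Utf8ResponseSketch {
    public static final Charset UTF8 = Charset.forName("UTF-8");

    // Hypothetical helper: encodes a search hit to the output stream as UTF-8
    // instead of relying on the platform default encoding.
    public static void writeHit(OutputStream out, String hitText) throws IOException {
        Writer writer = new OutputStreamWriter(out, UTF8);
        writer.write(hitText);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // The Hindi word "हिन्दी", written with escapes so the source file
        // encoding does not matter.
        String hit = "\u0939\u093f\u0928\u094d\u0926\u0940";

        // Stand-in for the HTTP response body.
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        writeHit(body, hit);

        // A client that decodes the bytes as UTF-8 gets the same String back.
        String seenByClient = new String(body.toByteArray(), UTF8);
        System.out.println(seenByClient.equals(hit)); // prints "true"
    }
}
```

If the bytes are correct but the browser still shows garbage, the charset
declared in the Content-Type header or in the HTML meta tag is the next thing
to check.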

On Thu, May 21, 2009 at 1:05 PM, KK <dioxide.software@gmail.com> wrote:

> Thanks @Uwe.
> #To answer the question in your last mail: textOnly is the output of the
> method downloadPage(), i.e. the complete page text, including all HTML tags
> etc.
> #Instead of doing the encode/decode later, what I should do is specify the
> charset as UTF-8 when downloading the page through the BufferedReader, as you
> mentioned in your last mail. So instead of
>  BufferedReader reader =
>                     new BufferedReader(new InputStreamReader(
>                     pageUrl.openStream()));
>
> I should do this:
> BufferedReader reader =
>                     new BufferedReader(new InputStreamReader(
>                      pageUrl.openStream(), Charset.forName("UTF-8")));
>
> Right? And remove this conversion that I'm doing later:
>
> byte [] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>
> This will make sure I'm not depending on the platform encoding, right? This
> seems to fix my indexing issue. Regarding searching, I don't need to specify
> any charset there, since I'm using the StandardAnalyzer? As far as I know,
> Lucene stores the characters as raw Unicode, so when I present my query in
> the same Unicode format, Lucene will give me proper results. Currently I'm
> not setting the encoding for HTTP parameters; I'll do that and let you know.
> Thank you very much.
>
> KK
>
>
> On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>
>> I forgot:
>>
>> > byte [] utfEncodeByteArray = textOnly.getBytes();
>> > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>> >
>> > here textonly is the text extracted from the downloaded page
>>
>> What is textOnly here? A String? If yes, why encode it and then decode it
>> again? The important thing is:
>> Strings in Java are always independent of charsets (internally they are
>> UTF-16). So if you convert a byte array to a String, you have to specify a
>> charset (as you have done in the new String code). If you convert a String
>> to a byte array, you must do the same.
>>
>> As mentioned in the previous mail, the same is true when converting
>> InputStreams to Readers and Writers to OutputStreams (this can be done
>> using the converter classes InputStreamReader and OutputStreamWriter).
>>
>> And: if you get a String from somewhere that looks bad, you generally
>> cannot repair it by converting the String to another encoding; it was
>> already corrupted during the earlier conversion to a String.
>>
>> E.g. in a web application, use ServletRequest.setCharacterEncoding() to
>> specify the input encoding of the HTTP parameters, and so on.
>>
>> Uwe
>
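
The two points made in the quoted mail (give the charset when turning bytes
into characters, and a String decoded with the wrong charset generally cannot
be repaired afterwards) can be sketched as follows. This is only an
illustration under stated assumptions: the class name ExplicitCharsetSketch
and the helper readAll are hypothetical, an in-memory byte array stands in for
pageUrl.openStream(), and US-ASCII plays the role of a wrong platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ExplicitCharsetSketch {
    public static final Charset UTF8 = Charset.forName("UTF-8");
    public static final Charset ASCII = Charset.forName("US-ASCII");

    // Hypothetical helper: reads the whole stream, decoding bytes with the
    // given charset rather than the platform default.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // The Hindi word "हिन्दी" as it would arrive from a UTF-8 encoded page.
        String original = "\u0939\u093f\u0928\u094d\u0926\u0940";
        byte[] pageBytes = original.getBytes(UTF8);

        // Decoding with the charset the page was written in keeps the text intact.
        String good = readAll(new ByteArrayInputStream(pageBytes), UTF8);
        System.out.println(good.equals(original)); // prints "true"

        // Decoding with the wrong charset replaces every non-ASCII byte, so the
        // original bytes are lost at this point.
        String bad = readAll(new ByteArrayInputStream(pageBytes), ASCII);

        // Re-encoding and decoding the damaged String does not bring them back.
        String repaired = new String(bad.getBytes(ASCII), UTF8);
        System.out.println(repaired.equals(original)); // prints "false"
    }
}
```

The getBytes()/new String(...) round trip from the earlier mail fails in the
same way: by the time textOnly exists as a String, any damage from the wrong
decoder has already happened.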
