lucene-java-user mailing list archives

From KK <dioxide.softw...@gmail.com>
Subject Re: Posting unicode data to lucene not working during searching/retreival!
Date Thu, 21 May 2009 12:22:43 GMT
Thank you very much. As you told me I just added a single line in the jsp
page mentioning the charset as utf-8 and it worked like a charm. Thank you.

KK

On Thu, May 21, 2009 at 5:47 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> If you print the result, e.g. to a web page through the servlet API, the
> output is written as ISO-8859-1 (which is the default for HTTP). If you want
> to change this, you must tell the servlet layer the encoding before getting
> a PrintWriter: response.setCharacterEncoding("UTF-8") or
> response.setContentType("text/html; charset=UTF-8"). Or just get the
> ServletOutputStream and convert using an OutputStreamWriter just as before.
> But you then have to tell the browser the encoding (which is done through
> the Content-Type header). None of this is Lucene-specific, so you should ask
> on the list for Tomcat, Jetty, or whatever container you use.
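
A minimal sketch of the OutputStreamWriter route described above, assuming
nothing beyond the JDK. A ByteArrayOutputStream stands in for the
ServletOutputStream, and the class name, method name, and sample text are
made up for illustration. It shows what happens to non-Latin text when the
writer encodes as ISO-8859-1 versus UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ResponseEncodingSketch {

    // Encodes text into bytes through an OutputStreamWriter with an explicit
    // charset. The in-memory sink is a stand-in for the servlet response stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Six Devanagari characters, used here as a sample search hit.
        String hit = "\u0939\u093F\u0928\u094D\u0926\u0940";

        // ISO-8859-1 cannot represent these characters, so each becomes '?'.
        byte[] latin1 = write(hit, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // prints ??????

        // UTF-8 can represent them, so decoding the bytes gives the text back.
        byte[] utf8 = write(hit, StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(hit)); // prints true
    }
}
```

In a real servlet the bytes would go to the response stream, and the
Content-Type header would still have to name the same charset so the browser
decodes them correctly.
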
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: KK [mailto:dioxide.software@gmail.com]
> > Sent: Thursday, May 21, 2009 7:01 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Posting unicode data to lucene not working during
> > searching/retreival!
> >
> > I made all the changes, but there is no improvement. The data is getting
> > indexed properly, I think, because I'm able to see the results through
> > Luke, and Luke has an option for viewing the results in both UTF-8
> > encoding and the default string encoding. I tried both, and there is no
> > difference: in both cases I'm able to see the regional text, but not
> > through the browser. How do I decode when fetching the search results
> > through the searcher?
> >
> > Thanks
> > KK
> >
> > On Thu, May 21, 2009 at 1:05 PM, KK <dioxide.software@gmail.com> wrote:
> >
> > > Thanks @Uwe.
> > > # To answer your last mail's query: textOnly is the output of the method
> > > downloadPage(), the complete text including all HTML tags etc.
> > > # Instead of doing the encode/decode later, what I should do is specify
> > > the charset as UTF-8 when downloading the page through the
> > > BufferedReader, as you mentioned in your last mail. So instead of
> > >  BufferedReader reader =
> > >                     new BufferedReader(new InputStreamReader(
> > >                     pageUrl.openStream()));
> > >
> > > I should do this,
> > > BufferedReader reader =
> > >                     new BufferedReader(new InputStreamReader(
> > >                     pageUrl.openStream(), Charset.forName("UTF-8")));
> > >
> > > right? And remove this conversion that I'm doing later,
> > >
> > > byte[] utfEncodeByteArray = textOnly.getBytes();
> > > String utfString = new String(utfEncodeByteArray,
> > >                     Charset.forName("UTF-8"));
> > >
> > > This will make sure I'm not depending on the platform encoding, right?
> > > This seems to fix my indexing issue. Now, regarding searching: I don't
> > > need to specify any charset there, do I? I'm using the StandardAnalyzer.
> > > As far as I know, Lucene stores the characters as raw Unicode, so when I
> > > present my query in the same Unicode format Lucene will give me proper
> > > results. Currently I'm not setting the encoding for HTTP parameters;
> > > I'll do that and let you know.
> > > Thank you very much.
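
The reader change discussed above can be sketched as follows. This is only an
illustration: a ByteArrayInputStream stands in for pageUrl.openStream(), the
class and method names are hypothetical, and the page is assumed to be
UTF-8 (a real crawler would take the charset from the page's Content-Type
header or meta tag).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageReaderSketch {

    // Decodes a byte stream into a String with an explicit charset, so the
    // result does not depend on the platform default encoding.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A tiny sample page containing Devanagari text, served as UTF-8 bytes.
        String page = "<p>\u0939\u093F\u0928\u094D\u0926\u0940</p>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));

        // The decoded String equals the original, so it is safe to index as-is.
        System.out.println(readAll(in, StandardCharsets.UTF_8).equals(page)); // prints true
    }
}
```

With the charset fixed at the reader, the later getBytes()/new String(...)
step has nothing left to repair and can be dropped.
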
> > >
> > > KK,
> > >
> > >
> > > On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > >
> > >> I forgot:
> > >>
> > >> > byte[] utfEncodeByteArray = textOnly.getBytes();
> > >> > String utfString = new String(utfEncodeByteArray,
> > >> >                     Charset.forName("UTF-8"));
> > >> >
> > >> > here textonly is the text extracted from the downloaded page
> > >>
> > >> What is textOnly here? A String? If yes, why encode it and then decode
> > >> it again? The important thing is:
> > >> Strings in Java are independent of any charset (internally they are
> > >> UTF-16). So if you convert a byte array to a String, you have to specify
> > >> a charset (as you have done in the new String code). If you convert a
> > >> String to a byte array, you must do the same.
> > >>
> > >> As mentioned in the previous mail, the same is true when converting
> > >> InputStreams to Readers and Writers to OutputStreams (this can be done
> > >> using the converter classes).
> > >>
> > >> And: if you get a String from somewhere that looks bad, you cannot
> > >> repair it by converting the String to another encoding; it was already
> > >> corrupted during the earlier conversion to a String.
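
A small sketch of the point about corrupted Strings, using only the JDK. The
class and method names are made up, and US-ASCII is picked as the wrong
charset because it is clearly lossy; some mismatches happen to be reversible,
but that cannot be relied on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTripSketch {

    // Decodes bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u0939\u093F\u0928\u094D\u0926\u0940";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Correct: decode with the same charset that produced the bytes.
        String good = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(good.equals(original)); // prints true

        // Wrong: every non-ASCII byte is replaced by U+FFFD during decoding,
        // so the original byte values are gone from the String.
        String bad = decode(utf8, StandardCharsets.US_ASCII);
        System.out.println(bad.equals(original)); // prints false

        // Re-encoding the damaged String and decoding it "properly" does not
        // bring the text back, because the information was already lost.
        String repaired = decode(bad.getBytes(StandardCharsets.US_ASCII),
                                 StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // prints false
    }
}
```
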
> > >>
> > >> E.g. in a web application, use ServletRequest.setCharacterEncoding() to
> > >> specify the input encoding of the HTTP parameters, and so on.
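
To illustrate why the input encoding of HTTP parameters matters, here is a
rough sketch that does not use the servlet API at all. java.net.URLDecoder
stands in for the container's parameter decoding, and the class name, method
name, and sample value are assumptions made for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class ParameterDecodingSketch {

    // Decodes a percent-encoded parameter value with the named charset.
    static String decodeParam(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The single character U+0939, sent by a browser as UTF-8 bytes.
        String raw = "%E0%A4%B9";

        // Decoded as UTF-8, the three bytes become the one intended character.
        System.out.println(decodeParam(raw, "UTF-8").equals("\u0939")); // prints true

        // Decoded as ISO-8859-1, each byte becomes its own character, so the
        // query term no longer matches what was indexed.
        System.out.println(decodeParam(raw, "ISO-8859-1").length()); // prints 3
    }
}
```
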
> > >>
> > >> Uwe
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>
> > >>
> > >
