From: "Uwe Schindler"
Reply-To: java-user@lucene.apache.org
Subject: RE: Posting unicode data to lucene not working during searching/retreival!
Date: Thu, 21 May 2009 21:17:45 +0900
Message-ID: <11BE73DDCBE14F9697AB1794EE4D668D@VEGA>
In-Reply-To: <8db6d74a0905210301j5f14a7cag4b6c2fc381ee5db3@mail.gmail.com>

If you print the result, e.g. to a web page through the servlet API, the
output is written as ISO-8859-1 (which is the default for HTTP). If you
want to change this, you must tell the servlet layer the encoding before
getting a PrintWriter: call response.setCharacterEncoding("UTF-8") or
response.setContentType("text/html; charset=UTF-8"). Alternatively, get
the ServletOutputStream and wrap it in an OutputStreamWriter with an
explicit charset, just as before. Either way, you have to tell the
browser the encoding, which is what the Content-Type header does.

None of this is Lucene-specific, so you should ask on the list of
whatever servlet container you use (Tomcat, Jetty, ...).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: KK [mailto:dioxide.software@gmail.com]
> Sent: Thursday, May 21, 2009 7:01 PM
> To: java-user@lucene.apache.org
> Subject: Re: Posting unicode data to lucene not working during
> searching/retreival!
>
> I made all the changes, but there is no improvement. The data is
> getting indexed properly, I think, because I can see the results in
> Luke, and Luke has an option for viewing the results in both UTF-8
> encoding and the default string encoding. I tried both, but there was
> no difference.
> In both cases I can see the regional text, but not through the
> browser. How do I decode when fetching the search results through the
> searcher?
>
> Thanks
> KK
>
> On Thu, May 21, 2009 at 1:05 PM, KK wrote:
>
> > Thanks @Uwe.
> > #To answer your last mail's query, textOnly is the output of the
> > method downloadPage(): the complete text, including all HTML tags
> > etc.
> > #Instead of doing the encode/decode later, what I should do when
> > downloading the page through the BufferedReader is set the charset
> > to UTF-8, as you mentioned in your last mail. So instead of
> >
> >     BufferedReader reader =
> >         new BufferedReader(new InputStreamReader(
> >             pageUrl.openStream()));
> >
> > I should do this,
> >
> >     BufferedReader reader =
> >         new BufferedReader(new InputStreamReader(
> >             pageUrl.openStream(), Charset.forName("UTF-8")));
> >
> > right? And remove this conversion that I'm doing later,
> >
> >     byte[] utfEncodeByteArray = textOnly.getBytes();
> >     String utfString =
> >         new String(utfEncodeByteArray, Charset.forName("UTF-8"));
> >
> > This will make sure I'm not depending on the platform encoding,
> > right? This seems to fix my indexing issue. Now, regarding
> > searching, I don't need to mention any charset there, since I'm
> > using the StandardAnalyzer? As far as I know, Lucene stores the
> > chars as raw Unicode, so when I present my query in the same Unicode
> > format, Lucene will give me proper results. Currently I'm not using
> > the encoding for HTTP parameters; I'll use that and let you know.
> > Thank you very much.
> >
> > KK,
> >
> >
> > On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler wrote:
> >
> >> I forgot:
> >>
> >> > byte [] utfEncodeByteArray = textOnly.getBytes();
> >> > String utfString = new String(utfEncodeByteArray,
> >> >     Charset.forName("UTF-8"));
> >> >
> >> > here textonly is the text extracted from the downloaded page
> >>
> >> What is textonly here? A String? If yes, why encode it and then
> >> decode it again?
> >> The important thing is:
> >> Strings in Java are always independent of charsets (internally
> >> they are UTF-16). So if you convert a byte array to a String, you
> >> have to specify a charset (as you have done in the new String
> >> code). If you convert a String to a byte array, you must do the
> >> same.
> >>
> >> As mentioned in the previous mail, the same is true when converting
> >> InputStreams to Readers and Writers to OutputStreams (this can be
> >> done using the converter classes InputStreamReader and
> >> OutputStreamWriter).
> >>
> >> And: if you get a String from somewhere and it looks bad, you
> >> cannot repair it by converting the String to another encoding; it
> >> was already corrupted during the earlier conversion to a String.
> >>
> >> E.g. in a web application, use
> >> ServletRequest.setCharacterEncoding() to specify the input encoding
> >> of the HTTP parameters and so on.
> >>
> >> Uwe
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
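
For reference, the rule discussed in this thread (always name the charset
when crossing between bytes and Strings, and between streams and
Readers/Writers) can be sketched roughly as below. This is only a minimal
illustration using java.io and java.nio.charset. The class name
CharsetRoundTrip, the helper methods readUtf8/writeUtf8 and the sample
text are made up for this example and do not come from the thread; an
in-memory stream stands in for pageUrl.openStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
public class CharsetRoundTrip {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Decode a byte stream as UTF-8 text. The charset is passed to
    // InputStreamReader so the platform default is never consulted.
    public static String readUtf8(InputStream in) throws IOException {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, UTF8));
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Encode text as UTF-8 bytes through an OutputStreamWriter,
    // again with an explicit charset.
    public static byte[] writeUtf8(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, UTF8);
        writer.write(text);
        writer.close(); // flushes the encoder into the byte array
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Sample Devanagari text, written with escapes so the source
        // file's own encoding does not matter.
        String original = "\u0939\u093f\u0928\u094d\u0926\u0940";

        byte[] bytes = writeUtf8(original);
        String decoded = readUtf8(new ByteArrayInputStream(bytes));
        System.out.println(original.equals(decoded)); // prints true

        // Decoding the same bytes with the wrong charset corrupts the
        // text; the String cannot be "re-encoded" back afterwards
        // in general.
        String wrong = new String(bytes, Charset.forName("ISO-8859-1"));
        System.out.println(original.equals(wrong)); // prints false
    }
}
```

The same idea applies on the output side of a servlet: the response
writer needs to know the charset before any text is written, and the
browser needs to be told the same charset in the Content-Type header.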