From: "Prasad Deshpande (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Date: Thu, 3 Feb 2011 05:49:29 +0000 (UTC)
Message-ID: <1675175669.6732.1296712169517.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Created: (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.

Non UTF-8 text files having other than English text (Japanese/Hebrew) are not getting indexed correctly.
---------------------------------------------------------------------------= ---------------------------- Key: SOLR-2346 URL: https://issues.apache.org/jira/browse/SOLR-2346 Project: Solr Issue Type: Bug Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4.1 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Wind= ows XP SP1, Machine is booted in Japanese Locale. Reporter: Prasad Deshpande Priority: Critical I am able to successfully index/search non-Engilsh files (like Hebrew, Japa= nese) which was encoded in UTF-8. However, When I tried to index data which= was encoded in local encoding like Big5 for Japanese I could not see the d= esired results. The contents after indexing looked garbled for Big5 encoded= document when I searched for all indexed documents. When I index attached = non utf-8 file it indexes in following way - - - =EF=BF=BD=EF=BF=BD =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF= =BF=BD - Big5 - zh - zh - 17 - text/plain doc2 Here you said it index file in UTF8 however it seems that non UTF8 file get= s indexed in Big5 encoding. Here I tried fetching indexed data stream in Big5 and converted in UTF8. String id =3D (String) resulDocument.getFirstValue("attr_content"); byte[] bytearray =3D id.getBytes("Big5"); String utf8String =3D new String(bytearray, "UTF-8"); It does not gives expected results. When I index UTF-8 file it indexes like following - - =E3=83=9E=E3=82=A4 =E3=83=8D=E3=83=83=E3=83=88=E3=83=AF=E3=83=BC=E3= =82=AF - UTF-8 - text/plain - sample_jap_unicode.txt - 28 - myfile - text/plain doc2 So, I can index and search UTF-8 data. --=20 This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org
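
The conversion attempted above can be reproduced outside Solr. The following is only a minimal sketch, not Solr or Tika code: the class name CharsetRoundTrip and its helper methods are hypothetical, the sample text is an assumption chosen because it is representable in Big5, and the JVM is assumed to ship the Big5 charset. It suggests that a String which already contains U+FFFD replacement characters cannot be repaired by re-encoding it, so the bytes would have to be decoded with the right charset when the file is first read.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Solr or Tika.
class CharsetRoundTrip {

    // Decode raw file bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // The conversion attempted in the report: re-encode an already decoded
    // String as Big5, then reinterpret those bytes as UTF-8.
    static String reencode(String indexed) {
        byte[] bytes = indexed.getBytes(Charset.forName("Big5"));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5");

        // Assumed sample text (two CJK characters that exist in Big5),
        // written with Unicode escapes so the source file stays ASCII.
        String original = "\u4e2d\u6587";
        byte[] fileBytes = original.getBytes(big5);

        // Decoding with the charset the bytes were written in restores the text.
        String correct = decode(fileBytes, big5);

        // Decoding the same bytes as UTF-8 yields U+FFFD replacement characters.
        String garbled = decode(fileBytes, StandardCharsets.UTF_8);

        System.out.println(correct.equals(original));           // expected: true
        System.out.println(garbled.indexOf('\uFFFD') >= 0);     // expected: true
        System.out.println(reencode(garbled).equals(original)); // expected: false
    }
}
```

If this sketch reflects what happens during extraction, the original bytes are already lost by the time attr_content is stored, which would explain why the getBytes("Big5") round trip does not give the expected results.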