From: Chris Hostetter
To: java-user@lucene.apache.org
Date: Wed, 14 Feb 2007 00:49:48 -0800 (PST)
Subject: Re: encoding question.

Internally, Lucene deals with pure Java Strings. When writing those strings to disk and reading them back, Lucene always uses the stock Java "modified UTF-8" format, regardless of what your file.encoding system property may be.

Typically, when people have encoding problems in their Lucene applications, the origin of the problem is in the way they fetch the data before indexing it. If you can make a String object, System.out.println that string, and see what you expect, then handing that string to Lucene as a field value should work fine.

What exactly is the "value" object you are calling getBytes on? If it's another String, then you've already got serious problems: I can't imagine any situation where fetching the bytes from a String in one charset and using those bytes to construct another String (either in a different charset, or in the system default charset) would make any sense at all.

Wherever your original binary data is coming from (files on disk, a network socket, etc.), that's when you should be converting those bytes into chars, using whatever charset you know those bytes represent.

: Date: Wed, 14 Feb 2007 09:16:58 +0330
: From: Mohammad Norouzi
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: encoding question.
:
: Hi
: I want to index data with utf-8 encoding, so when adding field to a document
: I am using the code new String(value.getBytes("utf-8"))
: in the other hand, when I am going to search I was using the same snippet
: code to convert to utf-8 but it did not work so finally I found somewhere
: that had been said to use new String(valueToSearch.getBytes("cp1252"),"UTF8")
: and it worked fine but I still has some problem.
: first, some characters are weird when I get result from lucene, It seems it
: is in cp1252 encoding.
: second, if the java environment property "file.encoding" not been cp1252 the
: result is completely in incorrect encoding. so I must change this property
: using System.setProperty("file.encoding","cp1252")
:
: is lucene neglect my utf-8 encoding and proceed indexing data using cp1252?
: how can I correct weird characters I received by searching?
:
: Thank you very much in advance.
: --
: Regards,
: Mohammad
:

-Hoss
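
The advice in the reply (decode bytes once, at the point where they enter the program, with the charset they are known to be in) could be sketched roughly as follows. This is only an illustration: the class and method names (CharsetDemo, decode, roundTripThroughWrongCharset) are hypothetical, StandardCharsets assumes Java 7 or later, and ISO-8859-1 stands in for cp1252 because every JDK is guaranteed to support it. The byte array stands in for data read from a file or socket.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or of the original poster's code.
final class CharsetDemo {

    /** Decode raw bytes with the charset they are known to be encoded in. */
    public static String decode(byte[] raw, Charset knownCharset) {
        return new String(raw, knownCharset);
    }

    /**
     * The pattern the reply warns against: take the UTF-8 bytes of an existing
     * String and decode them with some other charset.
     */
    public static String roundTripThroughWrongCharset(String value, Charset wrong) {
        return new String(value.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written with an escape so the
        // example does not depend on the encoding of this source file.
        String original = "caf\u00e9";

        // Stand-in for bytes arriving from a file or socket, known to be UTF-8.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Correct: decode once, with the charset the bytes are actually in.
        String decoded = decode(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original)); // prints "true"

        // Incorrect: the two UTF-8 bytes of the accented e become two chars.
        String mangled = roundTripThroughWrongCharset(original, StandardCharsets.ISO_8859_1);
        System.out.println(mangled.equals(original)); // prints "false"
        System.out.println(mangled.length());         // prints "5"
    }
}
```

A String decoded this way can be handed to Lucene as a field value directly; no further getBytes calls should be needed on either the indexing or the search side.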