From: Eric Chow
To: java-user@lucene.apache.org
Date: Mon, 11 Apr 2005 18:01:50 +0800
Subject: Urgent, please help, index/search in UTF-8 ???

Hello,

I am a beginner with Lucene.

My files contain text in several languages (English, Chinese, Portuguese, Japanese, and other Asian, non-Latin languages), often mixed within a single file. Therefore, I have to save the contents as UTF-8.

I am developing a web-based search engine. I use Lucene to index those files and search them from a web page. The charset of the web page is UTF-8, but searches return nothing.

I tried several analyzers (CJKAnalyzer, ChineseAnalyzer, StandardAnalyzer, SimpleAnalyzer), and all of them failed.

Finally, I tested with the original charsets instead. For example, with the Chinese contents saved as Big5, searching works very well. The English contents, of course, are no problem. But I cannot get it to work with UTF-8 as the charset for the documents.

Any suggestions or examples?

Best regards,
Eric
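
The message does not show how the files are read before indexing, so the following is only a guess at one possible cause. If the UTF-8 bytes are decoded with the JVM's platform default charset (or with any charset other than UTF-8), the indexed terms will not match what a UTF-8 web page submits. The sketch below uses only the JDK; the class name Utf8ReaderSketch and the helpers openReader, readAll, and decode are hypothetical names made up for illustration. It decodes the same bytes once as UTF-8 and once as ISO-8859-1 to show the mismatch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderSketch {

    // Wrap a byte stream in a Reader that decodes with an explicit charset
    // instead of relying on the JVM's platform default.
    static Reader openReader(InputStream in, Charset charset) {
        return new InputStreamReader(in, charset);
    }

    // Drain a Reader into a String and close it (hypothetical helper).
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Decode raw bytes with the given charset via the Reader path above.
    static String decode(byte[] bytes, Charset charset) {
        return readAll(openReader(new ByteArrayInputStream(bytes), charset));
    }

    public static void main(String[] args) {
        // Mixed-language sample text, standing in for one of the files.
        String original = "Lucene \u4e2d\u6587 portugu\u00eas \u65e5\u672c\u8a9e";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
        String decodedAsLatin1 = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The matching charset round-trips the text; the wrong one does not.
        System.out.println(decodedAsUtf8.equals(original));   // true
        System.out.println(decodedAsLatin1.equals(original)); // false
    }
}
```

If this assumption applies, the text handed to Lucene at index time would come from a Reader opened this way, for example over a FileInputStream. The search side would need the same care: in a servlet container, calling request.setCharacterEncoding("UTF-8") before reading any parameters is one way to make sure the query string is decoded as UTF-8 as well.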