From: Erik Hatcher
Subject: Re: Indexing multiple languages
Date: Fri, 3 Jun 2005 06:03:31 -0400
To: java-user@lucene.apache.org

On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:

> Btw, I did try running the lucene demo (web template) to index the HTML
> files after I added one including English and Chinese characters. I was
> not able to search for any Chinese in that HTML file (returned no hits).
> I wonder whether I need to change some of the java programs to index
> Chinese and/or accept Chinese as search term. I was able to search for
> the HTML file if I used English word that appeared in the added HTML
> file.

Bob - Andy provided thorough information on the StandardAnalyzer issue
(in short, it works on Unicode characters directly, not on byte
encodings).

As for the Lucene demo - you will have to adjust it to read the files in
the proper encoding. The IndexFiles program indexes files using the
platform default encoding, which won't be sufficient for your purpose.
The two classes to check are HtmlDocument and FileDocument; they read
the HTML and text files that the demo indexes.

    Erik
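
A minimal sketch of the kind of adjustment described above, assuming the
files being indexed are UTF-8. The class name EncodingAwareReader and its
open and readAll helpers are hypothetical and are not part of Lucene or
its demo. The only point is that a Reader built with an explicit charset
could stand in wherever the demo code currently opens a file with the
platform default encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingAwareReader {

    // Hypothetical helper (not a Lucene API): opens a file and decodes its
    // bytes with the given charset instead of the platform default.
    public static Reader open(Path file, Charset charset) throws IOException {
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset));
    }

    // Drains a Reader into a String, only so the decoded text can be inspected.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small mixed English/Chinese sample as UTF-8 bytes.
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "Lucene \u4e2d\u6587".getBytes(StandardCharsets.UTF_8));

        // Decode with the encoding the file was actually written in
        // (assumed here to be UTF-8).
        try (Reader reader = open(sample, StandardCharsets.UTF_8)) {
            System.out.println(readAll(reader));
        } finally {
            Files.delete(sample);
        }
    }
}
```

If the files use a different encoding (Big5 or GB2312, for example), the
charset passed to open would need to match; decoding with the wrong one
yields garbled characters that will not match a Chinese query term.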