List-Id: "Lucene Users List"
Subject: Lucene with English and Spanish Best Practice?
Date: Fri, 20 Aug 2004 16:27:40 -0500
From: "Chad Small"

Hello,

I'm interested in feedback from anyone who has implemented internationalized (I18N) search with Lucene, or who has ideas for this requirement. Currently we use Lucene with English only and are looking to add Spanish to the mix (with maybe more languages to follow).

This is our current IndexWriter setup, which uses the PerFieldAnalyzerWrapper:

    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball, so that there are separate English and Spanish Analyzers and IndexWriters? Something like this:

    PerFieldAnalyzerWrapper analyzerEnglish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
    analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish, create);

    PerFieldAnalyzerWrapper analyzerSpanish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
    analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish, create);

Are multiple indexes, or mirrors of each index, then usually created for every language? We currently have 4 indexes that are all English. Would we then create 4 more that are Spanish? Then at search time we would determine the language and which set of indexes to search against, English or Spanish.
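
To make the search-time routing concrete, here is a minimal sketch in plain Java with no Lucene calls. The class name LanguageIndexRouter, the language codes, and the index directory names are all hypothetical, and falling back to English for an unknown language is an assumption rather than a requirement we have settled on. The stemmer names are the ones passed to SnowballAnalyzer above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps a language code to the set of indexes to search
// and to the Snowball stemmer name used when those indexes were built.
final class LanguageIndexRouter {

    // Assumption: unknown or missing languages fall back to English.
    static final String DEFAULT_LANGUAGE = "en";

    // Hypothetical directory names: one mirror of the 4 indexes per language.
    private static final Map<String, List<String>> INDEXES_BY_LANGUAGE = Map.of(
            "en", List.of("index1-en", "index2-en", "index3-en", "index4-en"),
            "es", List.of("index1-es", "index2-es", "index3-es", "index4-es"));

    // Stemmer names as passed to SnowballAnalyzer in the post above.
    private static final Map<String, String> STEMMER_BY_LANGUAGE = Map.of(
            "en", "English",
            "es", "Spanish");

    private LanguageIndexRouter() {
    }

    // Normalizes the code and falls back to the default for unknown languages.
    static String resolve(String languageCode) {
        if (languageCode == null) {
            return DEFAULT_LANGUAGE;
        }
        String key = languageCode.trim().toLowerCase(Locale.ROOT);
        return INDEXES_BY_LANGUAGE.containsKey(key) ? key : DEFAULT_LANGUAGE;
    }

    // Returns the set of index directories to search for the given language.
    static List<String> indexesFor(String languageCode) {
        return INDEXES_BY_LANGUAGE.get(resolve(languageCode));
    }

    // Returns the stemmer name that must match the one used at index time.
    static String stemmerFor(String languageCode) {
        return STEMMER_BY_LANGUAGE.get(resolve(languageCode));
    }

    public static void main(String[] args) {
        System.out.println(indexesFor("es"));
        System.out.println(stemmerFor("es"));
    }
}
```

The point of keeping both lookups behind one resolve step is that the query would be analyzed with the same stemmer that built the indexes it is run against.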

Or another approach could be to add a Spanish field to the existing 4 indexes, since most of the indexes have only one field that will be translated from English to Spanish.

thanks a bunch,
chad.