Subject: Lucene with English and Spanish Best Practice?
Date: Fri, 20 Aug 2004 16:27:40 -0500
From: "Chad Small" <Chad.Small@definityhealth.com>
To: <lucene-user@jakarta.apache.org>

Hello,

I'm interested in feedback from anyone who has implemented
internationalized (I18N) search with Lucene, or who has ideas for this
requirement.  Currently, we're using Lucene with English only and are
looking to add Spanish to the mix (with maybe more languages to follow).

This is our current IndexWriter setup, using PerFieldAnalyzerWrapper:

   PerFieldAnalyzerWrapper analyzer =
       new PerFieldAnalyzerWrapper(new StandardAnalyzer());
   analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
   analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball, so that there are
separate English and Spanish Analyzers and IndexWriters?  Something like
this:

   PerFieldAnalyzerWrapper analyzerEnglish =
       new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
   analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
   analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writerEnglish =
       new IndexWriter(indexDir, analyzerEnglish, create);

   PerFieldAnalyzerWrapper analyzerSpanish =
       new PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
   analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
   analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writerSpanish =
       new IndexWriter(indexDir, analyzerSpanish, create);


Are multiple indexes, or mirrors of each index, then usually created for
every language?  We currently have 4 indexes that are all English.
Would we then create 4 more that are Spanish?  Then at search time we
would determine the language and which set of indexes to search against,
English or Spanish.
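
To make the per-language index idea concrete, here is a minimal sketch
of the search-time routing step only.  It uses plain Java and no Lucene
classes; the class name LanguageIndexRouter, the language codes "en" and
"es", and the directory paths are all hypothetical, and falling back to
a default language is an assumption rather than anything Lucene
requires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a language code to the index directories built for it. */
final class LanguageIndexRouter {
    private final Map<String, List<String>> indexDirsByLanguage = new HashMap<>();
    private final String defaultLanguage;

    LanguageIndexRouter(String defaultLanguage) {
        this.defaultLanguage = normalize(defaultLanguage);
    }

    /** Registers one index directory as belonging to the given language. */
    void register(String language, String indexDir) {
        indexDirsByLanguage
            .computeIfAbsent(normalize(language), key -> new ArrayList<>())
            .add(indexDir);
    }

    /** Returns the directories for the language, or those of the default language. */
    List<String> indexDirsFor(String language) {
        List<String> dirs = indexDirsByLanguage.get(normalize(language));
        if (dirs == null) {
            // Assumption: unknown languages search the default-language indexes.
            dirs = indexDirsByLanguage.getOrDefault(
                defaultLanguage, Collections.<String>emptyList());
        }
        return Collections.unmodifiableList(dirs);
    }

    private static String normalize(String language) {
        return language == null ? "" : language.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LanguageIndexRouter router = new LanguageIndexRouter("en");
        router.register("en", "indexes/en/products");   // hypothetical paths
        router.register("es", "indexes/es/products");
        System.out.println(router.indexDirsFor("es"));
    }
}
```

The directories returned for a language would be the ones opened for
searching, and the query would need to be analyzed with the same
analyzer that was used to build that language's indexes.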

Or another approach could be to add a Spanish field to the existing 4
indexes, since most of the indexes have only one field that will be
translated from English to Spanish.
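
The single-index alternative needs a naming scheme for the translated
field.  The sketch below shows one possible scheme in plain Java: a
language suffix on the field name, plus a table pairing each localized
field with a Snowball stemmer name.  The class LocalizedFields, the
"description" field, and the "_es" style suffix are hypothetical;
"English" and "Spanish" are the stemmer names used in the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical naming scheme for per-language copies of one translated field. */
final class LocalizedFields {
    private LocalizedFields() {}

    /** Returns e.g. "description_es" for ("description", "es"). */
    static String fieldFor(String baseField, String language) {
        return baseField + "_" + language.trim().toLowerCase(Locale.ROOT);
    }

    /** Builds a table of localized field name to Snowball stemmer name. */
    static Map<String, String> stemmerByField(
            String baseField, Map<String, String> stemmerByLanguage) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : stemmerByLanguage.entrySet()) {
            result.put(fieldFor(baseField, entry.getKey()), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> stemmers = new LinkedHashMap<>();
        stemmers.put("en", "English");
        stemmers.put("es", "Spanish");
        // prints {description_en=English, description_es=Spanish}
        System.out.println(stemmerByField("description", stemmers));
    }
}
```

Each entry in that table could then be registered on the existing
PerFieldAnalyzerWrapper with addAnalyzer(field, new
SnowballAnalyzer(stemmerName)), so that one IndexWriter per index handles
both languages and the search code picks the field by language.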


thanks a bunch,
chad.

