lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chad Small" <>
Subject Lucene with English and Spanish Best Practice?
Date Fri, 20 Aug 2004 21:27:40 GMT

I'm interested in any feedback from anyone who has worked through implementing Internationalization
(I18N) search with Lucene or has ideas for this requirement.  Currently, we're using Lucene
with straight English and are looking to add Spanish to the mix (with maybe more languages
to follow).  

This is our current IndexWriter setup utilizing the PerFieldAnalyzerWrapper:

   PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
   analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
   analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are English and Spanish Analyzers
and IndexWriters?  Something like this:

PerFieldAnalyzerWrapper analyzerEnglish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish, create);

PerFieldAnalyzerWrapper analyzerSpanish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish, create);

Are multiple indexes or mirrors of each index then usually created for every language?  We
currently have 4 indexes that are all English.  Would we then create 4 more that are Spanish?
 Then at search time we would determine the language and which set of indexes to search against,
English or Spanish.

Or another approach could be to add a Spanish field to the existing 4 indexes since most of
the indexes have only one field that will be translated from English to Spanish.

thanks a bunch,

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message