lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chad Small" <Chad.Sm...@definityhealth.com>
Subject RE: Lucene with English and Spanish Best Practice?
Date Mon, 23 Aug 2004 12:36:44 GMT

Thanks for the info Grant.

<<
As for indexes, do you anticipate adding more fields later in Spanish? 
Is the content just a translation of the English, or do you have
separate conetent in Spanish?  Are your users querying in only one
language (cross-lingual) or are the Spanish speakers only querying
against Spanish content?
>>
Our fields are pretty much going to be one-for-one between English and Spanish (a translation
of current content from English to Spanish).  Something like title_en and title_sp, body_en
and body_sp, keywords_en and keywords_sp.  Our users will be querying cross-lingual.  So I
see your point, it looks like it would be easier if we added the Spanish fields to our current
indexes, then we wouldn't have to filter out same results between English and Spanish indexes.


<<
I am doing Arabic and English (and have done Spanish, French, and
Japanese in the past), although our cross-lingual system supports any
languages that you have resources for.
>>
Did you use Snowball for the Spanish?  Or is there just a Lucene Spanish Analyzer available
(I couldn't find one).  Or do people just use something like a plain old StandardAnalyzer
to index and query Spanish content?  I'm a little confused on the Snowball project, is it
a multi-language Stemmer Analyzer for Lucene?  We just use plan old Standard and Whitespace
Analyzers now for our English content.  Can we just use those same Analyzers for Spanish content?
 Or would it be better to use the Snowball project?

thanks,
chad.
  

-----Original Message-----
From: Grant Ingersoll [mailto:GSIngers@syr.edu]
Sent: Saturday, August 21, 2004 2:16 PM
To: lucene-user@jakarta.apache.org
Subject: Re: Lucene with English and Spanish Best Practice?


 I think the Snowball stuff works well, although I have only used the
English Porter stemmer implementation.

As for indexes, do you anticipate adding more fields later in Spanish? 
Is the content just a translation of the English, or do you have
separate conetent in Spanish?  Are your users querying in only one
language (cross-lingual) or are the Spanish speakers only querying
against Spanish content?

I am doing Arabic and English (and have done Spanish, French, and
Japanese in the past), although our cross-lingual system supports any
languages that you have resources for.  We lean towards separate
indexes, but mostly b/c they are based on separate content.  The key is
you have to be able to match up the analysis of the query with the
analysis of the index.  Having a mixed index may make this more
difficult.  If you have a mixed index would you filter out Spanish
results that had hits from an English query?  For instance, what if the
query was a term that was common to both languages (banana, mosquito,
etc.) or are you requiring the user to specify which fields they are
searching against.  I guess we really need to know more about how your
user is going to be interacting.

-Grant

>>> Chad.Small@definityhealth.com 8/20/2004 5:27:40 PM >>>
Hello,

I'm interested in any feedback from anyone who has worked through
implementing Internationalization (I18N) search with Lucene or has ideas
for this requirement.  Currently, we're using Lucene with straight
English and are looking to add Spanish to the mix (with maybe more
languages to follow).  

This is our current IndexWriter setup utilizing the
PerFieldAnalyzerWrapper:

   PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new
StandardAnalyzer());
   analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new
WhitespaceAnalyzer());
   analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are
English and Spanish Analyzers and IndexWriters?  Something like this:

PerFieldAnalyzerWrapper analyzerEnglish = new
PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new
WhitespaceAnalyzer());
analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish,
create);

PerFieldAnalyzerWrapper analyzerSpanish = new
PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new
WhitespaceAnalyzer());
analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish,
create);


Are multiple indexes or mirrors of each index then usually created for
every language?  We currently have 4 indexes that are all English. 
Would we then create 4 more that are Spanish?  Then at search time we
would determine the language and which set of indexes to search against,
English or Spanish.

Or another approach could be to add a Spanish field to the existing 4
indexes since most of the indexes have only one field that will be
translated from English to Spanish.


thanks a bunch,
chad.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org 
For additional commands, e-mail: lucene-user-help@jakarta.apache.org 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message