lucene-java-user mailing list archives

From Bill Janssen <jans...@parc.com>
Subject Re: AW: Best practices for multiple languages?
Date Thu, 20 Jan 2011 09:30:55 GMT
Dominique Bejean <dominique.bejean@eolya.fr> wrote:

> Hi,
> 
> During a recent Solr project we needed to index documents in a lot of
> languages. The natural solution with Lucene and Solr is to define one
> field per language. Each field is configured in the schema.xml file
> to use language-specific processing (tokenizing, stop words,
> stemmer, ...). This is really not easy to manage if you have a lot of
> languages, and it means that 1) the search interface needs to know
> which language you are searching in, and 2) the search interface can't
> search in all languages at the same time.
> 
> So, I decided that the only solution was to index all languages in
> only one field.
> 
> Obviously, each language needs to be processed specifically. For this,
> I developed an analyzer that is in charge of redirecting content to the
> correct tokenizer, filters and stemmer according to its
> language. This analyzer is also used at query time. If the user
> specifies the language of the query, the query is processed by the
> appropriate tokenizer, filters and stemmer; otherwise the query is
> processed by a default tokenizer, filters and stemmer.
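
For readers following the thread, the routing described in the quoted
paragraph can be sketched roughly as below. This is only an
illustration in plain Java, not Dominique's analyzer and not a Lucene
API: the class LanguageRoutingAnalyzer, the toy tokenize and
stripSuffix helpers, and the "en" language code are all assumptions
made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a language-routing analyzer; not a Lucene class.
class LanguageRoutingAnalyzer {
    // One analysis chain (text -> tokens) per language code.
    private final Map<String, Function<String, List<String>>> chains = new HashMap<>();
    // Chain used when the language is unknown or has no registered chain.
    private final Function<String, List<String>> defaultChain;

    LanguageRoutingAnalyzer(Function<String, List<String>> defaultChain) {
        this.defaultChain = defaultChain;
    }

    void register(String languageCode, Function<String, List<String>> chain) {
        chains.put(languageCode, chain);
    }

    // Used at both index time and query time; language may be null.
    List<String> analyze(String language, String text) {
        Function<String, List<String>> chain =
                language == null ? defaultChain : chains.getOrDefault(language, defaultChain);
        return chain.apply(text);
    }

    // Toy tokenizer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy "stemmer": strips one fixed suffix from sufficiently long tokens.
    static List<String> stripSuffix(List<String> tokens, String suffix) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            boolean strip = t.endsWith(suffix) && t.length() > suffix.length() + 2;
            out.add(strip ? t.substring(0, t.length() - suffix.length()) : t);
        }
        return out;
    }

    public static void main(String[] args) {
        LanguageRoutingAnalyzer analyzer =
                new LanguageRoutingAnalyzer(LanguageRoutingAnalyzer::tokenize);
        analyzer.register("en", text -> stripSuffix(tokenize(text), "s"));
        System.out.println(analyzer.analyze("en", "Indexed documents"));
        System.out.println(analyzer.analyze(null, "Indexed documents"));
    }
}
```

The same analyze call serves documents and queries, so a query with no
declared language goes through the default chain, as in the quoted
description.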

I'm not sure how much this helps.  My query processing is the same as
yours, but I only index the document with a single analyzer, based on
the language determination.  With your approach, multiple analyses are
all mixed together in a single field, so I'd expect a lower precision
score, due to words that accidentally stem to the same root in multiple
different languages.
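
To make the concern concrete, here is a rough sketch in plain Java (no
Lucene involved) of how one shared field can mix terms from different
languages, compared with the one-field-per-language layout mentioned
earlier in the thread. The two stemmers are deliberately crude toys,
and the class name, field names and sample documents are assumptions
invented for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of cross-language stem collisions.
class SharedFieldCollision {
    // Toy English stemmer: drop a trailing "s" ("gifts" -> "gift").
    static String stemEnglish(String token) {
        return token.endsWith("s") && token.length() > 3
                ? token.substring(0, token.length() - 1) : token;
    }

    // Toy German stemmer: drop a trailing "e" ("gifte" -> "gift").
    static String stemGerman(String token) {
        return token.endsWith("e") && token.length() > 3
                ? token.substring(0, token.length() - 1) : token;
    }

    static String stem(String language, String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        switch (language) {
            case "en": return stemEnglish(lower);
            case "de": return stemGerman(lower);
            default: return lower; // unknown language: no stemming
        }
    }

    // Adds each stemmed token of the text to a tiny inverted index
    // keyed by "field:term".
    static void index(Map<String, Set<String>> postings, String field,
                      String language, String docId, String text) {
        for (String token : text.split("\\s+")) {
            String key = field + ":" + stem(language, token);
            postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
        }
    }

    static Set<String> search(Map<String, Set<String>> postings, String field,
                              String language, String queryTerm) {
        return postings.getOrDefault(field + ":" + stem(language, queryTerm),
                Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> shared = new HashMap<>();
        index(shared, "text", "en", "en-1", "birthday gifts");
        index(shared, "text", "de", "de-1", "starke Gifte"); // German: "strong poisons"

        Map<String, Set<String>> perLanguage = new HashMap<>();
        index(perLanguage, "text_en", "en", "en-1", "birthday gifts");
        index(perLanguage, "text_de", "de", "de-1", "starke Gifte");

        // Shared field: the English query also hits the German document.
        System.out.println(search(shared, "text", "en", "gifts"));
        // Per-language field: only the English document matches.
        System.out.println(search(perLanguage, "text_en", "en", "gifts"));
    }
}
```

In the shared field both documents end up under the term "gift", so an
English query for "gifts" also returns the German document about
poisons. How often this happens with real stemmers and real
collections is a separate question that the sketch does not answer.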

Bill

> 
> With this solution:
> 
> 1. I only need one field (or two if I want both stemmed and unstemmed
> processing)
> 2. The user can search in all documents regardless of their language
> 
> I hope this helps.
> 
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
> 
> 
> 
> On 20/01/11 00:29, Bill Janssen wrote:
> > Paul Libbrecht<paul@hoplahup.net>  wrote:
> >
> >> I made several changes of this sort and the precision and recall
> >> measures improved, particularly in the presence of language-indication
> >> failure, which happened to be very common in our authoring environment.
> > There are two kinds of failures:  no language, or wrong language.
> >
> > For no language, I fall back to StandardAnalyzer, so I should have
> > results similar to yours.  For wrong language, well, I'm using OTS
> > trigram-based language guessers, and they're pretty good these days.
> >
> >>>> Wouldn't it be better to prefer precise matches (a field that is
> >>>> analyzed with StandardAnalyzer for example) but also allow matches
> >>>> that are stemmed?
> > Yes, I think it might improve things, but again, by how much?  Stemming is
> > better than no stemming, in terms of recall.  But this approach would also
> > improve precision.
> >
> > Bill
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
