lucene-java-user mailing list archives

From Dominique Bejean <dominique.bej...@eolya.fr>
Subject Re: AW: Best practices for multiple languages?
Date Thu, 20 Jan 2011 09:46:07 GMT
Hi,

During a recent Solr project we needed to index documents in many 
languages. The natural solution with Lucene and Solr is to define one 
field per language. Each field is configured in the schema.xml file to 
use language-specific processing (tokenizing, stop words, stemming, 
...). This is not easy to manage if you have a lot of languages, and 
it means that 1) the search interface needs to know which language 
you are searching in, and 2) the search interface can't search in all 
languages at the same time.

So, I decided that the only solution was to index all languages in only 
one field.

Obviously, each language needs to be processed specifically. For this, I 
developed an analyzer that is in charge of redirecting content to the 
correct tokenizer, filters and stemmer according to its language. 
This analyzer is also used at query time. If the user specifies the 
language of the query, the query is processed by the appropriate 
tokenizer, filters and stemmer; otherwise the query is processed by a 
default tokenizer, filters and stemmer.
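
As a rough illustration of the dispatch idea described above (not the actual analyzer, whose code is not part of this message), the sketch below uses only the Java standard library. The class name, the toy stop-word lists and the suffix-stripping "stemmer" are all hypothetical stand-ins for real Lucene tokenizers, filters and stemmers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: one entry point that routes text to a per-language chain.
class LanguageDispatchSketch {

    /** Builds one chain: lowercase, split on non-letters, drop stop words, stem. */
    static Function<String, List<String>> chain(Set<String> stopWords,
                                                Function<String, String> stemmer) {
        return text -> {
            List<String> terms = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue; // skip empty splits and stop words
                }
                terms.add(stemmer.apply(token));
            }
            return terms;
        };
    }

    /** Toy stemmer (an assumption for the example): strips a suffix from longer tokens. */
    static String stripSuffix(String token, String suffix) {
        boolean longEnough = token.length() > suffix.length() + 2;
        return longEnough && token.endsWith(suffix)
                ? token.substring(0, token.length() - suffix.length())
                : token;
    }

    // Default chain: no stop words, no stemming.
    static final Function<String, List<String>> DEFAULT_CHAIN = chain(Set.of(), t -> t);

    // Language code -> chain; the codes and word lists are illustrative only.
    static final Map<String, Function<String, List<String>>> CHAINS = Map.of(
            "en", chain(Set.of("the", "a", "of"), t -> stripSuffix(t, "s")),
            "fr", chain(Set.of("le", "la", "les", "de"), t -> stripSuffix(t, "s")));

    /** Used for both documents and queries; a null or unknown language gets the default chain. */
    static List<String> analyze(String language, String text) {
        Function<String, List<String>> selected =
                language == null ? DEFAULT_CHAIN : CHAINS.getOrDefault(language, DEFAULT_CHAIN);
        return selected.apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("en", "The houses"));
        System.out.println(analyze("fr", "Les maisons"));
        System.out.println(analyze(null, "The houses"));
    }
}
```

Whatever chain is selected, the resulting terms all go into the same field, so one query can match documents in any language.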

With this solution:

1. I only need one field (or two if I want both stemmed and unstemmed 
processing)
2. The user can search in all documents regardless of their language

I hope this helps.

Dominique
www.zoonix.fr
www.crawl-anywhere.com



On 20/01/11 00:29, Bill Janssen wrote:
> Paul Libbrecht<paul@hoplahup.net>  wrote:
>
>> I did several changes of this sort and the precision and recall
>> measures went better in particular in presence of language-indication
>> failure which happened to be very common in our authoring environment.
> There are two kinds of failures:  no language, or wrong language.
>
> For no language, I fall back to StandardAnalyzer, so I should have
> results similar to yours.  For wrong language, well, I'm using OTS
> trigram-based language guessers, and they're pretty good these days.
>
>>>> Wouldn't it be better to prefer precise matches (a field that is
>>>> analyzed with StandardAnalyzer for example) but also allow matches that are
>>>> stemmed.
> Yes, I think it might improve things, but again, by how much?  Stemming is
> better than no stemming, in terms of recall.  But this approach would also
> improve precision.
>
> Bill
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
