lucene-java-user mailing list archives

From Paul Libbrecht <p...@hoplahup.net>
Subject Re: Best practices for multiple languages?
Date Thu, 20 Jan 2011 21:56:20 GMT
Isn't this approach somewhat bad for term-frequency?

Words that appear in several languages would be a lot more frequent (hence less significant).

I still prefer the split-field method with proper query expansion.
That way, term frequency is evaluated on the corpus of a single language.

Dominique, in your case, at least if you're on the web, you have:
- the user's preferred language (if defined in a profile)
- the list of languages the browser says it accepts
That list can easily be limited to around eight, so that you cover any language the user
expects to search in.
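The split-field expansion above can be sketched as follows. This is a minimal plain-Java sketch, not Lucene code: the `text_xx` field names are hypothetical, and a real implementation would build the disjunction with Lucene's `BooleanQuery` rather than a query-syntax string.

```java
import java.util.List;
import java.util.stream.Collectors;

public class QueryExpansion {
    // Expand a single term into an OR over per-language fields,
    // e.g. "text_en:wine OR text_fr:wine". The language list comes
    // from the user's profile plus the browser's Accept-Language.
    static String expand(String term, List<String> languages) {
        return languages.stream()
                .map(lang -> "text_" + lang + ":" + term)
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(expand("wine", List.of("en", "fr", "de")));
    }
}
```

Because each `text_xx` field holds only one language's corpus, term frequency is computed per language, which is the point Paul makes above.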

paul


On 20 January 2011 at 10:46, Dominique Bejean wrote:

> Hi,
> 
> During a recent Solr project we needed to index documents in many languages. The natural
> solution with Lucene and Solr is to define one field per language. Each field is configured
> in the schema.xml file to use language-specific processing (tokenizing, stop words, stemming,
> ...). This is really not easy to manage if you have a lot of languages, and it means that
> 1) the search interface needs to know which language you are searching in, and 2) the search
> interface can't search in all languages at the same time.
> 
> So, I decided that the only solution was to index all languages in only one field.
> 
> Obviously, each language needs to be processed specifically. For this, I developed an
> analyzer that is in charge of redirecting content to the correct tokenizer, filters and
> stemmer according to its language. This analyzer is also used at query time. If the user
> specifies the language of the query, the query is processed by the appropriate tokenizer,
> filters and stemmer; otherwise the query is processed by a default tokenizer, filters and
> stemmer.
> 
> With this solution :
> 
> 1. I only need one field (or two if I want both stemmed and unstemmed processing)
> 2. The user can search in all documents regardless of their language
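The per-language routing Dominique describes can be sketched like this. It is a toy sketch, not his code: the "pipelines" here are stand-ins (lower-casing plus a crude suffix-stripping stemmer), where a real implementation would delegate to Lucene `TokenStream` chains per language.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class RoutingAnalyzer {
    // Toy per-language pipelines keyed by ISO language code (assumed
    // layout); real pipelines would be tokenizer + filters + stemmer.
    private static final Map<String, Function<String, List<String>>> PIPELINES = Map.of(
            "en", text -> stem(tokenize(text), "ing"),
            "fr", text -> stem(tokenize(text), "er"));

    // Default pipeline used when no language is specified or known.
    private static final Function<String, List<String>> DEFAULT = RoutingAnalyzer::tokenize;

    static List<String> tokenize(String text) {
        return List.of(text.toLowerCase().split("\\s+"));
    }

    // Crude stemmer: strip one known suffix if present.
    static List<String> stem(List<String> tokens, String suffix) {
        return tokens.stream()
                .map(t -> t.endsWith(suffix) ? t.substring(0, t.length() - suffix.length()) : t)
                .toList();
    }

    // Route to the language-specific pipeline, falling back to the default.
    static List<String> analyze(String text, String language) {
        return PIPELINES.getOrDefault(language, DEFAULT).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Running fast", "en")); // "running" stemmed to "runn"
    }
}
```

The same `analyze` entry point serves both index and query time, which is what lets a single shared field hold every language.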
> 
> I hope this helps.
> 
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
> 
> 
> 
> On 20/01/11 at 00:29, Bill Janssen wrote:
>> Paul Libbrecht <paul@hoplahup.net> wrote:
>> 
>>> I did several changes of this sort, and the precision and recall
>>> measures improved, in particular in the presence of language-indication
>>> failure, which happened to be very common in our authoring environment.
>> There are two kinds of failures:  no language, or wrong language.
>> 
>> For no language, I fall back to StandardAnalyzer, so I should have
>> results similar to yours.  For wrong language, well, I'm using
>> off-the-shelf trigram-based language guessers, and they're pretty good
>> these days.
>> 
>>>>> Wouldn't it be better to prefer precise matches (a field that is
>>>>> analyzed with StandardAnalyzer, for example) but also allow matches
>>>>> that are stemmed?
>> Yes, I think it might improve things, but again, by how much?  Stemming is
>> better than no stemming, in terms of recall.  But this approach would also
>> improve precision.
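The exact-plus-stemmed idea being discussed can be sketched as below. The field names and the boost value are assumptions for illustration; a real query would pair Lucene's `BooleanQuery` with `BoostQuery` rather than build a query-syntax string.

```java
public class ExactPlusStemmed {
    // Query both an unstemmed ("exact") field and a stemmed field,
    // boosting the exact field so precise matches rank first while
    // stemmed matches still contribute to recall.
    static String build(String rawTerm, String stemmedTerm, double exactBoost) {
        return "text_exact:" + rawTerm + "^" + exactBoost
             + " OR text_stem:" + stemmedTerm;
    }

    public static void main(String[] args) {
        System.out.println(build("wines", "wine", 2.0));
    }
}
```

This is also why Dominique's "two fields, stemmed and unstemmed" variant earlier in the thread is enough to get the precision benefit Bill describes.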
>> 
>> Bill
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 
> 



