lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Libbrecht <>
Subject Re: Best practices for multiple languages?
Date Wed, 19 Jan 2011 18:16:25 GMT
Because it does not find "junks" when you search "junk".
Or... chevaux when you search cheval.


Le 19 janv. 2011 à 18:59, Luca Rondanini a écrit :

> why not just using the StandardAnalyzer? it works pretty well even with
> Asian languages!
> On Wed, Jan 19, 2011 at 12:23 AM, Shai Erera <> wrote:
>> If you index documents, each in a different language, but all its fields
>> are
>> of the same language, then what you can do is the following:
>> Create separate indexes per language
>> -------------------------------------------------------
>> This will work and is not too hard to set up. Requires some maintenance
>> code
>> (e.g. directing a search request against the relevant index) but nothing
>> too
>> complicated. The advantage of using this approach is that you don't risk
>> running into issues like search for "die" when the language is German, yet
>> you will find documents in English indexed w/ that word. So your searches
>> are "language safe". A disadvantage is that if you ever require to do
>> cross-language operation, like search two languages, you need to do search
>> federation which is less good. Also, maintenance becomes a slight pain,
>> because you e.g. need to optimize multiple indexes, make sure they don't
>> try
>> to optimize at once, resulting in a sudden burst of IO.
>> Create one index
>> -------------------------
>> Here, you'd use IndexWriter.addDocument(doc, analyzer) method and pass the
>> proper Analyzer per the doc's language. That way, all your documents are
>> located in the same index so administration is really simple. They also
>> don't step on each other toes - each document is analyzed exactly as it
>> should. You might get into weird situations like the "die" example
>> (fetching
>> a document in incorrect language), but that's easily solvable by indexing
>> for each document a "language" field and use it as a Filter during the
>> search. You can cache that Filter so that its posting list isn't traversed
>> for every query but instead only once.
>> We use the second approach and we're required to support 32 languages.
>> While
>> in most deployments the number never exceeds 3-4 languages, I know of some
>> that handle > 10. If you're careful enough, it just works.
>> Hope this helps.
>> Shai
>> On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht <> wrote:
>>> But for this, you need a skillfully designed:
>>> - set of fields
>>> - multiplexing analyzer
>>> - query expansion
>>> In one of my projects, we do not split language by fields and it's a
>>> pain... I'm having recurring issues in one sense or the other.
>>> - the "die" example that Oti s mentioned is a good one: stop-word in
>>> German, essential verb in English
>>> - I had recently issues with the contribution of the word Fourier (for
>> the
>>> name of series): in English it stays fourier, in French in becomes fouri.
>>> So: if the resource is contributed in French, the indexed value is fouri,
>>> English seekers won't find it; if the resource is contributed in English,
>>> French seekers won't find it.
>>> So my last lesson: always have a whitespace-lowercase unstemmed field
>> also
>>> at hand and prefer it over the others in your query expansion.
>>> A wiki page should probably be made.
>>> paul
>>> Le 19 janv. 2011 à 07:53, Vinaya Kumar Thimmappa a écrit :
>>>> I think we should be using lucene with snowball jar's which means one
>>> index for all languages (ofcourse size of index is always a matter of
>>> concerns).
>>>> Hope this helps.
>>>> -vinaya
>>>> On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
>>>>> What is the "best practice" to support multiple languages, i.e.
>>> Lucene-Documents that have multiple language content/fields?
>>>>> Should
>>>>> a) each language be indexed in a seperate index/directory or should
>>>>> b) the Documents (in a single directory) hold the diverse localized
>>> fields?
>>>>> We most often will be searching "language dependent" which (at least
>>> performance wise) mandates one-directory-per-language...
>>>>> Any (lucene specific) white papers on this topic?
>>>>> Thx in advance
>>>>> Clemens
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> For additional commands, e-mail:
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message