lucene-java-user mailing list archives

From Olivier Jaquemet <>
Subject Re: Multiple Language Indexing and Searching
Date Tue, 06 Sep 2005 07:29:59 GMT
For your usage, it seems to be the right approach, and I think the
StandardAnalyzer does the job pretty well whatever language it has to
deal with.
Note, though, that it only strips English stop words unless you supply a
different stop-word list at index time. And if you change the stop words
at index time, you must use the same list at query time, otherwise some
queries won't work well.
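To illustrate, a minimal sketch of supplying the same stop-word list at index and query time, written against the Lucene 1.4-era API (the abbreviated French stop-word list and the index path are purely illustrative):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class StopWordDemo {
    // Illustrative French stop words; use a complete list in practice.
    private static final String[] FRENCH_STOP_WORDS = {
        "le", "la", "les", "de", "des", "et", "un", "une"
    };

    public static void main(String[] args) throws Exception {
        // The SAME analyzer (hence the same stop-word list) must be
        // used both at index time and at query time.
        Analyzer analyzer = new StandardAnalyzer(FRENCH_STOP_WORDS);

        IndexWriter writer = new IndexWriter("/tmp/index-fr", analyzer, true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "le contenu de la page"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/index-fr");
        Query query = QueryParser.parse("la page", "contents", analyzer);
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hit(s)");
        searcher.close();
    }
}
```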

But in my case, each content (content in the sense of a CMS) is known to 
exist in multiple languages, and each of those languages *can* be 
indexed separately with no problem at all, so a dedicated analyzer could 
be used for each. So I was wondering whether my approach could be the 
right one, or whether it is overly complex and could introduce problems 
I cannot foresee... (My approach being: one index per language.)
Advantages are:
- You always have the same analyzer for a given index, so if you want to 
benefit from language-specific indexing capabilities (stemmer, filters, 
whatever), you can!
- Should you need to search across all languages, you just run the query 
against every index and still benefit from each analyzer.
Drawbacks are:
- You have to deal with as many indices as you have languages; then 
again, if you search in only one language, that becomes a performance 
advantage, I think.
- You have to merge results from the different indices, which is a 
problem when dealing with scores. Any suggestions?
- Unless I'm wrong, you cannot use a MultiSearcher, because only one 
analyzer can be specified at query-parsing time, not one analyzer per 
searcher (please correct me if I'm wrong).
- Others?
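The per-index approach could be sketched roughly as follows, against the Lucene 1.4-era API: parse the query once per language with that language's analyzer, search each index, and merge by score. The index paths, the analyzer mapping, and the naive score merge are all illustrative assumptions; in particular, scores from different indices are not normalized against each other, so this ordering is only approximate.

```java
import java.util.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PerLanguageSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical mapping: one index directory per language,
        // each paired with its language-specific analyzer.
        Map analyzers = new HashMap();
        analyzers.put("/tmp/index-en", new StandardAnalyzer());
        analyzers.put("/tmp/index-de", new GermanAnalyzer());

        List merged = new ArrayList(); // holds {score, document} pairs
        for (Iterator it = analyzers.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            String indexPath = (String) e.getKey();
            Analyzer analyzer = (Analyzer) e.getValue();

            // Parse with the analyzer matching this index's language.
            Query q = QueryParser.parse("search terms", "contents", analyzer);
            IndexSearcher searcher = new IndexSearcher(indexPath);
            Hits hits = searcher.search(q);
            for (int i = 0; i < hits.length(); i++) {
                merged.add(new Object[] { new Float(hits.score(i)), hits.doc(i) });
            }
            searcher.close();
        }

        // Naive merge: sort by raw score. Scores from different indices
        // are not directly comparable, so treat this as approximate.
        Collections.sort(merged, new Comparator() {
            public int compare(Object a, Object b) {
                Float sa = (Float) ((Object[]) a)[0];
                Float sb = (Float) ((Object[]) b)[0];
                return sb.compareTo(sa); // descending
            }
        });
    }
}
```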

I don't know whether the Lucene developers would agree, but from what 
I've seen browsing the ML archives, these multiple-language issues seem 
to arise quite often on the mailing list, and some articles along the 
lines of "best practices", "do's and don'ts" or "Lucene architecture in 
multiple-language environments" might be really nice to see :) If some 
of you have the time and the experience to write them, I'll be really 
thankful! :)
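For comparison, the alternative discussed further down this thread, one shared index with a language field and a query-time filter, could be sketched like this against the Lucene 1.4-era API (the field names and index path are hypothetical):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class LangFieldDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/index-all", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "some english text"));
        // Untokenized, indexed field recording the document's language.
        doc.add(Field.Keyword("lang", "en"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/index-all");
        Query q = QueryParser.parse("english", "contents", new StandardAnalyzer());
        // Restrict the search to one language with a QueryFilter.
        Filter langFilter = new QueryFilter(new TermQuery(new Term("lang", "en")));
        Hits hits = searcher.search(q, langFilter);
        System.out.println(hits.length() + " hit(s)");
        searcher.close();
    }
}
```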


Hacking Bear wrote:

> I have a similar problem to deal with. In fact, a lot of the time the 
> documents do not carry any language information, or they may contain text 
> in multiple languages. Further, the user would not like to always supply 
> this information, and may very well be interested in documents in 
> multiple languages.
> I think Google and other search engines allow indexing multi-language 
> documents. For example, if you google "Java", there will be many matching 
> documents in languages other than English.
> The only assumption we can make is that the document text is converted to 
> Unicode before being fed to Lucene.
> So I think the solution should be: (1) create one index for all languages; 
> (2) add an advisory attribute like "lang" to specify the language of the 
> document; if the language is unknown, just leave it empty or set it to 
> "ANY"; (3) based on the Unicode blocks of the incoming characters, 
> automatically switch among different analyzers to index the fragments of 
> the text; (4) during search, unless the user explicitly requests documents 
> in a certain language, return all matches regardless of language.
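A rough illustration of the detection half of step (3): guess the dominant script of a text fragment from its Unicode blocks, then pick an analyzer accordingly. This sketch uses only the standard library; the block-to-script mapping is deliberately crude and purely hypothetical, and the analyzer selection itself is left open.

```java
// Hypothetical sketch: guess the dominant script of a text fragment
// from Unicode blocks, to decide which analyzer to hand it to.
public class ScriptGuesser {
    public static String guessScript(String text) {
        int cjk = 0, cyrillic = 0, latin = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isLetter(c)) continue;
            Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
            if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA) {
                cjk++;
            } else if (block == Character.UnicodeBlock.CYRILLIC) {
                cyrillic++;
            } else if (block == Character.UnicodeBlock.BASIC_LATIN
                    || block == Character.UnicodeBlock.LATIN_1_SUPPLEMENT) {
                latin++;
            }
        }
        // Majority vote; ties and unrecognized scripts fall back to Latin.
        if (cjk > latin && cjk > cyrillic) return "cjk";
        if (cyrillic > latin && cyrillic > cjk) return "cyrillic";
        return "latin";
    }

    public static void main(String[] args) {
        System.out.println(guessScript("hello world"));   // latin
        System.out.println(guessScript("привет мир"));    // cyrillic
    }
}
```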
> I have browsed through the Lucene and contributed source code, but I 
> cannot tell which analyzer is suitable for use in (3). While the logic for 
> such an analyzer is probably not too complicated, it seems to demand quite 
> some Unicode knowledge to create one.
> Is my approach the right one? Is there an analyzer suitable to use?
> Thanks.
> - HB
> On 9/5/05, Olivier Jaquemet <> wrote: 
>>I'd like to go into detail regarding the issues that occur when you want 
>>to index and search content in multiple languages.
>>I have read the Lucene in Action book and many threads on this mailing 
>>list, the most interesting so far being this one:
>>The solution chosen/recommended by Doug Cutting in this message:
>>is number '2/':
>>having one index for all languages, one Document per content's language, 
>>with a field specifying its language, and using a query filter when 
>>searching. While I think it is a good solution:
>>- If you have N languages and you search for something in 1 language,
>>you are going to search an index N times too large.
>>Wouldn't it be better to have N indices for N languages? That way, each
>>index could benefit from its specialized analyzer, and if you need to
>>search in multiple languages, you just need to merge the results from
>>those different indices.
>>- If you have contents in multiple languages like we do (by that I
>>don't mean multiple contents each having its own language, but
>>multiple contents, each existing in many languages), you are going to
>>have an N to 1 Document/content relation in the index.
>>As far as updating, deleting, and searching in multiple languages are
>>concerned, wouldn't it be simpler to always keep a 1 to 1
>>Document/content relation in an index?
>>As you may have guessed, my original thought, even before I read those
>>threads, was that solution number 3 might be more flexible/modular
>>than the others. Of course it also has its drawbacks:
>>- performance issues when doing a multiple-language search, especially
>>when merging results from different indices
>>- more complex to code
>>- others?
>>Can you clarify this?
>>What solutions have all of you chosen so far regarding indexing and
>>searching of multiple contents in multiple languages?

Olivier Jaquemet <>
R&D Engineer, Jalios S.A.

