lucene-java-user mailing list archives

From David Vergnaud
Subject Re: Designing a multilingual index
Date Thu, 01 Apr 2010 08:14:01 GMT

Thanks, Paul, for your input. I'm going to try the "localized field" variant and see how it
works for me.

I think your idea of automatically boosting the user's language is neat, but it should definitely
be possible to disable this boosting. Most users have no idea about the language settings
in their browser, which drive the contents of the "Accept-Language" header. Here in
Switzerland, for example, there are many foreigners whose preferred language is not French,
German, or Italian, so forcing a boost on the user could definitely result in a poor user experience.
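
To make the "boost, but let the user switch it off" idea concrete, here is a minimal sketch in plain Java. The class name, the method name, and the mapping from the header's q-value to a boost (1.0 + q) are all assumptions made for illustration; nothing here is from Paul's implementation. It relies only on the JDK's Locale.LanguageRange.parse to read the header.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: derives a per-language boost from an Accept-Language
// header, with a switch that turns boosting off entirely.
class LanguageBoosts {

    static Map<String, Double> boostsFor(String acceptLanguage,
                                         List<String> indexedLanguages,
                                         boolean boostUserLanguages) {
        // Every indexed language starts with a neutral boost of 1.0.
        Map<String, Double> boosts = new LinkedHashMap<>();
        for (String lang : indexedLanguages) {
            boosts.put(lang, 1.0);
        }
        // Boosting disabled, or no header sent: all languages stay equal.
        if (!boostUserLanguages || acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return boosts;
        }
        List<Locale.LanguageRange> ranges;
        try {
            // The JDK returns the ranges ordered by descending weight (q-value).
            ranges = Locale.LanguageRange.parse(acceptLanguage);
        } catch (IllegalArgumentException malformedHeader) {
            return boosts; // treat a malformed header like a missing one
        }
        Set<String> seen = new HashSet<>();
        for (Locale.LanguageRange range : ranges) {
            // Keep only the primary subtag: "fr-ch" becomes "fr".
            String lang = range.getRange().split("-")[0];
            // Use the first (highest-weight) entry per language; ignore the rest.
            if (boosts.containsKey(lang) && seen.add(lang)) {
                // Assumed mapping from q-value to boost; tune to taste.
                boosts.put(lang, 1.0 + range.getWeight());
            }
        }
        return boosts;
    }
}
```

With the flag set to false, every language keeps a boost of 1.0, which is the opt-out asked for above.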

Does anyone have any technical arguments why the one (several indices) or the other (localized
fields in a single index) method might be better? 



----- Original Message ----
From: Paul Libbrecht
Sent: Wed, March 31, 2010 10:00:14 PM
Subject: Re: Designing a multilingual index


I'm doing exactly that.
And I think there's one more crucial advantage: multilingual queries. If your user searches
for "segment", you have no way to know which language he is searching in. Well, you do have
the user's language(s) (through the browser's Accept-Language header, for example), so you can
assume he meant to search in French but would also accept matches in other languages,
just less boosted.

So I "expand" the query "segment" in a French environment to:
  title-fr:segment^1.44 wor title-en:segment^1.2 ... wor text-fr:segment^1.2 wor text-en:segment^1.1
("wor" is my name for the weighted-or, which is the normal behaviour of a "should" clause in a boolean query)
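
The expansion described above could be sketched roughly as follows, in plain Java with no Lucene dependency. The class and method names are hypothetical, and computing each boost as field weight times language weight is an assumption: it reproduces the 1.44 and 1.2 figures in the example but not the 1.1 given for text-en, so the actual formula is evidently somewhat different. The output uses the `field:term^boost` syntax of Lucene's classic query parser, where space-separated clauses are optional ("should") under the default OR operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: expands one user term into boosted clauses over
// localized fields named "<field>-<language>".
class QueryExpander {

    static String expand(String term,
                         Map<String, Double> fieldWeights,
                         Map<String, Double> languageWeights) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Double> field : fieldWeights.entrySet()) {
            for (Map.Entry<String, Double> lang : languageWeights.entrySet()) {
                // Assumption: the clause boost is the product of both weights.
                double boost = field.getValue() * lang.getValue();
                // Locale.ROOT keeps the decimal separator a dot on every system.
                clauses.add(String.format(Locale.ROOT, "%s-%s:%s^%.2f",
                        field.getKey(), lang.getKey(), term, boost));
            }
        }
        // Note: the term is not escaped here; real input would need escaping
        // before being handed to a query parser.
        return String.join(" ", clauses);
    }
}
```

Passing insertion-ordered maps (for example LinkedHashMap) with title=1.2, text=1.0 and fr=1.2, en=1.0 yields clauses in the same order as the example: title-fr, title-en, text-fr, text-en.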

Surprisingly, I haven't seen many people talk about "query expansion", but I think it is rather
systematic and could become a bigger part of search-engine culture...


On 31 March 2010 at 18:20, David Vergnaud wrote:

> The second method I've thought of is to have all languages in the same index and use
> different analyzers on fields that require analysis. In order to do that, I was thinking of
> extending the names of the fields with the names of the languages -- like e.g. "content-en"
> vs "content-fr" vs "content-xx" (for "no language recognized"). Then, using a customized analyzer,
> the name of the field would be parsed in the tokenStream method and the proper language-dependent
> analyzer would be selected.
> The drawback of this method, as I see it, is that the number of fields in the index increases
> drastically, which in turn means that building queries becomes rather cumbersome -- but still
> doable, assuming (which is also the case) that I know the exact list of languages I'm dealing
> with. Also, it means that Lucene would be searching in non-existent fields in most documents,
> as I doubt many of them would contain *all* languages. But it keeps the complete information
> about one document gathered in one place and requires searching only one index.
> As I said, I've already implemented the first method some time ago and it works fine.
> I've only just thought about the second one when I read about this PerFieldAnalyzerWrapper,
> which lets me do just what I want in the second method. Since my index won't be that big
> at first, I doubt either architecture would prove to be much more efficient than the other.
> However, I want to use a scalable design right from the start, so I was wondering whether
> some Lucene gurus might give me some insights as to what in their eyes would be the better
> approach -- or whether there might be a different, much better technique I haven't thought
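
The field-name dispatch described in the quoted message (parse the language suffix, then pick the matching analyzer) might look roughly like this. It is a Lucene-free sketch: the type parameter A stands in for an analyzer, and the class and method names are hypothetical. With PerFieldAnalyzerWrapper one would instead register an analyzer for each concrete field name, so this only illustrates the suffix-parsing variant.

```java
import java.util.Map;

// Hypothetical helper: picks a language-specific analyzer from a field
// name of the form "<field>-<language>", e.g. "content-fr".
class LocalizedFieldDispatch {

    // Marker used for "no language recognized", as in "content-xx".
    static final String UNKNOWN = "xx";

    // Returns the text after the last '-' or UNKNOWN if there is none.
    static String languageOf(String fieldName) {
        int dash = fieldName.lastIndexOf('-');
        if (dash < 0 || dash == fieldName.length() - 1) {
            return UNKNOWN;
        }
        return fieldName.substring(dash + 1);
    }

    // A stands in for whatever analyzer type is in use.
    static <A> A select(String fieldName, Map<String, A> analyzersByLanguage, A fallback) {
        A analyzer = analyzersByLanguage.get(languageOf(fieldName));
        // Unknown or unsupported languages fall back to a default analyzer.
        return analyzer != null ? analyzer : fallback;
    }
}
```

Fields without a suffix (an "id" field, say) and fields tagged "xx" both end up with the fallback, which keeps the known-language list as the only thing to maintain.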

To unsubscribe, e-mail:
For additional commands, e-mail:

