lucene-java-user mailing list archives

From Paul Libbrecht
Subject Re: Designing a multilingual index
Date Wed, 31 Mar 2010 20:00:14 GMT

I'm doing exactly that.
And I think there's one more crucial advantage: multilingual queries.
If your user searches for "segment", you have no way to know which
language he is searching in. You do, however, have the user's language(s)
(through the browser's Accept-Language header, for example), so you can
infer that he meant to search in French but would also accept
matches in other languages, just with a lower boost.

So I "expand" the query "segment" in a French environment to:
   title-fr:segment^1.44 wor title-en:segment^1.2 ...
   wor text-fr:segment^1.2 wor text-en:segment^1.1 ...
(wor is my name for the weighted OR, which is just the normal "should"
clause of a boolean query)

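As a rough sketch of that expansion (not my actual code), here is a
small Java method that builds the expanded query string. It uses only
the standard library. The class name, the method name and the boost
values are assumptions made up for illustration; the clauses are joined
with plain spaces, which the classic Lucene query parser treats as
"should" clauses under its default OR operator. The term is assumed to
be a single word that needs no escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class QueryExpansion {
    // Hypothetical per-field boosts; tune them for your own index.
    static final Map<String, Double> FIELD_BOOSTS = new LinkedHashMap<>();
    static {
        FIELD_BOOSTS.put("title", 1.2);
        FIELD_BOOSTS.put("text", 1.0);
    }

    // Hypothetical extra boost for the user's first (preferred) language.
    static final double PREFERRED_LANGUAGE_BOOST = 1.2;

    // Expands one term into one boosted clause per field and language.
    // The languages are expected in order of preference, for example as
    // parsed from the Accept-Language header.
    static String expand(String term, List<String> languages) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Double> field : FIELD_BOOSTS.entrySet()) {
            for (int i = 0; i < languages.size(); i++) {
                double languageBoost = (i == 0) ? PREFERRED_LANGUAGE_BOOST : 1.0;
                double boost = field.getValue() * languageBoost;
                // Locale.ROOT keeps the decimal separator a dot.
                clauses.add(String.format(Locale.ROOT, "%s-%s:%s^%.2f",
                        field.getKey(), languages.get(i), term, boost));
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(expand("segment", Arrays.asList("fr", "en")));
    }
}
```

The same loop could just as well build a BooleanQuery clause by clause
instead of a string; the string form is only easier to show here.
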
Surprisingly, I haven't seen many people talk about "query expansion",
but I think it is rather systematic and it could become more a part of
the culture of search engines...


On 31 March 2010 at 18:20, David Vergnaud wrote:

> The second method I've thought of is to have all languages in the  
> same index and use different analyzers on fields that require  
> analysis. In order to do that, I was thinking of extending the names
> of the fields with the names of the languages -- e.g. "content-en"
> vs "content-fr" vs "content-xx" (for "no language recognized").
> Then using a customized analyzer, the name of the field would be
> parsed in the tokenStream method and the proper language-dependent
> analyzer would be selected.
> The drawback of this method, as I see it, is that the number of  
> fields in the index increases drastically, which in turn means that  
> building queries becomes rather cumbersome -- but still doable,  
> assuming (which also is the case) that I know the exact list of  
> languages I'm dealing with. Also, it means that Lucene would be  
> searching in non-existing fields in most documents, as I doubt many  
> of them would contain *all* languages. But it keeps the complete  
> information about one document gathered in one place and requires  
> searching only one index.
> As I said, I've already implemented the first method some time ago  
> and it works fine. I've only just thought about the second one when  
> I read about this PerFieldAnalyzerWrapper, which lets me do just
> what I want in the second method. Since my index won't be that big
> at first, I doubt either architecture would prove to be much more
> efficient than the other; however, I want to use a scalable design
> right from the start, so I was wondering whether some Lucene gurus
> might give me some insights as to what in their eyes would be the  
> better approach -- or whether there might be a different, much  
> better technique I haven't thought of.
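
The field-name parsing described in the quoted message could look
roughly like the sketch below. It is standard-library Java only and
does not touch the Lucene API: the class and method names are
hypothetical, and the "analyzers" are plain string labels standing in
for real Analyzer instances. The "xx" fallback follows the
"content-xx" convention from the quoted message.

```java
import java.util.HashMap;
import java.util.Map;

class FieldLanguageRouter {
    // Suffix used for "no language recognized", as in "content-xx".
    static final String UNKNOWN_LANGUAGE = "xx";

    // Returns the language suffix of a field name such as "content-fr",
    // or "xx" when the name has no usable suffix.
    static String languageOf(String fieldName) {
        int dash = fieldName.lastIndexOf('-');
        if (dash < 0 || dash == fieldName.length() - 1) {
            return UNKNOWN_LANGUAGE;
        }
        return fieldName.substring(dash + 1);
    }

    // Picks the entry registered for the field's language and falls back
    // to the "xx" entry when that language is not registered.
    static <A> A select(String fieldName, Map<String, A> byLanguage) {
        A chosen = byLanguage.get(languageOf(fieldName));
        return chosen != null ? chosen : byLanguage.get(UNKNOWN_LANGUAGE);
    }

    public static void main(String[] args) {
        // Placeholder labels; a real setup would map to Analyzer objects.
        Map<String, String> analyzers = new HashMap<>();
        analyzers.put("en", "english-analyzer");
        analyzers.put("fr", "french-analyzer");
        analyzers.put(UNKNOWN_LANGUAGE, "fallback-analyzer");

        System.out.println(select("content-fr", analyzers));
        System.out.println(select("content-de", analyzers));
    }
}
```

A customized analyzer would call something like select() with the field
name it receives and then delegate to the analyzer it gets back.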
