David,
I'm doing exactly that.
And I think it has one more crucial advantage: multilingual queries.
If your user searches for "segment", you have no way to know which
language he is searching in. Well, you do have the user's language(s)
(through the browser's Accept-Language header, for example), so you can
guess that he meant to search in French, while accepting that he also
wants matches in other languages, just boosted less.
So, in a French environment, I "expand" the query "segment" to:
title-fr:segment^1.44 wor title-en:segment^1.2 ... wor
text-fr:segment^1.2 wor text-en:segment^1.1 ...
("wor" is my name for the weighted or, which is what you normally get
from the "should" clauses of a boolean query.)
Surprisingly, I haven't seen many people talk about "query expansion",
but I think it can be applied rather systematically and could become a
bigger part of search-engine culture...
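
A minimal sketch of such an expansion, in plain Java with no Lucene
dependency. The class name QueryExpansion, the expand method, and the
idea of multiplying a field weight by a language weight are assumptions
made for illustration; the weights are placeholders and do not
reproduce the exact boosts quoted above. The clauses are simply
separated by spaces, on the assumption that the query parser treats
space-separated clauses as optional ("should") ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class QueryExpansion {

    /**
     * Expands a single term into one boosted clause per (field, language)
     * pair, e.g. "title-fr:segment^1.44".
     *
     * Both maps are hypothetical inputs: base field name -> weight and
     * language code -> weight. Pass maps with a stable iteration order
     * (such as LinkedHashMap) to get a deterministic clause order.
     */
    static String expand(String term,
                         Map<String, Double> fieldWeights,
                         Map<String, Double> languageWeights) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Double> field : fieldWeights.entrySet()) {
            for (Map.Entry<String, Double> lang : languageWeights.entrySet()) {
                // Assumed combination rule: boost = field weight * language weight.
                double boost = field.getValue() * lang.getValue();
                clauses.add(String.format(Locale.ROOT, "%s-%s:%s^%.2f",
                        field.getKey(), lang.getKey(), term, boost));
            }
        }
        // Space-separated clauses; assumed to be parsed as optional clauses.
        return String.join(" ", clauses);
    }
}
```

The language weights would come from the user's language preferences
(the Accept-Language header, for instance), with the preferred language
weighted highest. The term is inserted as-is, so anything with special
query-syntax characters would need escaping first.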
paul
On 31 March 2010 at 18:20, David Vergnaud wrote:
> The second method I've thought of is to have all languages in the
> same index and use different analyzers on fields that require
> analysis. In order to do that, I was thinking of extending the names
> of the fields with the names of the languages -- e.g. "content-en"
> vs "content-fr" vs "content-xx" (for "no language recognized").
> Then, using a customized analyzer, the name of the field would be
> parsed in the tokenStream method and the proper language-dependent
> analyzer would be selected.
> The drawback of this method, as I see it, is that the number of
> fields in the index increases drastically, which in turn means that
> building queries becomes rather cumbersome -- but still doable,
> assuming (which also is the case) that I know the exact list of
> languages I'm dealing with. Also, it means that Lucene would be
> searching in non-existing fields in most documents, as I doubt many
> of them would contain *all* languages. But it keeps the complete
> information about one document gathered in one place and requires
> searching only one index.
>
> As I said, I've already implemented the first method some time ago
> and it works fine. I've only just thought about the second one when
> I read about this PerFieldAnalyzerWrapper, which allows me to do just
> what I want in the second method. Since my index won't be that big
> at first, I doubt either architecture would prove to be much more
> efficient than the other; however, I want to use a scalable design
> right from the start, so I was wondering whether some Lucene gurus
> might give me some insights as to what in their eyes would be the
> better approach -- or whether there might be a different, much
> better technique I haven't thought of.
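
The field-name parsing David describes could look roughly like this in
plain Java. This is only a sketch of the suffix lookup, not of a Lucene
analyzer: the class FieldLanguage and the method languageOf are
hypothetical names, and the convention that the language code follows
the last "-" in the field name is taken from the "content-en" /
"content-xx" examples above.

```java
import java.util.Set;

final class FieldLanguage {

    /** Code used when no known language can be read from the field name. */
    static final String UNKNOWN = "xx";

    /**
     * Returns the language suffix of a field name such as "content-fr",
     * or "xx" when the name has no suffix or the suffix is not in the
     * (assumed, caller-supplied) set of known language codes.
     */
    static String languageOf(String fieldName, Set<String> knownLanguages) {
        int dash = fieldName.lastIndexOf('-');
        if (dash < 0 || dash == fieldName.length() - 1) {
            return UNKNOWN; // no "-xx" style suffix at all
        }
        String suffix = fieldName.substring(dash + 1);
        return knownLanguages.contains(suffix) ? suffix : UNKNOWN;
    }
}
```

A custom analyzer would then use the returned code to pick the
language-specific analyzer to delegate to.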