lucene-java-user mailing list archives

From heikki <tropic...@gmail.com>
Subject Re: Designing a multilingual index
Date Tue, 03 Jan 2012 14:06:44 GMT
hi,

thanks for your response :

> On the web it is often hard to trust such (e.g. because of people working
> in multiple languages, internet cafés...) but... it is your choice.

our web app has a language selector for the user to choose the GUI language

> After?
> Would "shallow matches" in the right language come after "precise matches"
> in a wrong language?

yes, that's the idea. Either that or present the results per language in
separate result sets (with sorting options per result set, etc)

> In both solr and simple lucene, using different fields allows different
> analyzers, that's how you want things (e.g. a different stemmer per
> language).

yes, in the single index solution we do use different analyzers for
different fields

> The important bit is to use query-expansion.
> Given a query of the user (with params or not, with text-queries), expand
> it to a query where the "normal text" is expected to be in the right
> language, but maybe also in one of the other languages (that the browser
> says, that your platform supports), with less weight of course.

we do something like that now in the single-index solution: results in the
requested language are boosted enough that they're always on top
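
A minimal sketch of that kind of expansion, assuming one text field per
language. The field prefix "text_", the boost values, and the helper name
expand_query are hypothetical, not our actual setup. It only builds a query
string in the classic Lucene query parser syntax (field:(terms)^boost) and
assumes the terms are already escaped:

```python
def expand_query(terms, requested_lang, supported_langs,
                 main_boost=10.0, other_boost=1.0, field_prefix="text_"):
    """Expand user terms into one boosted clause per language field.

    Field names and boost values are illustrative assumptions only.
    Terms are assumed to be already escaped for the query parser.
    """
    joined = " ".join(terms)
    clauses = []
    for lang in supported_langs:
        # The requested language gets the larger boost, the others a smaller one.
        boost = main_boost if lang == requested_lang else other_boost
        clauses.append(f"{field_prefix}{lang}:({joined})^{boost:g}")
    return " OR ".join(clauses)


print(expand_query(["river", "basin"], "en", ["en", "fr", "nl"]))
# text_en:(river basin)^10 OR text_fr:(river basin)^1 OR text_nl:(river basin)^1
```

A boost only multiplies the clause score, so whether requested-language hits
really end up on top depends on how large the boost is relative to the score
differences between documents.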

I don't think, though, that this addresses my main point: the frequency of
a term differs from one domain (in this case, language) to another. This
means that if the domains are lumped together in one index, the IDF value
for a term is less "accurate" than if multiple, separate indexes were used.
A term is more or less frequent in one domain or another for a reason.
Relevance ranking is affected by that, and should be more accurate if
separate indexes are used; that seems logical to me.
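
To make the concern concrete, here is a small sketch with made-up numbers.
The document counts are purely hypothetical, and the formula
1 + ln(N / (df + 1)) is assumed here as the classic Lucene-style IDF. The
term "die" is taken as an example of a word that is rare in English text but
very common in German text:

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene-style IDF, assumed here: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts for the term "die" (not measured data).
per_language = {
    "en": {"num_docs": 1000, "doc_freq": 20},   # rare in English documents
    "de": {"num_docs": 1000, "doc_freq": 900},  # very common in German documents
}

# IDF as each separate per-language index would compute it.
separate_idf = {
    lang: idf(c["num_docs"], c["doc_freq"]) for lang, c in per_language.items()
}

# IDF as a single combined index would compute it.
combined_idf = idf(
    sum(c["num_docs"] for c in per_language.values()),
    sum(c["doc_freq"] for c in per_language.values()),
)

for lang, value in separate_idf.items():
    print(f"{lang} only: idf = {value:.3f}")
print(f"combined: idf = {combined_idf:.3f}")
```

With these numbers the combined value falls between the two per-language
values, so the term counts for less among English documents than it would in
an English-only index. Note that with one field per language the statistics
are computed per field, which may already reduce this effect; how much it
matters in practice is exactly my question.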

I just don't know how much impact it really has, and whether it is worth
dealing with by presenting separate result sets from separate index
searches.


thanks for your reply !

Heikki Doeleman





On Tue, Jan 3, 2012 at 2:51 PM, Paul Libbrecht <paul@hoplahup.net> wrote:

>
> On 3 Jan 2012 at 13:56, heikki wrote:
>
> > In our case, it is "known" in which language the user is searching
> > (because he tells us, and if he doesn't, we use the current GUI language).
>
> On the web it is often hard to trust such (e.g. because of people working
> in multiple languages, internet cafés...) but... it is your choice.
>
> > Results are returned so that results in the requested language are ordered
> > on top, and within that, ordered by relevance. Results in other languages
> > are also returned, and presented after the requested-language results,
> > ordered by relevance.
>
> After?
> Would "shallow matches" in the right language come after "precise matches"
> in a wrong language?
>
> > If the results in the requested language contain say one which has term A
> > and one which has term B, their positions in the relevance ranking (within
> > the requested-language results on top) can be influenced by occurrences of
> > terms A and B in the other languages, if a single search is used.
> >
> > I agree to the apples/oranges remark: if a term occurs in more than one
> > language, likely its IDF frequency is different for each language, so to
> > have the best relevance ranking there should be separate indexes for each
> > language. And searches should be really separate searches (no MultiSearcher
> > which would produce combined relevance scores). So the results should also
> > be presented as several, separate result sets.
>
>
> I believe the right solution for this is simple: use different fields per
> language.
>
> In both solr and simple lucene, using different fields allows different
> analyzers, that's how you want things (e.g. a different stemmer per
> language).
>
> Using different indexes is certainly a hassle, different fields not really.
>
> The important bit is to use query-expansion.
> Given a query of the user (with params or not, with text-queries), expand
> it to a query where the "normal text" is expected to be in the right
> language, but maybe also in one of the other languages (that the browser
> says, that your platform supports), with less weight of course.
>
> Query expansion is done by post-processing the result of the query-parser
> in my case.
>
> Then you can also differentiate fields which are precise matches and less:
> make one field with exact match (using the whitespace-tokenizer), one field
> with stemmed match (e.g. using the porter family), one field with phonetic
> matches.
>
> Hope it helps.
>
> paul
>
> > Does anyone have experience with this ? Opinions ? Is the improved relevance
> > per language worth the "hassle" of having separate indexes, doing separate
> > searches and presenting results per language ? We do already take care of
> > using appropriate stopwords/different analyzers when indexing and searching
> > a particular language, but that's a different issue obviously.
>
>
