lucene-java-user mailing list archives

From heikki <tropic...@gmail.com>
Subject Re: Designing a multilingual index
Date Tue, 03 Jan 2012 14:43:33 GMT
hi Paul,

yes, but my concern isn't the term frequency. It's the inverse document
frequency (IDF), which is also used in the relevance score and which takes
into account all documents in the index. In this way the relevance score of
one document is influenced by the contents of all other documents in the
same index. This is why it seems logical to me that relevance scoring is
more accurate if different domains use separate indexes.
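
To make the concern concrete, here is a rough sketch. It assumes the
classic Lucene IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and that
both languages are indexed into the same field when they share an index.
The document counts and the helper names are made up for illustration; they
are not measurements from a real index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-language statistics (illustrative numbers only).
domains = {
    "en": {"num_docs": 1000, "doc_freq": {"firewall": 200}},
    "de": {"num_docs": 100, "doc_freq": {"firewall": 2}},
}


def separate_idf(term, lang):
    """IDF of a term when each language has its own index."""
    stats = domains[lang]
    return classic_idf(stats["doc_freq"].get(term, 0), stats["num_docs"])


def combined_idf(term):
    """IDF of a term when all languages share one index and one field."""
    num_docs = sum(stats["num_docs"] for stats in domains.values())
    doc_freq = sum(stats["doc_freq"].get(term, 0) for stats in domains.values())
    return classic_idf(doc_freq, num_docs)


if __name__ == "__main__":
    for lang in domains:
        print(lang, round(separate_idf("firewall", lang), 3))
    print("combined", round(combined_idf("firewall"), 3))
```

With these made-up counts the term is rare among the German documents and
common among the English ones, and the combined value falls in between, so
it reflects neither domain on its own.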


Kind regards,
Heikki Doeleman




On Tue, Jan 3, 2012 at 3:29 PM, Paul Libbrecht <paul@hoplahup.net> wrote:

> Heikki,
>
> it does solve your main concern: a term in lucene is a pair of a token and
> field name.
> The term frequency is, thus, the frequency of a token in a field.
>
> So the term-frequency of text-stemmed-de:firewall is independent of the
> term-frequency of text-stemmed-en:firewall (for example).
>
> But using the query expansion mechanism, it is likely that both
> term-queries will be present and both contribute to the score. Which is
> correct I think.
>
> paul
>
>
> On 3 Jan 2012, at 15:06, heikki wrote:
> >
> >> The important bit is to use query-expansion.
> >> Given a query of the user (with params or not, with text-queries),
> >> expand it to a query where the "normal text" is expected to be in the
> >> right language, but maybe also in one of the other languages (that
> >> the browser says, that your platform supports), with less weight of
> >> course.
> >
> > something like that we do now in a single index solution - results in the
> > requested language are boosted enough so they're always on top
> >
> > I don't think though that this addresses what is my main point: the
> > frequency of terms in different domains (in this case, different
> > languages) is different for each domain. This means that if the domains
> > are chunked together in one index, the IDF value for a term is less
> > "accurate" than if multiple, separate indexes were used. A term is more
> > or less frequent in one domain or another, for a reason.. Relevance
> > ranking is impacted by that, and is more accurate if separate indexes
> > are used -- I think this seems logical.
> >
> > I just don't know how much impact it really has, and whether it is worth
> > to deal with it by presenting separate result sets from separate index
> > searches ..
>
>
