lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Skewed IDF in multi lingual index, again
Date Thu, 30 Nov 2017 16:35:35 GMT
Unfortunately, this is not what we want. Some customers use filters to restrict results to one
language, but others don't. They want to be able to find documents in all languages, so we use
the user's language preference to rank documents in their own language on top. Very relevant
documents in foreign languages should still surface, which is why the deboost factor is not set too low.

Thanks,
Markus
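
A minimal sketch of the deboost described above, assuming a plain multiplicative factor applied to the relevance score. The `Hit` type, the function name and the sample scores are hypothetical; the thread does not show how the 0.7 factor is wired into Solr.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    doc_id: str
    lang: str
    score: float  # raw relevance score before the language preference is applied


def apply_language_deboost(hits: List[Hit], preferred_lang: str, deboost: float = 0.7) -> List[Hit]:
    """Multiply the score of every hit outside the preferred language by `deboost`."""
    adjusted = [
        hit._replace(score=hit.score * (1.0 if hit.lang == preferred_lang else deboost))
        for hit in hits
    ]
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda h: h.score, reverse=True)


# Hypothetical result list for a user whose preferred language is Dutch.
hits = [
    Hit("nl-1", "nl", 4.0),
    Hit("en-1", "en", 5.0),  # slightly better raw score, foreign language
    Hit("en-2", "en", 9.0),  # much better raw score, foreign language
]
ranked = apply_language_deboost(hits, preferred_lang="nl")
```

With these made-up numbers the mildly better foreign document drops below the Dutch one, while the very relevant foreign document still comes first. A skewed idf inflates the raw foreign score, and a fixed factor cannot always compensate for that.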

 
-----Original message-----
> From:Walter Underwood <wunder@wunderwood.org>
> Sent: Thursday 30th November 2017 17:29
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index, again
> 
> I’ve occasionally considered using Unicode language tags (U+E0001 and friends) on each
> term. That would make a term specific to a language, so we would get [en]LaserJet, [fr]LaserJet,
> [de]LaserJet, and so on. But that is a pretty big hammer, because it restricts matches to
> the same language. If the entire document is in one language, might as well use a filter query
> for that language. The tags would work for multiple languages in one document.
> 
> Maybe make the untagged term a synonym. For cross-language terms like “LaserJet”,
> the untagged one would have worse idf.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > 
> > Hello,
> > 
> > We already discussed this problem five years ago [1]. In short: documents in foreign
> > languages are scored higher for some terms.
> > 
> > It was solved back then by using docCount instead of maxDoc when calculating idf,
> > and it worked really well! But, probably due to index changes, the problem is back for
> > some terms, mostly proper nouns, just like five years ago.
> > 
> > We already deboost documents that are not in the user's preferred language by 0.7,
> > but in some cases that is not enough. I could reduce that boost further, but that's not
> > what I would prefer.
> > 
> > I'd like to know if there are additional tricks to solve the problem.
> > 
> > Many thanks!
> > Markus
> > 
> > [1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
> 
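
To illustrate the docCount versus maxDoc fix mentioned in the original message, here is a rough sketch using the classic Lucene TF-IDF shape, idf = 1 + ln((N + 1) / (docFreq + 1)). The per-language field names, the index sizes and the document frequencies are made-up assumptions; the thread does not describe the schema or give any numbers.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style idf: 1 + ln((num_docs + 1) / (doc_freq + 1))."""
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))


# Hypothetical index: 1,000,000 documents in total (maxDoc).
max_doc = 1_000_000

# Assumed per-language fields. doc_count is the number of documents that have
# the field at all, doc_freq the number that contain the proper noun.
fields = {
    "content_nl": {"doc_count": 900_000, "doc_freq": 900},
    "content_en": {"doc_count": 100_000, "doc_freq": 100},
}

# idf computed against the whole index versus against the field's own documents.
idf_max_doc = {name: classic_idf(s["doc_freq"], max_doc) for name, s in fields.items()}
idf_doc_count = {name: classic_idf(s["doc_freq"], s["doc_count"]) for name, s in fields.items()}

# Gap between the minority-language idf and the majority-language idf.
gap_max_doc = idf_max_doc["content_en"] - idf_max_doc["content_nl"]
gap_doc_count = idf_doc_count["content_en"] - idf_doc_count["content_nl"]

print(f"idf gap with maxDoc:   {gap_max_doc:.3f}")
print(f"idf gap with docCount: {gap_doc_count:.3f}")
```

The term is equally common within each language here (about 1 in 1000 documents), yet with maxDoc the minority-language field gets a much higher idf. With docCount the two values nearly coincide, which is the effect the fix relied on.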
> 
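
The language-tagged terms suggested in the reply could look roughly like the sketch below. It uses the bracket notation from the message ("[en]LaserJet") rather than actual Unicode tag characters, and the function name and sample documents are hypothetical. The untagged form is emitted at the same position, the way a synonym would be.

```python
from collections import Counter
from typing import List, Tuple


def tag_terms(tokens: List[str], lang: str, keep_untagged: bool = True) -> List[Tuple[int, str]]:
    """Return (position, term) pairs with a language-tagged form of each token.

    When keep_untagged is True the plain token is added at the same position,
    so it behaves like a synonym of the tagged term.
    """
    out = []
    for position, token in enumerate(tokens):
        out.append((position, f"[{lang}]{token}"))
        if keep_untagged:
            out.append((position, token))
    return out


tagged = tag_terms(["LaserJet", "printer"], "en")

# Document frequency per term over three hypothetical single-language documents.
docs = [("en", ["LaserJet"]), ("fr", ["LaserJet"]), ("de", ["LaserJet"])]
doc_freq = Counter()
for lang, tokens in docs:
    # Use a set so each term counts once per document.
    doc_freq.update({term for _, term in tag_terms(tokens, lang)})
```

Each tagged term gets its own document frequency within its language, while the untagged "LaserJet" accumulates matches across all languages and therefore ends up with the lower idf, as the reply points out.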
