Subject: Skewed IDF in multi lingual index
From: Markus Jelsma
To: solr-user@lucene.apache.org
Date: Thu, 8 Nov 2012 16:13:41 +0000
Hi,

We're testing a large multilingual index with _LANG fields for each language, and we use dismax to query them all. Users provide language preferences, explicitly or implicitly, which we use for either additive or multiplicative boosting on the language of the document.

However, additive boosting is not adequate because it cannot overcome the extremely high IDF values for the same word in another language, so regardless of the preference, foreign documents are returned. Multiplicative boosting solves this problem but has a downside of its own: because the multiplicative boost is so strong, we can no longer use the standard qf=field^boost to prefer documents in another language above the preferred language.

We do use the def function (boost=def(query($qq),.3)) to prevent a single boost query from returning 0 and thereby making the product of all boost queries 0, but it doesn't help that much.

This all comes down to IDF differences between the languages. Even common words such as country names like `india` show large differences in IDF.

Does anyone have hints or experiences to share about skewed IDF in such an index?

Thanks,
Markus
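
To make the skew concrete, here is a rough Python sketch; it is not taken from our setup. It assumes Lucene's classic TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), with numDocs counted over the whole index and docFreq counted per field. The field names, document counts, boost values, and the use of raw IDF as a stand-in for the full dismax score are all made-up assumptions for illustration only.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the style of Lucene's classic similarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index: 10M documents in total, most of them English.
NUM_DOCS = 10_000_000

# Hypothetical per-field document frequencies for the term "india".
# The Dutch field holds far fewer documents, so the same word looks "rare" there.
DOC_FREQ = {"content_en": 200_000, "content_nl": 2_000}

# Simplification: treat the IDF itself as the base relevance score per field.
idf = {field: classic_idf(df, NUM_DOCS) for field, df in DOC_FREQ.items()}

PREFERRED = "content_en"   # the user's preferred language (assumed)
ADDITIVE_BOOST = 2.0       # hypothetical additive boost value
MULTIPLICATIVE_BOOST = 3.0 # hypothetical multiplicative boost value

# Additive boosting: a constant is added to documents in the preferred language.
additive = {
    field: score + (ADDITIVE_BOOST if field == PREFERRED else 0.0)
    for field, score in idf.items()
}

# Multiplicative boosting: the score of preferred-language documents is scaled.
multiplicative = {
    field: score * (MULTIPLICATIVE_BOOST if field == PREFERRED else 1.0)
    for field, score in idf.items()
}

for name, scores in (("idf", idf), ("additive", additive), ("multiplicative", multiplicative)):
    ranked = sorted(scores, key=scores.get, reverse=True)
    print(f"{name:15s} top field: {ranked[0]}  scores: {scores}")
```

With these assumed numbers the Dutch field still wins under the additive boost, while the multiplicative boost flips the order. It also shows why the multiplier dominates: once it is large enough to close the IDF gap, a small per-field weight has little room to pull a non-preferred language back up.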