Subject: Skewed IDF in multi lingual index, again
From: Markus Jelsma
To: Solr-user
Reply-To: solr-user@lucene.apache.org
Date: Thu, 30 Nov 2017 16:14:38 +0000

Hello,

We discussed this problem five years ago [1]. In short: documents in foreign languages are scored higher for some terms. Back then it was solved by using docCount instead of maxDoc when calculating idf, and that worked really well.

However, probably due to changes in the index, the problem is back for some terms, mostly proper nouns, just like five years ago. We already deboost documents that are not in the user's preferred language by 0.7, but in some cases that is not enough. I could reduce that boost further, but I would prefer not to.

Are there any additional tricks to solve this problem?

Many thanks!
Markus

[1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
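
For readers unfamiliar with the docCount versus maxDoc distinction mentioned above, here is a minimal sketch of the arithmetic. It assumes a classic TF-IDF style idf of the form 1 + ln((N + 1) / (df + 1)) and one field per language. All numbers, the field names "nl" and "de", and the helper names are hypothetical and are not taken from the index described in this post.

```python
import math


def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic TF-IDF style idf: 1 + ln((N + 1) / (df + 1))."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical numbers for illustration only.
MAX_DOC = 1_000_000                         # all documents in the index
DOC_COUNT = {"nl": 900_000, "de": 100_000}  # documents that have each language field
DOC_FREQ = {"nl": 900, "de": 10}            # documents containing the proper noun


def idf_gap(use_doc_count: bool) -> float:
    """idf in the minority-language field minus idf in the majority-language field."""
    n_nl = DOC_COUNT["nl"] if use_doc_count else MAX_DOC
    n_de = DOC_COUNT["de"] if use_doc_count else MAX_DOC
    return classic_idf(DOC_FREQ["de"], n_de) - classic_idf(DOC_FREQ["nl"], n_nl)


if __name__ == "__main__":
    print(f"idf gap with maxDoc:   {idf_gap(False):.3f}")
    print(f"idf gap with docCount: {idf_gap(True):.3f}")
```

Under these assumed numbers, switching N from maxDoc to the per-field docCount narrows the idf gap between the two fields but does not close it, because the term is still relatively rarer in the minority-language field (10 of 100,000 documents versus 900 of 900,000). That would be consistent with the skew reappearing for proper nouns, though it is only an illustration and not a diagnosis of this index.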