Subject: Re:
 Multi-field IDF
To: java-user@lucene.apache.org
From: Will Martin
Date: Thu, 17 Nov 2016 21:21:30 -0500

Are you familiar with pivoted normalized document length, in practice or theory? Or with Croft's recent work on relevance algorithms that account for structured field presence?

On 11/17/2016 5:20 PM, Nicolás Lichtmaier wrote:
> That depends on what you want. In this case I want to use a
> discrimination power based on all the body text, not just the titles.
> Otherwise, terms that are really not that relevant end up with
> very high values!
>
> On 17/11/16 at 18:25, Ahmet Arslan wrote:
>> Hi Nicholas,
>>
>> IDF, among others, is a measure of term specificity. If 'or' is not
>> so usual in titles, then it has some discrimination power in that
>> domain.
>>
>> I think it's OK for 'or' to get a high IDF value in this case.
>>
>> Ahmet
>>
>> On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier
>> wrote:
>> IDF measures the selectivity of a term. But the calculation is
>> per-field. That can be bad for very short fields (like titles). One
>> example of this problem: if I don't delete stop words, then "or", "and",
>> etc. should get low IDF values; however, "or" is, perhaps, not
>> so usual in titles. Then "or" will have a high IDF value and be treated
>> as an important term. That's bad.
>>
>> One solution I see is to modify the Similarity to have a global, or
>> multi-field, IDF value. This value would include in its calculation
>> longer fields that have more "normal text"-like stats. However, this is
>> not trivial because I can't just add document frequencies (I would be
>> counting some documents several times if "or" is present in more than
>> one field). I would need to OR the bit-vectors that signal the
>> presence of the term, right? Not trivial.
>>
>> Has anyone encountered this issue? Has it been solved? Is my thinking
>> wrong?
>>
>> Should I also try the developers' list?
>>
>> Thanks!
>>
>> Nicolás.-
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
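
For illustration only, here is a minimal sketch of the double-counting problem and the OR-the-bit-vectors idea from the original question. It is plain Python, not Lucene code, and nothing in it comes from the thread itself: the toy 8-document index, the helper names `idf` and `doc_freq`, and the choice of a BM25-style IDF formula are all assumptions made to keep the example small. Integers stand in for the per-field bit vectors.

```python
import math

def idf(doc_freq, num_docs):
    """BM25-style IDF, ln(1 + (N - n + 0.5) / (n + 0.5)); an assumed formula."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def doc_freq(bits):
    """Count the set bits, i.e. the documents that contain the term."""
    return bin(bits).count("1")

# Hypothetical toy index of 8 documents. Bit i is set when document i
# contains the term "or" in that field.
NUM_DOCS = 8
or_in_title = 0b00000001  # "or" occurs in one title (document 0)
or_in_body = 0b01111011   # "or" occurs in six bodies, including document 0

# Adding the per-field frequencies counts document 0 twice.
summed = doc_freq(or_in_title) + doc_freq(or_in_body)

# OR-ing the bit vectors first counts every document at most once.
combined = doc_freq(or_in_title | or_in_body)

# Title-only IDF versus an IDF computed over the union of both fields.
title_idf = idf(doc_freq(or_in_title), NUM_DOCS)
multi_field_idf = idf(combined, NUM_DOCS)

print(f"summed df={summed}, combined df={combined}")
print(f"title idf={title_idf:.3f}, multi-field idf={multi_field_idf:.3f}")
```

In this toy data the title-only IDF for "or" is much higher than the multi-field one, which is the effect described in the question. Whether a real Similarity should use the union, and how to compute it efficiently per segment, is left open here.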