Message-ID: <4F59DE87.10507@fastmail.fm>
Date: Fri, 09 Mar 2012 10:42:15 +0000
From: Paul Taylor
Reply-To: paul_t100@fastmail.fm
To: java-user@lucene.apache.org
Subject: There is a mismatch between the score for a wildcard match and an exact match

There is a mismatch between the score for a wildcard match and an exact match.

I search for

    recording:luve OR recording:luve*

and here is the Explain output from the search:

    DocNo:0:1.4196585:11111111-1cf0-4d1f-aca7-2a6f89e34b36
    1.4196585 = (MATCH) max plus 0.1 times others of:
      0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
        1.0 = boost
        0.3763506 = queryNorm
      1.3820235 = (MATCH) weight(recording:luve in 0), product of:
        0.7211972 = queryWeight(recording:luve), product of:
          1.9162908 = idf(docFreq=1, maxDocs=5)
          0.3763506 = queryNorm
        1.9162908 = (MATCH) fieldWeight(recording:luve in 0), product of:
          1.0 = tf(termFreq(recording:luve)=1)
          1.9162908 = idf(docFreq=1, maxDocs=5)
          1.0 = fieldNorm(field=recording, doc=0)

    DocNo:1:0.3763506:22222222-1cf0-4d1f-aca7-2a6f89e34b36
    0.3763506 = (MATCH) max plus 0.1 times others of:
      0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
        1.0 = boost
        0.3763506 = queryNorm

In my test I have 5 documents: one contains an exact match, another a wildcard match, and the other three do not match at all. The score of the exact match is *1.4* compared to *0.37* for the wildcard match; that's nearly a factor of *4*. With a much larger index, the gap between an exact match on a rare term and a wildcard match would be even bigger.

The whole difference is due to the different scoring mechanism used for wildcard and exact matches: wildcards don't take tf/idf or lengthnorm into account, you just get a constant score for each match. Now I'm not bothered about tf or lengthnorm, in my data domain they don't make much difference, but the *idf* score is a real killer. Because the matching term is found in only one of the 5 documents, and idf enters the score twice (once in queryWeight and once in fieldWeight), its contribution is idf squared, i.e. about *3.67*.

I know this constant score is quicker than calculating tf*idf*lengthnorm for each wildcard match, but it doesn't make sense to me for the idf to contribute so much to the score. I also know I can change the rewrite method, but there are two problems with this:

1. Scoring rewrite methods perform less well because they calculate idf, tf and lengthnorm. idf is the only value I need.
2. The ones that do calculate a score don't make much sense either, as they would calculate the idf of the matching term even though this isn't what was actually searched for, and this term could be rarer than what I was actually searching for, possibly boosting it higher than the exact match.
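For reference, the figures in the Explain output above can be reproduced by hand. The following is only a plain-Java sketch with no Lucene dependency: the class and method names are made up for illustration, the idf formula is the classic 1 + ln(maxDocs / (docFreq + 1)), and the "max plus 0.1 times others" combination is assumed to be a DisjunctionMax-style query with a tie-breaker of 0.1. The queryNorm value is simply copied from the output.

```java
// Plain-Java sketch of the arithmetic in the Explain output (hypothetical names, no Lucene calls).
class ExplainArithmetic {

    // Classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // "max plus tieBreaker times others": best clause score plus a fraction of the rest.
    static double maxPlusTieBreaker(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double queryNorm = 0.3763506;        // copied from the Explain output
        double idf = idf(1, 5);              // docFreq=1, maxDocs=5

        double queryWeight = idf * queryNorm;       // idf appears here ...
        double fieldWeight = 1.0 * idf * 1.0;       // ... and again here: tf * idf * fieldNorm
        double exact = queryWeight * fieldWeight;   // so the exact match carries idf squared
        double wildcard = 1.0 * queryNorm;          // ConstantScore: boost * queryNorm

        double doc0 = maxPlusTieBreaker(0.1, wildcard, exact); // matches both clauses
        double doc1 = maxPlusTieBreaker(0.1, wildcard);        // matches the wildcard only

        System.out.printf("idf=%.7f idf^2=%.4f%n", idf, idf * idf);
        System.out.printf("doc0=%.7f doc1=%.7f ratio=%.2f%n", doc0, doc1, doc0 / doc1);
    }
}
```

The point of the sketch is that the only difference between the two clauses is the idf squared factor on the exact match; everything else is either 1.0 or the shared queryNorm.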
(I could also change the similarity class to override the idf calculation so it always returns 1, but that doesn't make sense because the idf is very useful for comparing exact matches on different words, i.e. for

    recording:luve OR recording:luve* OR recording:the OR recording:the*

I would want matches to *luve* to score higher than matches to the common word *the*.)

So does a rewrite method already exist, or is it possible to write one, that just calculates the idf of the term it was trying to match? For example, in this case I search for 'luve' and the wildcard matches on 'luvely'; it would multiply the luvely match by the squared idf of luve (about 3.67). This way my wildcard match would be comparable to the exact match, and I could just change my query to boost the exact match slightly, so that the exact match would always score higher than the wildcard match but not too much higher, i.e.

    recording:luve^1.2 OR recording:luve*

and with this mythical rewrite method this would give (depending on queryNorm):

  * Doc 0:0:1.692
  * Doc 1:0:1.419
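To make the idea concrete, here is a rough plain-Java sketch of the scoring I have in mind. It does not use or mirror any real Lucene rewrite method; every name in it is hypothetical. It assumes tf and fieldNorm are 1.0, that queryNorm stays at the value from the Explain output (in practice it would change with the boost), and that the two clauses are still combined as "max plus 0.1 times others". Because of those assumptions the absolute numbers need not match the rough figures above; what matters is the ordering and the size of the gap.

```java
// Plain-Java sketch of the wished-for scoring (hypothetical names, not a Lucene API).
class SearchedTermIdfSketch {

    // Classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Exact term match: boost * idf^2 * queryNorm, with tf and fieldNorm taken as 1.0.
    static double exactScore(double boost, double idf, double queryNorm) {
        return boost * idf * idf * queryNorm;
    }

    // Hypothetical wildcard score: still constant per match, but scaled by the
    // idf of the term that was searched for (luve), not of the term that matched (luvely).
    static double wildcardScore(double idfOfSearchedTerm, double queryNorm) {
        return idfOfSearchedTerm * idfOfSearchedTerm * queryNorm;
    }

    public static void main(String[] args) {
        double queryNorm = 0.3763506;  // assumption: held fixed at the Explain value
        double idfLuve = idf(1, 5);    // idf of the searched term in the 5-document test

        double wildcard = wildcardScore(idfLuve, queryNorm);
        double exact = exactScore(1.2, idfLuve, queryNorm);  // recording:luve^1.2

        double doc0 = exact + 0.1 * wildcard;  // doc 0 matches both clauses
        double doc1 = wildcard;                // doc 1 matches the wildcard only

        System.out.printf("doc0=%.4f doc1=%.4f ratio=%.2f%n", doc0, doc1, doc0 / doc1);
    }
}
```

With this kind of scoring the exact match stays ahead only because of the small boost and the tie-breaker, rather than by the full idf squared factor it gets today.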