From: Mark Miller
Date: Sun, 09 Mar 2008 14:39:31 -0400
To: java-user@lucene.apache.org
Subject: Re: Scoring a query with OR's

I have been trying to understand all of this better myself, so while I
am no expert, here is my take:

Lucene is really a combined Vector Space / Boolean Model search engine.
At its core it is a Vector Space Model engine: scoring is done by
comparing a query term vector to each of the document term vectors. On
top of this, Lucene supports a Boolean Model by constraining the
results with a BooleanQuery. So when Lucene computes the score for
"mark OR mandy", the idea is the same as for "mark AND mandy".
The difference is that the BooleanQuery treats Must and Should clauses
differently: if a term is labeled Must but is not in the document, the
document won't match. If a Should term is not in the document, the
BooleanQuery does not exclude the document on that account; the term
simply contributes 0 towards the similarity score. The BooleanQuery
kind of clamps down on top of the Vector Space term vector similarity
scoring, allowing for a hybrid system.

The coord factor essentially boosts the term vector similarity score
based on how many query terms are in the document. Term overlap is
already taken into account during the term vector similarity part, but
apparently users don't like how that ranks, e.g. users intuitively
think that sharing more terms between document and query is more
important than sharing fewer, very highly weighted terms. So basically,
coord just tries to reorder things a bit based on reported user
expectations.

- Mark

Ghinwa Choueiter wrote:
> but shouldn't the coord factor kick in with AND instead of OR? I
> understand why you would want to use coord in the case of AND, where
> you reward more the documents that contain most of the terms in the
> query. However in the case of OR, it should not matter if all the OR
> operands are in the document?
>
> -Ghinwa
>
> ----- Original Message ----- From: "Erik Hatcher"
> To:
> Sent: Sunday, March 09, 2008 1:22 PM
> Subject: Re: Scoring a query with OR's
>
>
>>
>> On Mar 9, 2008, at 12:39 PM, Ghinwa Choueiter wrote:
>>> but what exactly happens when there are OR's, for eg. (life OR
>>> place OR time)
>>>
>>> The scoring equation can get a score for life, place, time
>>> separately, but what does it do with them then? Does it also add them.
>>
>> The coord factor kicks in then:
>>
>> <...apache/lucene/search/DefaultSimilarity.html#coord(int,%20int)>
>>
>> the formula listed here should help too:
>>
>> <...apache/lucene/search/Similarity.html>
>>
>> Erik
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
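
To make the Must/Should and coord discussion above a bit more concrete,
here is a minimal toy sketch in Python. It is not Lucene's API or its
real scoring formula: the Clause class, the score() function, and the
per-term weights are all made up for illustration, and things like
query and field normalization are left out. The only part taken from
the thread is the idea behind coord(overlap, maxOverlap), which as far
as I understand rewards documents by the fraction of query clauses they
match.

```python
from dataclasses import dataclass

MUST, SHOULD = "MUST", "SHOULD"  # simplified stand-ins for Lucene's clause types


@dataclass(frozen=True)
class Clause:
    term: str
    occur: str     # MUST or SHOULD
    weight: float  # hypothetical per-term score, standing in for the tf-idf part


def coord(overlap, max_overlap):
    """Fraction of the query clauses that the document matched."""
    return overlap / max_overlap


def score(doc_terms, clauses, use_coord=True):
    """Return a toy score, or None if the document does not match."""
    matched = [c for c in clauses if c.term in doc_terms]
    # Boolean part: a missing MUST term rejects the document outright.
    if any(c.occur == MUST and c.term not in doc_terms for c in clauses):
        return None
    # A document that matches no clause at all is not a hit either.
    if not matched:
        return None
    # Vector-space part (toy version): missing SHOULD terms contribute 0.
    raw = sum(c.weight for c in matched)
    return raw * coord(len(matched), len(clauses)) if use_coord else raw


# "life OR place OR time", with "time" given a made-up high weight.
query = [
    Clause("life", SHOULD, 1.0),
    Clause("place", SHOULD, 1.0),
    Clause("time", SHOULD, 3.0),
]
doc_a = {"life", "place"}  # shares two lightly weighted terms
doc_b = {"time"}           # shares one heavily weighted term

for name, doc in (("A", doc_a), ("B", doc_b)):
    print(name, score(doc, query, use_coord=False), score(doc, query))
```

With these made-up weights, document B ranks first on the raw sum, but
document A ranks first once coord is applied, which is the kind of
reordering described above. Changing the clauses to MUST shows the
Boolean side: a document missing any MUST term gets None instead of a
score.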