From: Matthew Hall
Date: Tue, 30 Jun 2009 14:45:53 -0400
To: java-user@lucene.apache.org
Subject: Re: Query which gives high score proportional to 'distinct term matches'
Message-ID: <4A4A5D61.2080508@informatics.jax.org>
In-Reply-To: <24276724.post@talk.nabble.com>

Well, we have a very similar requirement here, but for us it's for every
single field that we wanted this kind of behavior.

We got this in by eliminating the TF (term frequency) contribution to the
score via a custom Similarity. (Which is very easy to do.)

I... think in the newer versions of Lucene you can omit TF more
programmatically at query time, but I don't recall if you could do it on a
per-field basis.

Anyone else want to speak on this a bit better?

Matt

chandrakant k wrote:
> I have an index which has got fields like
>
> title :
> content :
>
> If I search for, let's say, obama fly, then the documents having obama and
> fly should be given high scores irrespective of the number of times they may
> occur. This requirement is for fields - title and content.
>
> The implementation which I did with a simple OR query will score high the
> documents having, e.g., more occurrences of 'obama' even if they have no
> occurrence of the word 'fly'. The tf for 'obama' in this case is higher, so
> even if the word 'fly' is not present the document is scored higher.
>
> Expected behaviour is that -
> (a) documents having 'obama' and 'fly' both should be scored higher in
> order of their tf.
> (b) documents having either of the terms should be given scores, but less
> than those matched in (a)
>
> I tried overriding coord() in a custom Similarity implementation and
> boosting it if multiple terms match, but what I see is that coord() gets
> boosted even if the same word matches in multiple fields (say obama is
> present in title: and content: ).
>
> Searching for solutions, I have not got any results which talk about a
> similar requirement... I guess I am not using the right keywords....
>
> Thanks
> Chandrakant K.

-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012
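
For anyone reading this in the archive: here is a rough sketch of why flattening TF, as described in the reply above, changes the ordering for the 'obama fly' case. It is a toy model, not Lucene code. The class name FlatTfSketch and its methods are made up for illustration, and idf, norms and coord() are deliberately left out. The square-root curve mirrors what DefaultSimilarity.tf(float) returns in Lucene 2.x; in a real index the equivalent change would presumably be a DefaultSimilarity subclass that overrides tf(float) to return 1 for any non-zero frequency.

```java
// Toy model of the per-term tf factor only; hypothetical names, not Lucene API.
class FlatTfSketch {

    // Default-style factor: grows with the square root of the term count.
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    // Flattened factor: 1 if the term occurs at all, 0 otherwise.
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    // Sums the tf factor over the query terms for one document.
    // termFreqs[i] is how often query term i occurs in the document.
    static double score(int[] termFreqs, boolean flat) {
        double sum = 0.0;
        for (int freq : termFreqs) {
            sum += flat ? flatTf(freq) : sqrtTf(freq);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] obamaOnly = {9, 0}; // 'obama' nine times, 'fly' never
        int[] both = {1, 1};      // 'obama' once, 'fly' once

        // With the square-root curve the single-term document wins.
        System.out.println(score(obamaOnly, false) > score(both, false)); // true

        // With tf flattened the document matching both terms wins.
        System.out.println(score(both, true) > score(obamaOnly, true));   // true
    }
}
```

Note the trade-off: with tf flattened, two documents that match the same set of terms get the same tf contribution, so any ordering among them has to come from the other scoring factors.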