From: "Russell M. Allen"
Reply-To: java-user@lucene.apache.org
Subject: Scoring a document (count?)
Date: Thu, 27 Jul 2006 12:02:46 -0400

I am curious about the potential use of document scoring as a means to extract additional data from an index.
Specifically, I would like the score to be a count of how many times a particular field matched a set of terms.

For example, I am indexing movie-stars (each document is a movie-star). A movie-star has a number of fields, such as name, movies they have been in, etc. I want to produce an 'index' of stars by name and show how many movies matching a filter each star has appeared in. In natural language my query might be:

"List all stars who have appeared in a 'horror' movie, where last name starts with A, and tell me how many horror movies they were in."

My search will look something like this:

    "+lastName:A* +movie:(1 7 21 58 92)"  //where movie is a previously computed list of 'horror' movie ids

If my index contained the following documents:

    doc1 = lastName:Anna movie:{3 10}
    doc2 = lastName:Aba movie:{1 10 12}
    doc3 = lastName:Addd movie:{3 21 55 92}
    doc4 = lastName:Baaa movie:{7 56}

I would like to get back:

    doc2, score of 1  //score of 1 because only movie 1 matched
    doc3, score of 2  //score of 2 because movies 21 and 92 matched

Currently, we perform an initial query against our Star index to retrieve a list of stars. Then we perform N queries against a separate movie index to count the number of movies that match our sub filter 'horror'. This is obviously very inefficient, and as I've shown above, the information (count) is available during the primary query.

Thoughts?
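To pin down the result being asked for, here is a minimal sketch in plain Java that reproduces the example counts by intersecting each star's movie ids with the 'horror' id list. It makes no Lucene calls and does not show how a Lucene score could be made to carry this value. The class, the `Star` type, and the method names are all hypothetical, chosen only for illustration, and the data is the four sample documents from the message.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: an in-memory stand-in for the index, not Lucene code.
class MatchCountSketch {

    /** One "document": a star with a last name and a set of movie ids. */
    static final class Star {
        final String id;
        final String lastName;
        final Set<Integer> movies;

        Star(String id, String lastName, Set<Integer> movies) {
            this.id = id;
            this.lastName = lastName;
            this.movies = movies;
        }
    }

    /** Counts how many of the star's movie ids appear in the filter set. */
    static int countMatches(Set<Integer> movies, Set<Integer> filter) {
        int count = 0;
        for (Integer movieId : movies) {
            if (filter.contains(movieId)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Emulates "+lastName:A* +movie:(...)": both clauses are required, and the
     * value stored for each hit is the number of matching movie ids.
     */
    static Map<String, Integer> search(List<Star> stars, String prefix, Set<Integer> filter) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (Star star : stars) {
            if (!star.lastName.startsWith(prefix)) {
                continue; // required lastName clause failed
            }
            int count = countMatches(star.movies, filter);
            if (count > 0) { // required movie clause needs at least one match
                hits.put(star.id, count);
            }
        }
        return hits;
    }

    /** The four sample documents from the message. */
    static List<Star> sampleStars() {
        return List.of(
            new Star("doc1", "Anna", Set.of(3, 10)),
            new Star("doc2", "Aba", Set.of(1, 10, 12)),
            new Star("doc3", "Addd", Set.of(3, 21, 55, 92)),
            new Star("doc4", "Baaa", Set.of(7, 56)));
    }

    public static void main(String[] args) {
        Set<Integer> horror = Set.of(1, 7, 21, 58, 92);
        // prints {doc2=1, doc3=2}
        System.out.println(search(sampleStars(), "A", horror));
    }
}
```

doc1 drops out because none of its movies are in the horror list, and doc4 drops out because its last name does not start with A, which leaves doc2 with a count of 1 and doc3 with a count of 2, matching the desired output above.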