Subject: Scoring a document (count?)
Date: Thu, 27 Jul 2006 12:02:46 -0400
From: "Russell M. Allen" <Russell.Allen@aebn.net>
To: <java-user@lucene.apache.org>

I am curious about the potential use of document scoring as a means to
extract additional data from an index.  Specifically, I would like the
score to be a count of how many times a particular field matched a set
of terms.

For example, I am indexing movie-stars (each document is a movie-star).
A movie-star has a number of fields, such as name, movies they have been
in, etc.  I want to produce an 'index' of stars by name and show, for
each star, how many of the movies they have appeared in match a filter.

In natural language my query might be:
	"List all stars who have appeared in a 'horror' movie, where
	last name starts with A, and tell me how many horror movies
	they were in."

My search will look something like this:
	"+lastName:A* +movie:(1 7 21 58 92)"
	// where movie is a previously computed list of 'horror' movie ids

If my index contained the following documents:
    doc1 = lastName:Anna   movie:{3 10}
    doc2 = lastName:Aba    movie:{1 10 12}
    doc3 = lastName:Addd   movie:{3 21 55 92}
    doc4 = lastName:Baaa   movie:{7 56}

I would like to get back:
    doc2, score of 1	//score of 1 because only movie 1 matched
    doc3, score of 2	//score of 2 because movies 21 and 92 matched

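To make the desired semantics concrete, here is a minimal sketch in
plain Java of the count I am after.  It uses no Lucene API at all; the
MatchCount class, the Star holder, and countMatches are hypothetical
names invented for illustration, and the in-memory list simply stands in
for the four example documents above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MatchCount {
    // Hypothetical in-memory stand-in for one indexed star document.
    static final class Star {
        final String id;
        final String lastName;
        final Set<Integer> movies;

        Star(String id, String lastName, Set<Integer> movies) {
            this.id = id;
            this.lastName = lastName;
            this.movies = movies;
        }
    }

    // Returns docId -> number of the star's movie ids found in the filter.
    // A star is kept only if the last name starts with the prefix AND at
    // least one movie matches, mirroring the two required (+) clauses.
    static Map<String, Integer> countMatches(List<Star> stars, String prefix,
                                             Set<Integer> filter) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Star s : stars) {
            if (!s.lastName.startsWith(prefix)) {
                continue; // fails +lastName:A*
            }
            int n = 0;
            for (int movieId : s.movies) {
                if (filter.contains(movieId)) {
                    n++;
                }
            }
            if (n > 0) { // fails +movie:(...) when nothing matched
                counts.put(s.id, n);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Star> stars = List.of(
            new Star("doc1", "Anna", Set.of(3, 10)),
            new Star("doc2", "Aba", Set.of(1, 10, 12)),
            new Star("doc3", "Addd", Set.of(3, 21, 55, 92)),
            new Star("doc4", "Baaa", Set.of(7, 56)));
        Set<Integer> horror = Set.of(1, 7, 21, 58, 92);
        System.out.println(countMatches(stars, "A", horror)); // {doc2=1, doc3=2}
    }
}
```

The question is whether the score of the primary query can be made to
carry this per-document count directly, rather than computing it in a
separate pass like the one sketched here.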

Currently, we perform an initial query against our Star index to
retrieve a list of stars.  Then we perform N queries against a separate
movie index to count the number of movies that match our sub-filter
('horror').  This is obviously very inefficient, and as I've shown
above, the information (the count) is available during the primary
query.

Thoughts?

