Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (herse.apache.org: domain of deinspanjer@gmail.com
 designates 66.249.92.169 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=beta;
        h=received:message-id:date:from:to:subject:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
        b=nFAfs3IdS7dCwN5tEToTEt8nQQJJ2l5SG69b7Wgbkj8QQk4Ej1ZZhf855YM0byJMe6TchqHBXZHSfuEV7aSE/HkbuC8ds/Vo4PNTo0Co6ruV4KCcxZIka64McZc6BPEthyAEjuJCLvWjVJ0rgOFyfyerJ5dIhq6TSANL8QaifS0=
Message-ID: <f81fa4180704101703x3c48d42ei529da26c6c8132ac@mail.gmail.com>
Date: Tue, 10 Apr 2007 20:03:32 -0400
From: "Daniel Einspanjer" <deinspanjer@gmail.com>
To: java-user@lucene.apache.org
Subject: Ideas for a relevance score that could be considered stable across
 multiple searches with the same query structure?
In-Reply-To: <f81fa4180704100504q142073c2n5ad0e88f0fff7579@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <f81fa4180704100504q142073c2n5ad0e88f0fff7579@mail.gmail.com>

I asked this question on the Solr user list because that is the
current lucene server implementation I'm using, but I didn't get any
feedback there and the problem isn't really Solr specific so I thought
I'd cross post here just in case any non-Solr users might have some
ideas.

Thank you very much for your time,
Daniel

---------- Forwarded message ----------
From: Daniel Einspanjer <deinspanjer@gmail.com>
Date: Apr 10, 2007 8:04 AM
Subject: Ideas for a relevance score that could be considered stable
across multiple searches with the same query structure?
To: solr-user@lucene.apache.org


I did a bit of research on the list for prior discussions of
normalized scores and such.  Please forgive me if I overlooked
something relevant, but I didn't see anything exactly what I'm looking
for.

I am building a replacement for our current text matching engine that
takes a list of documents from feed A and finds the best match for
each of those in the list of documents from feed B.  For purposes of
this example, feed A and B might have the fields:
     title; director; year

The people reviewing this matching process need some way of
determining why a particular match was made other than the overall
score.  Was it because the title was a perfect match or was it because
the title wasn't that close, but the director and year were dead on?

The current idea I have for a strategy to provide this information
would be to run my query four times (n + 1 where n is each scoring
section), once to find the overall best match (a regular query) then
each additional query grouping, requiring, and boosting a different
section of the query. I would then store the rank of the "best" item
returned by the overall query.  That rank can be used to indicate the
relevance of that item based on the defined criteria.

So, following the indexes mentioned above, my queries would be:

The natural "overall" query:
(title:"feed A item one title"^10 (+title:feed~ +title:A~ +title:item~
+title:one~ +title:title~)) director:"Director, Feed A." (year:1974^10
year:[1972 TO 1976])

The query for title relevance:
+((title:"feed A item one title"^10 (+title:feed~ +title:A~
+title:item~ +title:one~ +title:title~)))^100 director:"Director, Feed
A." (year:1974^10 year:[1972 TO 1976])

The query for director relevance:
+(director:"Director, Feed A.")^100 (title:"feed A item one title"^10
(+title:feed~ +title:A~ +title:item~ +title:one~ +title:title~))
(year:1974^10 year:[1972 TO 1976])

The query for year relevance:
+((year:1974^10 year:[1972 TO 1976]))^100 (title:"feed A item one
title"^10 (+title:feed~ +title:A~ +title:item~ +title:one~
+title:title~)) director:"Director, Feed A."

If the #1 item returned by the overall query was 1/10 for title, 3/10
for director, and 5/10 for year and those three scoring sections had
equal weights of 1.0 to .10 then I would be able to display the
following scores:
title: 1.0
director: .8
year: .6
overall: 2.4


I looked at the javadocs related to the FunctionQuery class because it
looked interesting, but the actual docs were a bit light and I wasn't
able to determine if it might help me out with this need.

Does this sound unreasonable to anyone? Is there a clearly better way
I might have overlooked?

Thank you very much for your ideas and comments,

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org