Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Subject: Re: Federated relevance ranking
From: Toke Eskildsen <te@statsbiblioteket.dk>
Reply-To: te@statsbiblioteket.dk
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
In-Reply-To: <4DE7E9BC.7000304@hms.harvard.edu>
References: <4DE7E9BC.7000304@hms.harvard.edu>
Content-Type: text/plain; charset="UTF-8"
Organization: State and University Library, Denmark
Date: Mon, 6 Jun 2011 11:05:33 +0200
Message-ID: <1307351133.766.299.camel@te-prime>
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

On Thu, 2011-06-02 at 21:51 +0200, Clint Gilbert wrote:
> We're also considering a home-grown scheme involving normalizing the
> denominators of all the index components in all our indices, based on
> the sums of counts obtained from all the indices.  This feels like
> re-inventing the wheel, and it's not clear to me yet that the low-level
> manipulation of indices that we'd need to do is even possible.

We're currently experimenting with this approach, albeit only for two
searchers. Since we have very little control of the secondary searcher,
just a basic search-API, we're really hacking and performing a query
rewrite based on term statistics. This only works for basic term queries
(no wildcards, ranges etc.), but fortunately our search logs show that
they are by far the most common.

The math is not too bad: Extract occurrence counts for the terms, sum
them, calculate the difference when sending a request to a specific
searcher and set a term boost in the textual query, so that the standard
ranking formula in Lucene will yield the desired score.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org