Message-ID: <41C34F07.9000009@apache.org>
Date: Fri, 17 Dec 2004 13:26:31 -0800
From: Doug Cutting <cutting@apache.org>
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: DefaultSimilarity 2.0?

Chuck Williams wrote:
> Another issue will likely be the tf() and idf() computations.  I have a
> similar desired relevance ranking and was not getting what I wanted due
> to the idf() term dominating the score. [ ... ]

Chuck has made a series of criticisms of the DefaultSimilarity 
implementation.  Unfortunately these are difficult to evaluate quickly, 
since doing so requires relevance judgements.  Still, we should 
consider modifying DefaultSimilarity for the 2.0 release if there are 
easy improvements to be had.  But how do we decide what's better?

Perhaps we should perform a formal or semi-formal evaluation of various 
Similarity implementations on a reference collection.  For example, for 
a formal evaluation we might use one of the TREC Web collections, which 
have associated queries and relevance judgements.  Or, less formally, we 
could use a crawl of the ~5M pages in DMOZ (I would be glad to collect 
these using Nutch).
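
To make "formal" a bit more concrete: given a ranked result list and a 
set of judged-relevant documents for a query, two standard measures are 
precision at k and average precision.  The following is only a rough 
sketch of those two measures, not a proposed harness.  The class name, 
method names and document ids are made up for illustration, and a real 
TREC run would read the judgements from the collection's qrels files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: scores one ranked result list
// against a set of documents judged relevant for the query.
public class RelevanceEval {

    /** Fraction of the top k results that are judged relevant. */
    public static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    /**
     * Average precision: the precision at each rank where a relevant
     * document appears, summed and divided by the number of relevant
     * documents.  Relevant documents that are never retrieved add zero.
     */
    public static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up ids: results as returned by one candidate, best first.
        List<String> ranked = Arrays.asList("d3", "d1", "d7", "d2");
        Set<String> relevant = new HashSet<String>(Arrays.asList("d1", "d2"));
        System.out.println("P@2 = " + precisionAtK(ranked, relevant, 2));
        System.out.println("AP  = " + averagePrecision(ranked, relevant));
    }
}
```

Averaging the per-query average precision over all judged queries would 
give a single number per Similarity implementation to compare.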

This could work as follows:
   -- Different folks could download and index a reference collection, 
offering demonstration search systems.  We would provide complete code.  
These would differ only in their Similarity implementation.  All 
implementations would use the same Analyzer and search only a single field.
   -- These folks could then announce their candidate implementations and 
let others run queries against them, via HTTP.  Different Similarity 
implementations could thus be publicly and interactively compared.
   -- Hopefully a consensus, or at least a healthy majority, would agree 
on which was the best implementation and we could make that the default 
for Lucene 2.0.
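
To illustrate what "differ only in their Similarity implementation" 
might look like, here is a standalone sketch with two candidates.  It 
does not use the Lucene classes.  TermScorer, DefaultLike, DampedIdf and 
termWeight are hypothetical names.  DefaultLike mirrors the tf() and 
idf() formulas as I understand DefaultSimilarity to compute them, and 
DampedIdf is just one made-up way to weaken idf(), not Chuck's actual 
proposal.  The weight below ignores norms, coord and query normalization.

```java
// Standalone sketch; none of these types are Lucene classes.
public class SimilaritySketch {

    /** Hypothetical stand-in for the two methods under discussion. */
    public interface TermScorer {
        float tf(float freq);
        float idf(int docFreq, int numDocs);
    }

    /** tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1. */
    public static class DefaultLike implements TermScorer {
        public float tf(float freq) {
            return (float) Math.sqrt(freq);
        }

        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
    }

    /** Assumed variant: same tf, but the square root of the idf above. */
    public static class DampedIdf extends DefaultLike {
        public float idf(int docFreq, int numDocs) {
            return (float) Math.sqrt(super.idf(docFreq, numDocs));
        }
    }

    /** Simplified per-term weight: tf times idf, nothing else. */
    public static float termWeight(TermScorer s, float freq, int docFreq, int numDocs) {
        return s.tf(freq) * s.idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        TermScorer[] candidates = { new DefaultLike(), new DampedIdf() };
        int numDocs = 5000000; // roughly the size of the DMOZ crawl
        for (int i = 0; i < candidates.length; i++) {
            TermScorer s = candidates[i];
            // A rare term seen once vs. a common term seen four times.
            float rare = termWeight(s, 1f, 10, numDocs);
            float common = termWeight(s, 4f, 500000, numDocs);
            System.out.println(s.getClass().getName()
                + ": rare=" + rare + " common=" + common);
        }
    }
}
```

Everything else (Analyzer, field, collection) stays fixed, so any 
difference in ranking comes from these two methods alone.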

Are there folks (e.g., Chuck) who would be willing to play this game? 
Should we make it more formal, using, e.g., TREC?  Does anyone have 
other ideas how we should decide how to modify DefaultSimilarity?

Doug
