From: Doug Cutting
To: Lucene Developers List
Date: Fri, 17 Dec 2004 13:26:31 -0800
Subject: DefaultSimilarity 2.0?
Message-ID: <41C34F07.9000009@apache.org>

Chuck Williams wrote:
> Another issue will likely be the tf() and idf() computations. I have a
> similar desired relevance ranking and was not getting what I wanted due
> to the idf() term dominating the score. [ ... ]

Chuck has made a series of criticisms of the DefaultSimilarity
implementation. Unfortunately, it is difficult to evaluate these quickly,
since doing so requires relevance judgements. Still, we should consider
modifying DefaultSimilarity for the 2.0 release if there are easy
improvements to be had.

But how do we decide what's better? Perhaps we should perform a formal or
semi-formal evaluation of various Similarity implementations on a
reference collection. For example, for a formal evaluation we might use
one of the TREC Web collections, which have associated queries and
relevance judgements. Or, less formally, we could use a crawl of the ~5M
pages in DMOZ (I would be glad to collect these using Nutch).

This could work as follows:

-- Different folks could download and index a reference collection,
offering demonstration search systems. We would provide complete code.
These systems would differ only in their Similarity implementation. All
implementations would use the same Analyzer and search only a single
field.

-- These folks could then announce their candidate implementations and
let others run queries against them, via HTTP. Different Similarity
implementations could thus be publicly and interactively compared.

-- Hopefully a consensus, or at least a healthy majority, would emerge
on which implementation was best, and we could make that the default for
Lucene 2.0.

Are there folks (e.g., Chuck) who would be willing to play this game?
Should we make it more formal, using, e.g., TREC? Does anyone have other
ideas about how we should decide how to modify DefaultSimilarity?

Doug