From: "Chuck Williams"
To: "Lucene Developers List"
Subject: RE: DefaultSimilarity 2.0?
Date: Mon, 20 Dec 2004 14:28:50 -0800

I believe our objective in this test is to find the best DefaultSimilarity for Lucene. I'd like to extend it to also include finding the best approach to MultiFieldQueryParser.
We can keep the two tests separate, but I'd like to get double-duty out of the core effort to set up a test and evaluation environment and process. More detailed changes to Lucene should probably be excluded from this particular test.

I'm planning to "enter" the Similarity I'm using and the DistributingMultiFieldQueryParser/MaxDisjunctionQuery that I've already posted into Bugzilla (http://issues.apache.org/bugzilla/show_bug.cgi?id=32674). I'm not viewing this as a "competition" in the sense that my objective is not to win. I'm planning on doing little or no specific tuning for the corpus, both because of the problem Joaquin cites and because I don't have the time.

From the standpoint of finding the best defaults to ship with Lucene, I agree that testing against multiple corpuses would be desirable.

Chuck

> -----Original Message-----
> From: Joaquin Delgado [mailto:joaquin@triplehop.com]
> Sent: Monday, December 20, 2004 12:37 PM
> To: Lucene Developers List
> Subject: RE: DefaultSimilarity 2.0?
>
> I understand that not all the vector-space similarity calculation is
> contained within the similarity class (where only factors and their
> values are defined). Will the contestants be allowed to modify any
> relevant classes/methods to improve the relevance quality?
>
> By experience, using only one collection of TREC or other benchmark text
> corpus induces tailoring the algorithms to the corpus. To be fair we
> should run the benchmarks against multiple collections and average
> recall/precision.
>
> -- Joaquin Delgado
>
> -----Original Message-----
> From: Chuck Williams [mailto:chuck@manawiz.com]
> Sent: Monday, December 20, 2004 2:25 PM
> To: Lucene Developers List
> Subject: RE: DefaultSimilarity 2.0?
>
> I agree it makes sense to isolate variables for analysis and comparison.
> It also would seem that we should get as much benefit out of this
> exercise as possible. So, how about multi-field docs with multiple
> query test sets?
> One test set (or more) could have only single-field
> queries. A simple way to do this might be to have three fields on the
> documents: title, body, and all (= title+body). We could have just one
> set of queries that were run twice with a different parser (parsing into
> "all", or parsing into "title" and "body"). That would provide another
> interesting comparison -- a determination of whether or not
> field-specific boosting is a benefit.
>
> Chuck
>
> > -----Original Message-----
> > From: Doug Cutting [mailto:cutting@apache.org]
> > Sent: Monday, December 20, 2004 9:34 AM
> > To: Lucene Developers List
> > Subject: Re: DefaultSimilarity 2.0?
> >
> > Chuck Williams wrote:
> > > Finally, I'd suggest picking content that has multiple fields and allow
> > > the individual implementations to decide how to search these fields --
> > > just title and body would be enough. I would like to use my
> > > MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
> > > the default MultiFieldQueryParser, assuming somebody uses that in this
> > > test).
> >
> > I think that would be a good contest too, but I'd rather first just
> > focus on the ranking of single-field queries. There are a number of
> > issues that come up with multi-field queries that I'd rather postpone in
> > order to reduce the number of variables we test at one time.
> >
> > Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
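
As a rough illustration of the suggestion in the thread to run the benchmarks against multiple collections and average recall/precision, here is a minimal sketch in plain Java. It does not use any Lucene API. The class name MacroAverage, its methods, and the document ids are all hypothetical, and the relevance judgments are made up for the example. It assumes one query per collection and a simple unweighted mean across collections; a real evaluation would likely average over many queries per collection first.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: averages precision and recall
// over several test collections so that no single corpus dominates.
class MacroAverage {

    // Fraction of the distinct retrieved documents that are judged relevant.
    static double precision(List<String> retrieved, Set<String> relevant) {
        long distinct = retrieved.stream().distinct().count();
        if (distinct == 0) {
            return 0.0;
        }
        long hits = retrieved.stream().distinct().filter(relevant::contains).count();
        return (double) hits / distinct;
    }

    // Fraction of the judged-relevant documents that were retrieved.
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        long hits = retrieved.stream().distinct().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    // Unweighted arithmetic mean; returns 0.0 for an empty input.
    static double mean(double... values) {
        if (values.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Made-up results and judgments for two illustrative collections.
        List<String> retrievedA = List.of("d1", "d2", "d3", "d4");
        Set<String> relevantA = Set.of("d1", "d3", "d9");
        List<String> retrievedB = List.of("x1", "x2");
        Set<String> relevantB = Set.of("x1", "x2", "x3", "x4");

        double meanPrecision = mean(precision(retrievedA, relevantA),
                                    precision(retrievedB, relevantB));
        double meanRecall = mean(recall(retrievedA, relevantA),
                                 recall(retrievedB, relevantB));

        System.out.println("mean precision = " + meanPrecision);
        System.out.println("mean recall    = " + meanRecall);
    }
}
```

The same per-collection numbers could be computed once for each candidate Similarity or query parser, which would let the averaged figures be compared side by side.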