Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Received-SPF: pass (hermes.apache.org: local policy)
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable
Subject: RE: DefaultSimilarity 2.0?
Date: Mon, 20 Dec 2004 14:28:50 -0800
Message-ID: 
 <E3381E0825F1954D953A4E347017BB1C02CC6E3D@reh001-1.REX001.ExchangeByRegister.com>
Thread-Topic: DefaultSimilarity 2.0?
Thread-Index: AcTmullStfIAqUeVSMK3435WJjpmYAADpRBgAAHXK7AABGjhoA==
From: "Chuck Williams" <chuck@manawiz.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>

I believe our objective in this test is to find the best
DefaultSimilarity for Lucene.  I'd like to extend it to also include
finding the best approach to MultiFieldQueryParser.  We can keep the two
tests separate, but I'd like to get double-duty out of the core effort
to set up a test and evaluation environment and process.  More detailed
changes to Lucene should probably be excluded from this particular test.

I'm planning to "enter" the Similarity I'm using and the
DistributingMultiFieldQueryParser/MaxDisjunctionQuery that I've already
posted into Bugzilla
(http://issues.apache.org/bugzilla/show_bug.cgi?id=32674).  I'm not
viewing this as a "competition" in the sense that my objective is not to
win.  I'm planning on doing little or no specific tuning for the corpus,
both because of the problem Joaquin cites and because I don't have the
time.

From the standpoint of finding the best defaults to ship with Lucene, I
agree that testing against multiple corpuses would be desirable.

Chuck

  > -----Original Message-----
  > From: Joaquin Delgado [mailto:joaquin@triplehop.com]
  > Sent: Monday, December 20, 2004 12:37 PM
  > To: Lucene Developers List
  > Subject: RE: DefaultSimilarity 2.0?
  >
  > I understand that not all the vector-space similarity calculation is
  > contained within the similarity class (where only factors and their
  > values are defined). Will the contestants be allowed to modify any
  > relevant classes/methods to improve the relevance quality?
  >
  > By experience, using only one collection of TREC or other benchmark
  > text corpus induces tailoring the algorithms to the corpus. To be fair
  > we should run the benchmarks against multiple collections and average
  > recall/precision.
  >
  > -- Joaquin Delgado
  >
  > -----Original Message-----
  > From: Chuck Williams [mailto:chuck@manawiz.com]
  > Sent: Monday, December 20, 2004 2:25 PM
  > To: Lucene Developers List
  > Subject: RE: DefaultSimilarity 2.0?
  >
  > I agree it makes sense to isolate variables for analysis and
  > comparison.  It also would seem that we should get as much benefit
  > out of this exercise as possible.  So, how about multi-field docs with
  > multiple query test sets?  One test set (or more) could have only
  > single-field queries.  A simple way to do this might be to have three
  > fields on the documents:  title, body, and all (= title+body).  We
  > could have just one set of queries that were run twice with a
  > different parser (parsing into "all", or parsing into "title" and
  > "body").  That would provide another interesting comparison -- a
  > determination of whether or not field-specific boosting is a benefit.
  >
  > Chuck
  >
  >   > -----Original Message-----
  >   > From: Doug Cutting [mailto:cutting@apache.org]
  >   > Sent: Monday, December 20, 2004 9:34 AM
  >   > To: Lucene Developers List
  >   > Subject: Re: DefaultSimilarity 2.0?
  >   >
  >   > Chuck Williams wrote:
  >   > > Finally, I'd suggest picking content that has multiple fields
  >   > > and allow the individual implementations to decide how to search
  >   > > these fields -- just title and body would be enough.  I would
  >   > > like to use my MaxDisjunctionQuery and see how it compares to
  >   > > other approaches (e.g., the default MultiFieldQueryParser,
  >   > > assuming somebody uses that in this test).
  >   >
  >   > I think that would be a good contest too, but I'd rather first
  >   > just focus on the ranking of single-field queries.  There are a
  >   > number of issues that come up with multi-field queries that I'd
  >   > rather postpone in order to reduce the number of variables we test
  >   > at one time.
  >   >
  >   > Doug
  >   >
  >   >
  >   > ---------------------------------------------------------------------
  >   > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  >   > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org