lucene-dev mailing list archives

From "Joaquin Delgado" <joaq...@triplehop.com>
Subject RE: DefaultSimilarity 2.0?
Date Mon, 20 Dec 2004 20:37:19 GMT
I understand that not all of the vector-space similarity calculation is
contained within the Similarity class (which defines only the factors and
their values). Will the contestants be allowed to modify any other
relevant classes/methods to improve relevance quality?

In my experience, using only one TREC collection or other benchmark text
corpus encourages tailoring the algorithms to that corpus. To be fair, we
should run the benchmarks against multiple collections and average
recall/precision across them.
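
A minimal sketch of the averaging suggested above, assuming each collection
yields three counts per run (documents retrieved, documents relevant, and
relevant documents retrieved). The class and method names are hypothetical
and are not part of Lucene; precision and recall are computed per collection
and then averaged with equal weight, so no single corpus dominates.

```java
import java.util.List;

// Hypothetical helper, not a Lucene class: equal-weight averaging of
// precision and recall over several benchmark collections.
final class BenchmarkAverage {

    /** Raw counts for one collection (assumed inputs). */
    static final class Run {
        final int retrieved;          // documents returned by the engine
        final int relevant;           // documents judged relevant in the collection
        final int relevantRetrieved;  // returned documents that are relevant

        Run(int retrieved, int relevant, int relevantRetrieved) {
            this.retrieved = retrieved;
            this.relevant = relevant;
            this.relevantRetrieved = relevantRetrieved;
        }

        double precision() {
            return retrieved == 0 ? 0.0 : (double) relevantRetrieved / retrieved;
        }

        double recall() {
            return relevant == 0 ? 0.0 : (double) relevantRetrieved / relevant;
        }
    }

    /** Mean of per-collection precision; each collection counts once. */
    static double meanPrecision(List<Run> runs) {
        if (runs.isEmpty()) throw new IllegalArgumentException("no collections");
        double sum = 0.0;
        for (Run r : runs) sum += r.precision();
        return sum / runs.size();
    }

    /** Mean of per-collection recall; each collection counts once. */
    static double meanRecall(List<Run> runs) {
        if (runs.isEmpty()) throw new IllegalArgumentException("no collections");
        double sum = 0.0;
        for (Run r : runs) sum += r.recall();
        return sum / runs.size();
    }
}
```

Averaging per-collection scores, rather than pooling all counts into one
ratio, keeps a large collection from outweighing a small one.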

-- Joaquin Delgado

-----Original Message-----
From: Chuck Williams [mailto:chuck@manawiz.com] 
Sent: Monday, December 20, 2004 2:25 PM
To: Lucene Developers List
Subject: RE: DefaultSimilarity 2.0?

I agree it makes sense to isolate variables for analysis and comparison.
It also would seem that we should get as much benefit out of this
exercise as possible.  So, how about multi-field docs with multiple
query test sets?   One test set (or more) could have only single-field
queries.  A simple way to do this might be to have three fields on the
documents:  title, body, and all (= title+body).  We could have just one
set of queries that is run twice, each time with a different parser
(parsing into "all", or parsing into "title" and "body").  That would
provide another interesting comparison -- a determination of whether or
not field-specific boosting is a benefit.
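
A rough sketch of the three-field setup described above, using only the Java
standard library. The class and method names are hypothetical, documents are
modeled as plain maps rather than Lucene documents, and the two "parsers"
simply emit query strings in Lucene's field:term syntax for a single-word
query.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Lucene API: one document with three
// fields, and one query term expanded in two different ways.
final class FieldSetup {

    /** Builds a document whose "all" field is title + body. */
    static Map<String, String> makeDoc(String title, String body) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", title);
        doc.put("body", body);
        doc.put("all", title + " " + body);
        return doc;
    }

    /** Single-field variant: the term is searched in "all" only. */
    static String parseIntoAll(String term) {
        return "all:" + term;
    }

    /** Multi-field variant: the term is searched in "title" or "body". */
    static String parseIntoTitleAndBody(String term) {
        return "title:" + term + " OR body:" + term;
    }
}
```

Running the same query set through both variants against the same documents
would give the single-field versus field-specific comparison.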

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:cutting@apache.org]
  > Sent: Monday, December 20, 2004 9:34 AM
  > To: Lucene Developers List
  > Subject: Re: DefaultSimilarity 2.0?
  > 
  > Chuck Williams wrote:
  > > Finally, I'd suggest picking content that has multiple fields and allow
  > > the individual implementations to decide how to search these fields --
  > > just title and body would be enough.  I would like to use my
  > > MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
  > > the default MultiFieldQueryParser, assuming somebody uses that in this
  > > test).
  > 
  > I think that would be a good contest too, but I'd rather first just
  > focus on the ranking of single-field queries.  There are a number of
  > issues that come up with multi-field queries that I'd rather postpone in
  > order to reduce the number of variables we test at one time.
  > 
  > Doug
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

