lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: DefaultSimilarity 2.0?
Date Sat, 18 Dec 2004 18:07:36 GMT
I haven't run it yet to take a look at the collections, but the code
looks fine.  subject and body will make good content fields to query
against.  I think we need a couple of additional things, though:
  1.  An interactive UI for trying queries -- should be a webapp so that
people can use it.  The batch query UI should be maintained for running
a standard test set.  There needs to be a way to see the results of the
batch test.  I didn't look carefully at how this is done now.  The
emphasis in this test is on understanding the ordering and scoring of
results, not on performance, although basic timing should be included in
case any of the implementations differ on this dimension.
  2.  Instead of using QueryParser against body, it should use
MultiFieldQueryParser against subject and body (or maybe against
subject, from and body).  Apps may change this (I will change it to use
my approach for multiple fields).
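
A minimal sketch of the batch side of point 1, assuming each contributed
implementation can be wrapped behind a small adapter.  The names
BatchQueryRunner, SearchImplementation, Hit and Result are hypothetical;
they are not part of Lucene or of LuceneBenchmark.  The sketch runs every
query in a standard set against every implementation and keeps the ranked
hits, their scores, and a basic wall-clock timing for each pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical harness; none of these types exist in Lucene or LuceneBenchmark.
final class BatchQueryRunner {

    /** One scored hit as reported by an implementation. */
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Adapter a contributor would write around their own Similarity and query construction. */
    interface SearchImplementation {
        String name();

        /** Returns hits in the order the implementation ranks them, best first. */
        List<Hit> search(String query, int topN);
    }

    /** Ranked hits and elapsed time for one (query, implementation) pair. */
    static final class Result {
        final String query;
        final String implementation;
        final List<Hit> hits;
        final long elapsedNanos;

        Result(String query, String implementation, List<Hit> hits, long elapsedNanos) {
            this.query = query;
            this.implementation = implementation;
            this.hits = hits;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs every query against every implementation, grouped by query for side-by-side reading. */
    static List<Result> runAll(List<String> queries, List<SearchImplementation> impls, int topN) {
        List<Result> results = new ArrayList<Result>();
        for (String query : queries) {
            for (SearchImplementation impl : impls) {
                long start = System.nanoTime();
                List<Hit> hits = impl.search(query, topN);
                long elapsed = System.nanoTime() - start;
                results.add(new Result(query, impl.name(), hits, elapsed));
            }
        }
        return results;
    }

    /** Formats the results as plain text: one header line per pair, then rank, doc id and score. */
    static String report(List<Result> results) {
        StringBuilder out = new StringBuilder();
        for (Result r : results) {
            out.append(String.format(Locale.ROOT, "query=%s impl=%s time=%.3fms%n",
                    r.query, r.implementation, r.elapsedNanos / 1e6));
            int rank = 1;
            for (Hit h : r.hits) {
                out.append(String.format(Locale.ROOT, "  %2d. %-20s %.4f%n",
                        rank++, h.docId, h.score));
            }
        }
        return out.toString();
    }
}
```

Grouping the output by query rather than by implementation puts the
competing orderings for the same query next to each other, which matches
the stated emphasis on ordering and scoring.  The timing here is a single
unwarmed measurement, so it can only flag gross differences.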

Chuck

  > -----Original Message-----
  > From: Andrzej Bialecki [mailto:ab@getopt.org]
  > Sent: Friday, December 17, 2004 4:06 PM
  > To: Lucene Developers List
  > Subject: Re: DefaultSimilarity 2.0?
  > 
  > Chuck Williams wrote:
  > > I think this is a great idea and would be happy to play the game.
  > > Re. the collection, there is some benefit to TREC if somebody is
  > > going to do formal recall and precision computations, otherwise it
  > > doesn't matter much.  The best Similarity for any collection is
  > > likely to be specific to the collection, so if the point here is to
  > > pick the best DefaultSimilarity, the collection should be as
  > > representative of Lucene users' content as possible (I know this is
  > > probably impossible to achieve).
  > >
  > > One possible danger in these kinds of bake-offs is that people who
  > > know the content will likely craft specific queries that are not
  > > reflective of real users.  It would be good to at least have a
  > > standard set of queries that was tested against each implementation.
  > > Perhaps each person could contribute a set of test queries in
  > > addition to their Similarity and the combined query set could be
  > > tested against each.
  > >
  > > Finally, I'd suggest picking content that has multiple fields and
  > > allow the individual implementations to decide how to search these
  > > fields -- just title and body would be enough.  I would like to use
  > > my MaxDisjunctionQuery and see how it compares to other approaches
  > > (e.g., the default MultiFieldQueryParser, assuming somebody uses
  > > that in this test).
  > 
  > I believe the collection that I'm using in LuceneBenchmark meets
  > most if not all of these requirements - the "20 newsgroups" corpus.
  > Please see the following link for the benchmark code:
  > 
  > 	http://www.getopt.org/lb/LuceneBenchmark.java
  > 
  > 
  > This collection has the benefit that it's relatively easy to judge
  > the relative relevance scores, because the nature and structure of
  > the corpus is well understood.
  > 
  > --
  > Best regards,
  > Andrzej Bialecki
  >   ___. ___ ___ ___ _ _   __________________________________
  > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  > ___|||__||  \|  ||  |  Embedded Unix, System Integration
  > http://www.sigram.com  Contact: info at sigram dot com
  > 
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org



