Doug Cutting wrote:
> Chuck Williams wrote:
>
>> Another issue will likely be the tf() and idf() computations. I have a
>> similar desired relevance ranking and was not getting what I wanted due
>> to the idf() term dominating the score. [ ... ]
>
>
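To make the idf() concern concrete, here is a rough sketch in plain Java (not Lucene code). It assumes the classic DefaultSimilarity formulas, tf = sqrt(freq) and idf = ln(numDocs/(docFreq+1)) + 1, and that idf enters a term's score twice (once through the query weight, once through the field weight); please check both assumptions against the source for your version. The class name, termScore() and dampedIdf() are hypothetical names for illustration only, and dampedIdf() is not a proposal anyone in this thread has made.

```java
// Plain-Java sketch; class and method names are illustrative, not Lucene API.
final class IdfDominanceSketch {

    // Assumed to mirror classic DefaultSimilarity.tf(): square root of the term frequency.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Assumed to mirror classic DefaultSimilarity.idf(): ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Hypothetical damped variant: taking the square root means that,
    // once squared in termScore(), idf contributes only linearly.
    static double dampedIdf(int docFreq, int numDocs) {
        return Math.sqrt(idf(docFreq, numDocs));
    }

    // Per-term contribution, ignoring boosts, length norms, coord and queryNorm.
    static double termScore(double freq, double idfValue) {
        return tf(freq) * idfValue * idfValue;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;

        // A rare term occurring once vs. a common term occurring nine times.
        double rareOnce = termScore(1, idf(10, numDocs));
        double commonNine = termScore(9, idf(100_000, numDocs));
        System.out.println("default idf: rare x1 = " + rareOnce
                + ", common x9 = " + commonNine);

        double rareOnceDamped = termScore(1, dampedIdf(10, numDocs));
        double commonNineDamped = termScore(9, dampedIdf(100_000, numDocs));
        System.out.println("damped idf:  rare x1 = " + rareOnceDamped
                + ", common x9 = " + commonNineDamped);
    }
}
```

Under these assumptions a single occurrence of the rare term outscores nine occurrences of the common one by a wide margin, and the damped variant narrows that gap considerably. Whether narrowing it actually improves ranking is exactly what would need relevance judgements to decide.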
> Chuck has made a series of criticisms of the DefaultSimilarity
> implementation. Unfortunately it is difficult to quickly evaluate
> these, as it requires relevance judgements. But, still, we should
> consider modifying DefaultSimilarity for the 2.0 release if there are
> easy improvements to be had. But how do we decide what's better?
>
> Perhaps we should perform a formal or semi-formal evaluation of various
> Similarity implementations on a reference collection. For example, for
> a formal evaluation we might use one of the TREC Web collections, which have
> associated queries and relevance judgements. Or, less formally, we
> could use a crawl of the ~5M pages in DMOZ (I would be glad to collect
> these using Nutch).
>
> This could work as follows:
> -- Different folks could download and index a reference collection,
> offering demonstration search systems. We would provide complete code.
> These would differ only in their Similarity implementation. All
> implementations would use the same Analyzer and search only a single field.
> -- These folks could then announce their candidate implementations and
> let others run queries against them, via HTTP. Different Similarity
> implementations could thus be publicly and interactively compared.
> -- Hopefully a consensus, or at least a healthy majority, would agree
> on which was the best implementation and we could make that the default
> for Lucene 2.0.
>
> Are there folks (e.g., Chuck) who would be willing to play this game?
I can probably play the game and offer resources, especially if the disk
space needed is not many GB; 1GB is fine. I'm just not clear on how many
people you need participating. Is it one person per Similarity proposal?
I do not have a Similarity proposal myself...
> Should we make it more formal, using, e.g., TREC? Does anyone have
> other ideas how we should decide how to modify DefaultSimilarity?
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
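On the more formal option: if we did use a collection with relevance judgements, each Similarity candidate could be scored per query with something like precision at k and then averaged over queries. Below is a minimal sketch in plain Java, assuming the ranked results and the judged-relevant documents are both available as document ids. The class name, method names and ids are hypothetical, and this is not how TREC's own tooling is implemented.

```java
import java.util.List;
import java.util.Set;

// Sketch of a precision-at-k scorer for comparing ranked result lists
// against relevance judgements. All names here are illustrative.
final class PrecisionAtKSketch {

    // Fraction of the top k ranked ids that are judged relevant.
    // Positions beyond the end of a short result list count as misses.
    static double precisionAtK(List<String> rankedIds, Set<String> relevantIds, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int limit = Math.min(k, rankedIds.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (relevantIds.contains(rankedIds.get(i))) {
                hits++;
            }
        }
        return hits / (double) k;
    }

    // Mean of precision at k over several queries; the two lists are
    // parallel, one entry per query.
    static double meanPrecisionAtK(List<List<String>> runs, List<Set<String>> judgements, int k) {
        if (runs.isEmpty() || runs.size() != judgements.size()) {
            throw new IllegalArgumentException("need one judgement set per non-empty run list");
        }
        double sum = 0.0;
        for (int q = 0; q < runs.size(); q++) {
            sum += precisionAtK(runs.get(q), judgements.get(q), k);
        }
        return sum / runs.size();
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d1", "d4", "d9");
        System.out.println("P@2 = " + precisionAtK(ranked, relevant, 2));
    }
}
```

Each participant could run the same query set through their index and report this one number per Similarity, which would at least give the interactive comparisons something to be checked against.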