lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Can we use TREC data set in open source?
Date Tue, 10 Sep 2013 05:10:59 GMT
I read at http://lemurproject.org/clueweb09/ that there is a hosted
version of ClueWeb09 (the latest is ClueWeb12, for which I can't find a
hosted version). To get access to it, someone from the ASF would need to
sign an Organizational Agreement with them, and each individual in the
project would need to sign an Individual Agreement (retained by the
ASF). Perhaps this could be made available only to committers.

However, we would need access to ClueWeb12 if we want to publish Lucene
results on the latest data set. TREC papers are already based on that
version.

But if we just want to measure performance, relevancy etc., ClueWeb09 could
be a good start.

Shai

On Tue, Sep 10, 2013 at 5:53 AM, Han Jiang <jianghan08@gmail.com> wrote:

> Back in 2007 Grant contacted NIST about making the TREC collection
> available to our community:
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser
>
> I think trying this is really important to our project and to people who
> use Lucene. All these years, speed has mainly been tuned on
> Wikipedia; however, it's not very 'standard':
>
> * it doesn't represent how real-world search works;
> * it cannot be used to evaluate the relevance of our scoring models;
> * researchers tend to run experiments on other data sets, so it is usually
>   hard to know whether Lucene is performing at its best.
>
> And personally I agree with this line:
>
> > I think it would encourage Lucene users/developers to think about
> > relevance as much as we think about speed.
>
> There's been much work to make Lucene's scoring models pluggable in 4.0,
> and it would be great if we could explore this further. It is very
> appealing to see a high-performance library work along with
> state-of-the-art ranking methods.
>
>
> As for the TREC data sets, the problems we met are:
>
> 1. NIST/TREC does not own the original collections, therefore it might be
>    necessary to contact directly the organizations that actually do,
>    such as:
>
>    http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
>    http://lemurproject.org/clueweb12/
>
> 2. Currently, there is no open-source license for any of the data sets, so
>    it won't be as 'open' as Wikipedia is.
>
>    As proposed by Grant, one possibility is to make the data set accessible
>    only to committers instead of all users. That is not very open-source,
>    but TREC data sets are public and usually available to researchers, so
>    people can still reproduce performance tests.
>
> I'm quite curious: has anyone explored getting an open-source license for
> one of those data sets? And is our community still interested in this
> issue after all these years?
>
>
>
> --
> Han Jiang
>
> Team of Search Engine and Web Mining,
> School of Electronic Engineering and Computer Science,
> Peking University, China
>
