lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Can we use TREC data set in open source?
Date Mon, 16 Sep 2013 22:11:41 GMT
Inline below

On Sep 9, 2013, at 10:53 PM, Han Jiang <jianghan08@gmail.com> wrote:

> Back in 2007, Grant contacted NIST about making the TREC collection 
> available to our community: 
> 
> http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser
> 
> I think another attempt at this is really important to our project and to people who 
> use Lucene. All these years, speed has mainly been tuned on 
> Wikipedia; however, it's not very 'standard':
> 
> * it doesn't represent how real-world search works; 
> * it cannot be used to evaluate the relevance of our scoring models;
> * researchers tend to do experiments on other data sets, and usually it is 
>   hard to know whether Lucene is performing at its best.
> 
> And personally I agree with this line:
> 
> > I think it would encourage Lucene users/developers to think about 
> > relevance as much as we think about speed.
> 
> A lot of work went into making Lucene's scoring models pluggable in 4.0, 
> and it would be great if we could explore this further. It is very appealing to 
> see a high-performance library work alongside state-of-the-art ranking 
> methods. 
> 
> 
> As for the TREC data sets, the problems we have run into are:
> 
> 1. NIST/TREC does not own the original collections, so it might be 
>    necessary to contact the organizations that actually own them directly,
>    such as:
> 
>    http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
>    http://lemurproject.org/clueweb12/
> 
> 2. Currently, there is no open-source license for any of the data sets, so 
>    it won't be as 'open' as Wikipedia is.
> 
>    As Grant proposed, one possibility is to make the data set accessible
>    only to committers instead of all users. That is not very open-source,
>    but TREC data sets are public and usually available to researchers, so 
>    people can still reproduce performance tests.
> 
> I'm quite curious, has anyone explored getting an open-source license for 
> one of those data sets? And is our community still interested in this 
> issue after all these years?
> 

It continues to be of interest to me.  I've had various conversations about it over the years.
Most people like the idea, but are not sure how to distribute the data in an open way (ClueWeb
comes as four 1 TB disks right now), and I am also not sure how they would handle any copyright/redaction
claims against it.  There is, of course, little incentive for those involved to solve these problems,
either, as most people who are interested sign the form and pay the $600 for the disks.  I've
had a number of conversations about how I view this as a significant barrier to open research,
esp. in under-served countries, and to open source.  People sympathize with me, but then move
on.

To this day, I think the only way it will happen is for the "community" to build a completely
open system, perhaps based on Common Crawl or our own crawl, host it ourselves, and
develop judgments, etc.  We tried to get this off the ground w/ the Open Relevance Project,
but there was never a sustainable effort, and thus I have little hope for it at this point
(but I would love to be proven wrong).  For it to succeed, I think we would need the backing
of a university with students interested in curating such a collection, the judgments, etc.
I think we could figure out how to distribute the data either as an AWS public data set or
possibly via the ASF or similar (although I am pretty sure the ASF would balk at multi-TB
downloads).  

Happy to hear other ideas.

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com





