lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject TREC Collection, NIST and Lucene
Date Tue, 07 Aug 2007 21:57:28 GMT
DISCLAIMER: Just to be clear, what follows is my personal opinion and  
in no way, shape or form reflects an official position from the  
Lucene project:

So, now that we have all this great stuff for running TREC  
experiments in contrib/benchmark, I am wondering if people think it  
would be useful to send an open letter (and I mean an official letter
from the ASF in a similar vein to
http://apache.org/jcp/sunopenletter.html) to the people that run TREC
(i.e. NIST) inquiring
if there is a way in which we can let our users obtain one or more  
TREC collections for running experiments.

It seems to me that having the Lucene community (and other open
source search projects, if they want) involved in TREC would be a real
plus for the competition, since we could serve as a baseline AND, since
we are transparent in what we do (i.e. our algorithms are open for
public scrutiny), we can truly encourage open research.  After all,
the whole goal of TREC (according to their website) is:
        "...to encourage research in information retrieval from large  
text collections."

By lowering the cost of entry, we truly could further this goal.   
After all, isn't furthering research about others being able to  
repeat experiments?  If you don't have the data, you can't repeat the  
experiment.

As for benefits to us, it allows us to do direct comparisons of
Lucene and gives us some data points about how Lucene performs in
terms of precision and recall (not that TREC is the be-all and end-all
for measuring relevance, but...).  Furthermore, I think it would
encourage Lucene users/developers to think about relevance as much as  
we think about speed.  Also, I think it would help us think about how  
to make Lucene scoring more pluggable (and still fast) such that we  
could make alternate relevancy models available similar to the  
Axiomatic Retrieval Scoring that was recently proposed.

Currently, the data is copyrighted and you pay to gain access, as I  
understand it (it has been a while since I ran TREC).  So, do people  
have suggestions on ways we could address this?  Maybe people have to
sign a waiver or something, or maybe the ASF could work out something
with NIST, or maybe the license could allow for personal use?  I'm really
speculating here...  The key is the data needs to be free for open
source use.  I don't think it needs to be ASF licensed.   Perhaps if  
we can present some possible solutions to the problem, our proposal  
will be more likely to be accepted.

Obviously, we would have to discuss this with the ASF as well to see  
if they would support it.

So, is this worthwhile to people?  Am I barking up the wrong tree?  I  
am willing to write up the letter and do the legwork, but I want to  
know the community is behind it as well.  Perhaps it would be better  
to do this informally?  Maybe just send an inquiry saying "Hey, we're  
from the Lucene project and we would love to be able to do TREC runs,  
can you help us out? Yada, yada, yada..."

Just thinking out loud,
Grant




