lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Open Source Relevance
Date Mon, 19 May 2008 19:58:09 GMT
Copied from

For a while now, I have been trying to get my hands on TREC data for  
the Lucene project.  For those who aren’t familiar, TREC is an annual  
competition for search engines that provides a common set of documents  
to index, queries to execute and judgments to check your answers
against, so you can see how well an engine performs.  While it isn’t
the be all, end all for relevance, it is a pretty good sanity check on
how you are doing.
For instance, many search engines do OK out of the box on it, but once  
you tune them, they can do much better.  Of course, you risk  
overtuning to TREC as well.

In TREC, the queries and the judgments are provided for free, but one  
has to pay for the data, or at least most of it, since it is usually  
owned by Reuters or some other organization.  It isn’t expensive or  
anything, but it is a barrier nonetheless, especially for an open
source project.  Furthermore, the whole notion of paying for data in  
this day and age of open source and Creative Commons just doesn’t sit  
right with me.  Don’t get me wrong, I’m a big fan of TREC, having
participated in the past.  It provides a valuable service to the
proprietary/academic IR community.

So, what does this have to do with Lucene?  When I say I am trying to  
get my hands on TREC data, I don’t mean just for me, I literally mean  
obtaining TREC data for Lucene.  That is, I want the data to be made  
available, ideally, for all Lucene (and, for that matter, all open  
source search engine) users to use and run experiments on so as to  
spur on innovation in Lucene’s scoring algorithms, etc.  Now, I know  
the copyright owners will never allow this, as I have asked.  So, my  
next thought was let’s just get it for internal use by committers at  
Apache.  So, I went back to TREC and we have an agreement to do this,  
more or less.  The problem, however, is that they say we can only use  
the data on ASF (Apache) machines.  Not a big deal, right?  Kind of.   
The ASF doesn’t really have the hardware to run TREC style  
experiments.  We pretty much have one Solaris “zone” allotted to us (a
“zone” is a running virtual machine guest image).  Furthermore, the
ASF is pretty much an all-volunteer, worldwide distributed
organization.  We do almost all of our work on our own machines as  
VOLUNTEERS.   Practically speaking, the best way for any of us to take  
advantage of the data is to have it locally, which I am told, isn’t  
going to happen.

So, what’s the point?  I think it is time the open source search  
community (and I don’t mean just Lucene) develop and publish a set of  
TREC-style relevance judgments for freely available data that is  
easily obtained from the Internet.  Simply put, I am wondering if  
there are volunteers out there who would be willing to develop a  
practical set of queries and judgments for datasets like Wikipedia,  
iBiblio, the Internet Archive, etc.  We wouldn’t host these datasets,  
we would just provide the queries and judgments, as well as the info  
on how to obtain the data.  Then, it is easy enough to provide simple  
scripts that do things like run Lucene’s contrib/benchmark Quality  
tasks against said data.
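
To make the "queries and judgments" deliverable a bit more concrete:
TREC distributes its judgments as plain-text "qrels" files, one
judgment per line, with four whitespace-separated fields (topic id, an
iteration field that is usually ignored, document id, relevance).  Here
is a rough Python sketch of reading such a file, assuming we simply
reuse that layout.  The function name, the sample lines and the
document ids are made up for illustration; they are not real judgments.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map topic id -> {doc id: relevance} from TREC-style qrels lines."""
    judgments = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Fields: topic, iteration (ignored here), document id, relevance
        topic, _iteration, doc_id, relevance = line.split()
        judgments[topic][doc_id] = int(relevance)
    return dict(judgments)


# Hypothetical sample judgments, for illustration only.
SAMPLE_QRELS = """\
1 0 wiki/Apache_Lucene 1
1 0 wiki/Apache_Solr 0
2 0 wiki/Information_retrieval 1
"""

qrels = parse_qrels(SAMPLE_QRELS.splitlines())
```

Keeping to the existing qrels layout should mean the files can be fed
to tools that already understand TREC judgments, rather than inventing
a new format.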

Practically speaking, I don’t think we even need to go as deep as  
TREC.  I think we would find the most use in making judgments on the  
top 10 or 20 results for any given query.
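
If we only judge the top 10 or 20 results, the natural number to
report is precision at k: the fraction of the top k results that were
judged relevant.  A minimal sketch, assuming results come back as a
ranked list of document ids and that anything unjudged counts as not
relevant (both are my assumptions, as are the names and the example
ids):

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k results that were judged relevant.

    Unjudged documents count as non-relevant, and a result list
    shorter than k is still divided by k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k


# Hypothetical ranked results and judgments for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1"}
p_at_5 = precision_at_k(ranked, relevant, k=5)  # 2 relevant out of 5
```

Averaging this over all judged queries gives a single number to
compare before and after a scoring change.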

So, what do others think?  Am I off my rocker?  Are there any  
volunteers out there?  I think we could do this pretty simply through  
some scripts and the effective use of a wiki.  I don’t think our goal
in the short run is to be scientifically rigorous, though it should be
over time.  Instead, I think our goal is to run a practical relevance
test like any organization should when deploying search: take 50 (top)  
queries and judge them, as well as 20 or so random queries and judge  
them.  (I wonder if Wikipedia would give us their top 50 queries, or
maybe it is already available.)  Over time, we can add queries, and  
refine judgments using the web 2.0 mentality of the wisdom of crowds.

FWIW, there is probably some alignment with the Wikia search project.
