From: Grant Ingersoll
To: java-dev@lucene.apache.org
Subject: Re: TREC Collection, NIST and Lucene
Date: Fri, 24 Aug 2007 17:52:58 -0400

Inline below is the response from Ms. Ellen Voorhees (the person in charge of TREC) concerning my inquiry about gaining access to TREC data.
I suggest reading from the bottom and working your way up. I edited out some of the copies of old messages to shorten the length here. As you can read, there is some opportunity here for us to gain access to TREC data. The bigger opportunity (and work), I feel, may be the chance, going forward, to help NIST create and distribute collections under an open source license and make them freely available for anyone to use. My suggestion at this point would be to figure out whether there are ways we as a community could help, and also to think about whether it is worthwhile to find a way to purchase one or more collections for use by committers (we could make the data available on zones). So, what do people think?

Begin forwarded message:

> From: Ellen Voorhees
> Date: August 24, 2007 2:43:35 PM EDT
> To: Grant Ingersoll
> Cc: ellen.voorhees@nist.gov
> Subject: Re: TREC Evaluation and Open Source
>
>>> So, I think the current scheme could work for Lucene committers/active contributors provided there is a central machine that all have access to. (I admit to pretty much total ignorance in the actual practice of an open source project.) If the cost of getting the documents is too great, Lucene as a project could sign up to participate in TREC and obtain many of the document sets for free.
>>
>> Hmmm, we do have a machine that only committers have access to. So, if the ASF were to purchase a copy of the data, we could put it on this machine and use it, correct? And individual committers/contributors could be given access to it as long as they sign the individual forms?
>
> Yes, precisely.
>
>>> This is not a solution for the Lucene community, which is too large and far-flung to count as an "organization" in the spirit of the Data Use forms. For collections that are already created under existing agreements, I see no alternative to community members obtaining the document sets on their own. On the other hand, if for community members you are mostly interested in distributing some data set that will show whether Lucene is installed correctly (i.e., not a test collection that IR research should be done on), the subset of the TREC ad hoc collections containing just the Federal Register documents can be used, since those documents have no copyright restrictions and we have some topics and relevance judgments for them. (But the documents are wonky enough and there are few enough topics that I do not consider this a good test collection for research.)
>>
>> We have demos and data sets for testing installation. Mostly, we are looking for feedback in the traditional TREC spirit, i.e. running experiments, testing relevance algorithms, etc. Also, testing scalability, etc. Plus, it helps users make direct comparisons when choosing a search system.
>
> Yes, that is what I originally figured you wanted the collections for, and the Federal Register subset is not a viable candidate for that. The standard ad hoc collections are probably not sufficient for testing scalability---they are only 800,000-1,000,000 documents and about 2GB of text. The collections built in the 'terabyte' track used a crawl of .GOV that is about .5TB of text (this is one of the document sets distributed by the University of Glasgow). Note that we (NIST TREC staff and terabyte track organizers) have some reservations about the completeness of the relevance judgments for the terabyte collections.
>
>>> I think it would be fantastic for the community if there were a good document set that was able to be distributed through an open source license only. I'd be happy to use TREC to get topics and relevance judgments for such a document set so there could be a readily-available, basic, ad hoc retrieval test collection. But our (TREC staff) experience to date with trying to find such document sets has been very negative.
>>
>> Can I have your permission to share our conversation with the larger Lucene development community on the java-dev@lucene.apache.org mailing list? If you would like, I can summarize it instead and report back to the group. I can run the summary by you first if you would like to edit it.
>>
>> Perhaps we can help with the collection task, although I can't promise it. Your staff does a great job already and are undoubtedly the experts on it, but there may very well be individuals who are willing to help, under the proper guidance. Also, I wonder if groups like iBiblio, Creative Commons or Wikipedia might be able to help out. I have met with Paul Jones at iBiblio before and they have an extensive collection of open source documents. I am just not sure if they fit the TREC criteria. Are the criteria publicly documented somewhere?
>>
>> Cheers,
>> Grant
>
> You may share the conversation with the Lucene community, either summarized or straight.
>
> We do not have a specific list of criteria for document sets. Since a full (TREC) test collection is built by using the document set in a TREC track, the "vetting" process has generally happened through the track proposal process. In general, a track is focused on a particular task, and the document set needs to be a reasonable surrogate for the types of documents that are typical for that task. So, the genomics track has used subsets of the medical literature, and the web track used crawls of the web. We also want the document sets to be large enough to be interesting---there is no point putting resources into building a test collection that no one believes is representative of anything real. If the relevance judgments are to be created by NIST assessors, then they have to be "general information" sorts of things, since we do not have a body of assessors with subject matter expertise in any one area. So, the genomics track judgments are not made at NIST, and the original TREC ad hoc collections were mostly newswire. We also want to make sure that the document collection will be available for a (relatively) long time. Again, there is no point committing resources to create topics and relevance judgments unless the documents will be available for a significant time. This latter point also implies taking a snapshot of dynamic collections. That is (one of) the reasons TREC has not used the live web or live Wikipedia as a document collection---to have a standard test collection you need a frozen document set.
>
> FYI, the call for track proposals for TREC 2008 is currently open until mid-September; see http://trec.nist.gov/tracks.html. In my comments above about a basic, ad hoc collection, I was basically envisioning a newsy collection, but that's probably just lack of imagination on my part. Of course, there is no requirement for you to go through TREC to create a collection, and there would probably be little point in doing so if the document set (or task) is such that NIST assessors can't do the assessing. There are other evaluation venues (NTCIR, CLEF), or the Lucene community may decide to just build it yourselves. In the latter case, TREC staff can offer our advice on what we've learned about collection building through the years.
>
> A second FYI: if you want to get more of an idea of the considerations that went into building the early TREC collections, chapter 2 of the TREC book ( http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10667&mode=toc ) is "The TREC Test Collections", authored by Donna Harman.
>
> Ellen

Begin forwarded message:

> From: Ellen Voorhees
> Date: August 24, 2007 11:17:17 AM EDT
> To: Grant Ingersoll
> Cc: ellen.voorhees@nist.gov
> Subject: Re: TREC Evaluation and Open Source
>
> The way the TREC Data Use licenses currently work, the "organization" that requests the data is the legal entity that owns the machine on which the data is put. (An example form is at http://www.nist.gov/srd/trec_org.htm .) That organization defines who it is that may access the data, with the expectation that access would require a person-specific account on the machine. Each such person is to sign an Individual form. (The intent of the Individual forms is that, officially, a copyright owner of the data may ask for a list of all individuals that have (or have had) access to the data. No one has ever asked for such a list, but the language is in the forms to allow this.) Individuals may access the data remotely, provided they do so through the account on the host machine. Individuals at a remote location may not make a copy of the data for their local machine, because that would be redistributing the documents.
>
> So, I think the current scheme could work for Lucene committers/active contributors provided there is a central machine that all have access to. (I admit to pretty much total ignorance in the actual practice of an open source project.) If the cost of getting the documents is too great, Lucene as a project could sign up to participate in TREC and obtain many of the document sets for free.
>
> This is not a solution for the Lucene community, which is too large and far-flung to count as an "organization" in the spirit of the Data Use forms. For collections that are already created under existing agreements, I see no alternative to community members obtaining the document sets on their own. On the other hand, if for community members you are mostly interested in distributing some data set that will show whether Lucene is installed correctly (i.e., not a test collection that IR research should be done on), the subset of the TREC ad hoc collections containing just the Federal Register documents can be used, since those documents have no copyright restrictions and we have some topics and relevance judgments for them. (But the documents are wonky enough and there are few enough topics that I do not consider this a good test collection for research.)
>
> I think it would be fantastic for the community if there were a good document set that was able to be distributed through an open source license only. I'd be happy to use TREC to get topics and relevance judgments for such a document set so there could be a readily-available, basic, ad hoc retrieval test collection. But our (TREC staff) experience to date with trying to find such document sets has been very negative.
>
> Ellen
>
> Grant Ingersoll wrote:
>> Thank you for the detailed response. By the way, this whole discussion, I figured, just falls under the category of: it can't hurt to ask. I know the answer may very well be no, and I completely understand why it should be, for the reasons you have cited: creating these collections takes a lot of work and requires a lot of storage and bandwidth. So, I hope I am not coming across as being critical of the current state of TREC. I very much value what TREC does; I have participated in it in the past and really enjoyed it (other than the long hours I put in running experiments :-) ). The high quality of TREC is one of the reasons why I wanted to ask in the first place!
>>
>> I think what I am trying to find out more about is whether there is any possibility that the Lucene community (or maybe just the committers or active contributors who are not prohibited from contributing based on where they live) could gain access to these documents. That is, could the collections (or future collections) be licensed under an open source license and hosted somewhere that is publicly available and does not require a fee to be paid to the LDC or the like? Perhaps the ASF or iBiblio would do this, or maybe some company would, I don't know, but I am willing to ask the appropriate people. There are plenty of places out there that provide mirrors, etc. for Apache and iBiblio for free, such that storage or cost should not be an issue.
>>
>> I guess some of the difficulty lies in how open source is developed versus how commercial/research systems are developed. We don't have a pool of money that we can use to purchase document collections. Right now, the best we can make publicly available to our users is Wikipedia, which they download and use with some of our tools. It also isn't even clear to me what defines the organization that would be buying the collection if there were money. For instance, if the Apache Software Foundation purchased the document collection, would that mean that anyone at the ASF could use it? The problem is, other than one full-time system admin, all of the ASF is a volunteer organization (and a rather large, global one at that). So, how do you define how a project like Lucene as a whole can use TREC if the ASF were to pay the fee? It would be the equivalent of total redistribution to anyone.
>>
>> I think there are a couple of options that might work:
>> 1. We restrict usage to committers on the project who agree not to redistribute, etc., just like any other researcher/organization.
>> 2. We make future collections available under an open source license.
>>
>> Perhaps there may be a way in the future for Lucene members or the ASF to contribute to making the collection. Knowing the Lucene community and the ASF, I would bet Lucene people would volunteer. However, I am not in a position to volunteer the ASF or others at this point, but I am in a position to see if others are interested in doing so.
>>
>> Thanks,
>> Grant
>>
>> On Aug 22, 2007, at 3:40 PM, Ellen Voorhees wrote:
>>
>>> I am unclear as to what, precisely, you see as the issues. In particular, I would claim that TREC is an evaluation for the retrieval community as a whole.
>>>
>>> Participation in TREC is open to (almost) anyone*. There is no charge for participation itself, though participants are responsible for the registration fee and travel expenses if they attend the meeting held in November. It is also true that participants must purchase the document sets used in some of the tasks, though the majority of document sets are free for participants.
>>>
>>> Individuals who do not participate in TREC can (and do) obtain the TREC test collections. The topics and relevance judgments can be downloaded directly from the appropriate pages in the Data section of the TREC web site. Non-participants must purchase most of the document sets.
>>>
>>> We have made a very concerted effort to obtain document sets as free from restrictions as possible. Nonetheless, good (i.e., representative of content people might actually search) documents tend to be the intellectual property of some organization and thus subject to copyright. There are also administrative and distribution costs that must be covered.
>>> The majority of the document sets used in TREC are covered by a license that 1) allows the data to be used for research purposes only and 2) prohibits the redistribution of the documents by anyone other than the organization originally granted that right. So, some of the TREC document sets must be obtained from the Linguistic Data Consortium (www.ldc.upenn.edu), some from NIST, and some from the University of Glasgow (http://ir.dcs.gla.ac.uk/test_collections/). Since the agreements are already in place with the original sources of documents for the current collections, we cannot change the license agreements after the fact. The least expensive document sets are US $180; the most expensive are 400 pounds.
>>>
>>> I am very much interested in knowing what specific obstacles keep you from participating in TREC and any suggestions you may have for eliminating/minimizing those. We are well aware that the fewer restrictions (of any kind) there are on data sets, the more use they receive and the more novel uses are made of them. But we are equally aware of the difficulties of obtaining large, representative, appropriate document sets that may be distributed world-wide with no restrictions.
>>>
>>> Ellen Voorhees
>>>
>>> * The qualification is there because, as federal employees, NIST staff members are prohibited from corresponding with certain countries. Citizens of those countries are therefore unable to participate in TREC.

Begin forwarded message:

> From: Ellen Voorhees
> Date: August 22, 2007 3:40:15 PM EDT
> To: Grant Ingersoll
> Cc: ellen.voorhees@nist.gov
> Subject: Re: TREC Evaluation and Open Source
>
> I am unclear as to what, precisely, you see as the issues. In particular, I would claim that TREC is an evaluation for the retrieval community as a whole.
>
> Participation in TREC is open to (almost) anyone*. There is no charge for participation itself, though participants are responsible for the registration fee and travel expenses if they attend the meeting held in November. It is also true that participants must purchase the document sets used in some of the tasks, though the majority of document sets are free for participants.
>
> Individuals who do not participate in TREC can (and do) obtain the TREC test collections. The topics and relevance judgments can be downloaded directly from the appropriate pages in the Data section of the TREC web site. Non-participants must purchase most of the document sets.
>
> We have made a very concerted effort to obtain document sets as free from restrictions as possible. Nonetheless, good (i.e., representative of content people might actually search) documents tend to be the intellectual property of some organization and thus subject to copyright. There are also administrative and distribution costs that must be covered. The majority of the document sets used in TREC are covered by a license that 1) allows the data to be used for research purposes only and 2) prohibits the redistribution of the documents by anyone other than the organization originally granted that right. So, some of the TREC document sets must be obtained from the Linguistic Data Consortium (www.ldc.upenn.edu), some from NIST, and some from the University of Glasgow (http://ir.dcs.gla.ac.uk/test_collections/). Since the agreements are already in place with the original sources of documents for the current collections, we cannot change the license agreements after the fact. The least expensive document sets are US $180; the most expensive are 400 pounds.
>
> I am very much interested in knowing what specific obstacles keep you from participating in TREC and any suggestions you may have for eliminating/minimizing those.
> We are well aware that the fewer restrictions (of any kind) there are on data sets, the more use they receive and the more novel uses are made of them. But we are equally aware of the difficulties of obtaining large, representative, appropriate document sets that may be distributed world-wide with no restrictions.
>
> Ellen Voorhees
>
> * The qualification is there because, as federal employees, NIST staff members are prohibited from corresponding with certain countries. Citizens of those countries are therefore unable to participate in TREC.
>
> Grant Ingersoll wrote:
>> Dear Ms. Voorhees,
>>
>> My name is Grant Ingersoll and I am a committer on the Lucene Java search library (http://lucene.apache.org) at the Apache Software Foundation (ASF). I am not, however, writing in any official capacity as a representative of the ASF. Perhaps at a later date this will change, but for now I just want to keep things informal.
>>
>> I am, however, interested in starting a discussion about how open source projects like Lucene could participate in future TREC evaluations, or at least gain access to TREC data resources. While the people involved in Lucene feel we have built a top-notch search system, one of the things the community as a whole lacks is the ability to do formal evaluations like TREC offers, and thus research and development of new algorithms is hindered. Granted, individuals may perform TREC evaluations provided they have purchased a license to the data, but the community as a whole does not have this ability.
>>
>> I am wondering if there is some way in which we can arrange for open source projects to obtain access to the TREC collections. The biggest barrier for projects like Lucene, obviously, is the fee that needs to be paid. Furthermore, there are undoubtedly distribution and copyright concerns. Yet, a part of me feels that we can work something out through creative licensing or some other novel approach that protects the appropriate interests, furthers TREC's mission, and supports the vibrant open source community around Lucene and other search engines. Perhaps it would be possible to require that any participant who wants the TREC data must prove that they are appropriately affiliated with an official open source project, as defined by the Open Source Initiative (http://www.opensource.org). Many tool vendors have similar licenses that allow open source participants to use their tools while working on open source projects. Perhaps we could provide a similar approach to the TREC data.
>>
>> I feel this would benefit TREC substantially, by providing an open, baseline system for all the world to see, and I see that it fits very much with the motto of TREC: "...to encourage research in information retrieval from large text collections." Naturally, it benefits Lucene by allowing Lucene to undertake more formal evaluation of relevance, etc.
>>
>> If you are interested in more background on this on the Lucene Java developers mailing list, please refer to http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022
>>
>> I look forward to hearing back from you and I would be more than happy to answer any questions you have.
>>
>> Sincerely,
>> Grant Ingersoll