From: Stephane James Vaucher
To: Lucene Developers List
Date: Wed, 7 Apr 2004 11:53:41 -0400 (EDT)
Subject: Re: looking for a large test corpus for a lucene presentation
In-Reply-To: <4073C4C8.7090000@ctx.com.au>

A few references:

http://www.daviddlewis.com/resources/testcollections/reuters21578/
http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html

sv

On Wed, 7 Apr 2004, Matt Quail wrote:

> Hi all,
>
> I'm doing a presentation to my local JUG on Lucene, and I'm looking for
> a "good" set of documents to use as a demonstration.
>
> Ideally it would be:
> 1) large (10,000 plus?).
> 2) contain some metadata besides "body" (like author, date, primarykey,
> etc).
> 3) freely available.
>
> I was going to use the data from the previous Google programming
> contest, but it doesn't seem to be available.
>
> If I can't find anything satisfactory, I'll probably:
> - generate a fake whitepages phonebook
> - grab documents from project Gutenberg
>
> My preference is for some "real" data, but I'm happy to generate fake
> data if no-one has any better ideas.
>
> :D
>
> =Matt
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org