lucene-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Test corpus
Date Sun, 02 Apr 2006 10:45:08 GMT
Marvin Humphrey wrote:
> Greets,
>
> I'm looking for a test corpus to use for some benchmarking and parsing 
> tests.  I can whip one up myself, but it would be nice to use 
> something standardized.  I'd like something that doesn't require a 
> license/fee, so that other people can run the same tests.  At least 
> 1000 docs, a few hundred words each.  Any suggestions?

The 20 Newsgroups collection and the old Reuters corpus are both freely 
available, and each contains a sufficient number of documents.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




