pdfbox-dev mailing list archives

From Antoni Mylka <antoni_my...@poczta.onet.pl>
Subject Test documents
Date Tue, 16 Aug 2011 10:46:05 GMT

I tried investigating PDFBOX-1075 and discovered that it's related to a fix 
applied for PDFBOX-1010. The earlier fix did not come with a unit 
test, so I had to download a doc directly from JIRA to check that 
my fix didn't break the earlier one.

Is this because pdfbox is liberal (don't require unit tests, keep the 
barriers to patches low), or conservative (copyright on the pdfs is 
tricky, don't commit them)? Is there any "official" policy?

I do much of my text-extraction regression testing on the "govdocs1" 
dataset [1,2,3,4]. There are on the order of 300 thousand PDFs in there. 
All have been downloaded from public-facing websites owned by some US 
Government organization. They are all public, yet the copyright cannot 
be transferred to ASF. Are they OK?

Antoni Myłka

Short description:
[1] http://digitalcorpora.org/corpora/files
Longer description:
[2] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
A million documents:
[3] http://domex.nps.edu/corp/files/govdocs1/
A million documents packaged into 1000 zip files:
[4] http://domex.nps.edu/corp/files/govdocs1/zipfiles/
