pdfbox-dev mailing list archives

From Antoni Mylka <antoni_my...@poczta.onet.pl>
Subject Re: Test documents
Date Tue, 16 Aug 2011 15:15:41 GMT
Hi,

I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for 
handling test documents which are public, but not explicitly licensed to 
the ASF for "redistribution".

On 2011-08-16 14:29, Jukka Zitting wrote:
> Hi,
>
> On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
> <antoni_mylka@poczta.onet.pl>  wrote:
>> Is this because pdfbox is liberal (don't require unit tests, keep the
>> barriers to patches low), or conservative (copyright on the pdfs is tricky,
>> don't commit them)? Is there any "official" policy?
>
> Better test coverage is always a good thing and should be our goal.
>
> That said, many of the example PDF files we see (like the one on
> PDFBOX-1010) don't come with a license that would allow them to be
> redistributed as a part of an Apache project. See [1] for Apache
> guidelines on how to handle external material that hasn't explicitly
> been contributed for redistribution by the ASF.
>
> See also [2] for related earlier work in dealing with test files with
> unknown or unacceptable licensing status.
>
>> I do much of my text-extraction regression testing on the "govdocs1" dataset
>> [1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have
>> been downloaded from public-facing websites owned by some US Government
>> organization. They are all public, yet the copyright cannot be transferred
>> to ASF. Are they OK?
>
> This is probably a question best answered by legal-discuss@apache.org.
> My intuition says that the best way to handle such material would be
> by reference. For example a test case could refer to specific
> documents within the corpus by path or document id, and would only be
> executed when the user has explicitly downloaded the corpus and made
> it available to the PDFBox build.
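
To make the "by reference" idea concrete, here is a minimal sketch of how a
test could look up a corpus document and skip itself when the corpus has not
been downloaded. The property name pdfbox.test.corpus.dir, the class name and
the document path are assumptions made for illustration; nothing like this
exists in the PDFBox build today.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class CorpusReference {
    // Hypothetical property name; not an existing PDFBox build setting.
    static final String CORPUS_PROPERTY = "pdfbox.test.corpus.dir";

    /** Resolves a corpus-relative path, or returns empty if it is not available locally. */
    static Optional<Path> resolve(String corpusDir, String relativePath) {
        if (corpusDir == null || corpusDir.isEmpty()) {
            return Optional.empty(); // corpus was never configured
        }
        Path root = Paths.get(corpusDir).toAbsolutePath().normalize();
        Path file = root.resolve(relativePath).normalize();
        // Reject references that escape the corpus root, and files that are absent.
        if (!file.startsWith(root) || !Files.isRegularFile(file)) {
            return Optional.empty();
        }
        return Optional.of(file);
    }

    public static void main(String[] args) {
        // "000/000123.pdf" is a made-up document path used only for illustration.
        Optional<Path> doc = resolve(System.getProperty(CORPUS_PROPERTY), "000/000123.pdf");
        if (doc.isPresent()) {
            System.out.println("would run the extraction test on " + doc.get());
        } else {
            System.out.println("corpus not available, skipping");
        }
    }
}
```

A real test would probably express the skip through the test framework's own
mechanism rather than a println, so that skipped cases show up in the report.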

There doesn't seem to be much information on the ASF Legal FAQ [1] about 
"external material" that is not a library. I guess I'll ask on 
legal-discuss.

My idea is to include such tests in a separate suite which would 
download the docs from a URL list. The suite would NOT run by 
default. It could even live outside the main source tree, because URL 
lists can quickly get out of date, while a release must still compile 
after 10 years. This would allow automated testing of docs from 
govdocs1 [3,4,5], JIRA issues, old pdfbox SF issues and any public 
website stable enough to hold a file for a long time: everything which 
by ASF policy cannot be committed to SVN. Do you think it's a good idea?
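
A rough sketch of what I have in mind for the opt-in suite follows. It is
only an illustration: the pdfbox.test.remote flag, the class and method
names, and the list-file format (one URL per line, '#' for comments) are all
my assumptions, not anything that exists in the build.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class RemoteTestDocs {
    // Hypothetical opt-in flag; the suite does nothing unless it is set to "true".
    static final String ENABLE_PROPERTY = "pdfbox.test.remote";

    /** Parses a URL list: one URL per line, blank lines and '#' comments ignored. */
    static List<URI> parseUrlList(List<String> lines) {
        List<URI> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(URI.create(trimmed));
        }
        return urls;
    }

    /** Derives a cache file name; the hash prefix keeps same-named files from different URLs apart. */
    static String cacheFileName(URI url) {
        String path = url.getPath();
        String name = path == null ? "" : path.substring(path.lastIndexOf('/') + 1);
        String prefix = Integer.toHexString(url.toString().hashCode());
        return name.isEmpty() ? prefix : prefix + "-" + name;
    }

    /** Downloads the document unless it is already cached, and returns the local path. */
    static Path fetch(URI url, Path cacheDir) throws IOException {
        Files.createDirectories(cacheDir);
        Path target = cacheDir.resolve(cacheFileName(url));
        if (!Files.exists(target)) {
            // Download to a temporary file first so a failed transfer leaves no partial document.
            Path tmp = Files.createTempFile(cacheDir, "download", ".part");
            try (InputStream in = url.toURL().openStream()) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            } finally {
                Files.deleteIfExists(tmp);
            }
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (!Boolean.getBoolean(ENABLE_PROPERTY) || args.length < 2) {
            System.out.println("remote document suite disabled, skipping");
            return;
        }
        // args[0]: URL list file, args[1]: local cache directory.
        for (URI url : parseUrlList(Files.readAllLines(Paths.get(args[0])))) {
            try {
                System.out.println("ready for testing: " + fetch(url, Paths.get(args[1])));
            } catch (IOException e) {
                // A stale URL is reported and skipped; it should not fail the build.
                System.out.println("unavailable, skipping: " + url);
            }
        }
    }
}
```

Treating a dead link as a skip rather than a failure is deliberate here, 
since the list will rot over time and the regular build must not depend on it.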

The same problem applies to POI. I used a govdocs document as an example 
in POI issue number 51524. Sergey Vladimirov committed it to Apache SVN. 
Now Jukka says that this is unacceptable. Should the 51524 test be 
disabled and the file deleted?

Antoni Myłka
antoni.mylka@gmail.com

[1] http://www.apache.org/legal/resolved.html
[2] https://issues.apache.org/jira/browse/PDFBOX-391
[3] http://digitalcorpora.org/corpora/files
[4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
[5] http://domex.nps.edu/corp/files/govdocs1/
