Message-ID:
<4E4A899D.5070205@poczta.onet.pl>
Date: Tue, 16 Aug 2011 17:15:41 +0200
From: Antoni Mylka
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20110624 Thunderbird/5.0
MIME-Version: 1.0
To: dev@pdfbox.apache.org
CC: POI Developers List
Subject: Re: Test documents
References: <4E4A4A6D.7030803@poczta.onet.pl>
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Virus-Checked: Checked by ClamAV on apache.org

Hi,

I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for
handling test documents which are public, but not explicitly licensed to
the ASF for "redistribution".

On 2011-08-16 14:29, Jukka Zitting wrote:
> Hi,
>
> On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
> wrote:
>> Is this because pdfbox is liberal (don't require unit tests, keep the
>> barriers to patches low), or conservative (copyright on the pdfs is tricky,
>> don't commit them)? Is there any "official" policy?
>
> Better test coverage is always a good thing and should be our goal.
>
> That said, many of the example PDF files we see (like the one on
> PDFBOX-1010) don't come with a license that would allow them to be
> redistributed as a part of an Apache project. See [1] for Apache
> guidelines on how to handle external material that hasn't explicitly
> been contributed for redistribution by the ASF.
>
> See also [2] for related earlier work in dealing with test files with
> unknown or unacceptable licensing status.
>
>> I do much of my text-extraction regression testing on the "govdocs1" dataset
>> [1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have
>> been downloaded from public-facing websites owned by some US Government
>> organization. They are all public, yet the copyright cannot be transferred
>> to ASF. Are they OK?
>
> This is probably a question best answered by legal-discuss@apache.org.
> My intuition says that the best way to handle such material would be
> by reference.
> For example a test case could refer to specific
> documents within the corpus by path or document id, and would only be
> executed when the user has explicitly downloaded the corpus and made
> it available to the PDFBox build.

The ASF Legal FAQ [1] doesn't seem to say much about "external
material" that is not a library. I guess I'd ask on legal-discuss.

My idea is to include such tests in a separate suite which would
download the docs using some URL list. The suite would NOT run by
default. It could even live outside the main source tree, because URL
lists can quickly get out of date and a release must still compile
after 10 years.

This would allow for automated testing of docs from govdocs1 [3,4,5],
JIRA issues, old pdfbox SF issues and any public website stable enough
to hold a file for a long time, that is, everything which by ASF policy
cannot be committed to SVN. Do you think it's a good idea?

The same problem applies to POI. I used a govdocs document as an
example in POI issue number 51524. Sergey Vladimirov committed it to
Apache SVN. Now Jukka says that this is unacceptable. Should the 51524
test be disabled and the file deleted?

Antoni Myłka
antoni.mylka@gmail.com

[1] http://www.apache.org/legal/resolved.html
[2] https://issues.apache.org/jira/browse/PDFBOX-391
[3] http://digitalcorpora.org/corpora/files
[4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
[5] http://domex.nps.edu/corp/files/govdocs1/
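
P.S. To make the opt-in suite idea above more concrete, here is a rough
sketch of what the helper could look like. Everything in it is an
assumption for illustration only: the class name ExternalCorpus, the
system property external.corpus.enabled, the one-URL-per-line list
format and the example.org URL are all hypothetical, and none of this
exists in PDFBox or POI today.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for an opt-in test suite; not part of PDFBox or POI. */
class ExternalCorpus {

    /** Assumed property name; the suite is skipped unless it is set to "true". */
    static final String ENABLED_PROPERTY = "external.corpus.enabled";

    /** The suite is off by default and only runs when explicitly requested. */
    static boolean isEnabled() {
        return Boolean.getBoolean(ENABLED_PROPERTY);
    }

    /** Parses a URL list: one URL per line, blank lines and '#' comments ignored. */
    static List<URI> parseUrlList(List<String> lines) {
        List<URI> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(URI.create(trimmed));
        }
        return urls;
    }

    /** Derives the local file name from the last path segment of the URL. */
    static String localName(URI url) {
        String path = url.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Downloads a document into the cache directory unless it is already there. */
    static Path fetch(URI url, Path cacheDir) throws IOException {
        Files.createDirectories(cacheDir);
        Path target = cacheDir.resolve(localName(url));
        if (!Files.exists(target)) {
            try (InputStream in = url.toURL().openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (!isEnabled() || args.length == 0) {
            System.out.println("External corpus tests skipped; pass a URL list and set -D"
                    + ENABLED_PROPERTY + "=true to run them.");
            return;
        }
        Path cacheDir = Paths.get(System.getProperty("java.io.tmpdir"), "external-corpus");
        for (URI url : parseUrlList(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println("Fetched " + fetch(url, cacheDir));
        }
    }
}
```

A test in such a suite would first check isEnabled() and skip itself
otherwise, so a default build never touches the network and none of
the documents ever need to be committed. Caching by file name is a
simplification; a real list would probably need document ids or
checksums to cope with URLs that change.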