pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: (pdffile) does not allow extracting content
Date Tue, 23 Feb 2016 19:54:16 GMT
On 23.02.2016 at 20:44, Brzrk One wrote:
> The file is:
> http://www.bmv.com.mx/docs-pub/infoifrs/infoifrs_588674_2015-01_1.pdf

The file is indeed protected against text extraction. Our command-line
utilities respect this. The PDFTextStripper methods ignore it; they
expect you to handle it yourself. See the examples in the source code
for how to extract text.

Tilman
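
As a minimal sketch of what "handling it" might involve: the
ContentCopying: NotAllowed setting corresponds to bit 5 ("copy or extract
text and graphics") of the /P permissions integer in the PDF encryption
dictionary, which PDFBox exposes through AccessPermission.canExtractContent().
The stand-alone Java below does not use PDFBox; the class name
PermissionBits and the sample /P values are hypothetical, chosen only to
illustrate the bit test a caller could make before running PDFTextStripper.

```java
public final class PermissionBits {
    // The PDF specification numbers permission bits from 1,
    // so bit 5 has the integer value 1 << 4 and bit 10 has 1 << 9.
    public static final int EXTRACT_BIT = 1 << 4;                   // copy or extract text and graphics
    public static final int EXTRACT_FOR_ACCESSIBILITY_BIT = 1 << 9; // extract for accessibility tools

    private PermissionBits() {
    }

    /** Returns true if the /P value allows copying or extracting content. */
    public static boolean canExtractContent(int p) {
        return (p & EXTRACT_BIT) != 0;
    }

    /** Returns true if the /P value allows extraction for accessibility. */
    public static boolean canExtractForAccessibility(int p) {
        return (p & EXTRACT_FOR_ACCESSIBILITY_BIT) != 0;
    }

    public static void main(String[] args) {
        // Hypothetical /P values: -4 sets every permission bit,
        // -20 is the same value with bit 5 cleared.
        int allAllowed = -4;
        int copyingDenied = allAllowed & ~EXTRACT_BIT;

        for (int p : new int[] {allAllowed, copyingDenied}) {
            if (canExtractContent(p)) {
                System.out.println("/P=" + p + ": extraction permitted");
            } else {
                // A caller that wants to respect the flag would stop here,
                // as the ExtractText command-line utility does.
                System.out.println("/P=" + p + ": extraction not permitted");
            }
        }
    }
}
```

In PDFBox itself the equivalent check would be made on the AccessPermission
object returned by PDDocument.getCurrentAccessPermission().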

>
> On Tue, Feb 23, 2016 at 12:05 PM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 23.02.2016 at 17:53, Brzrk One wrote:
>>
>>> With pdfbox-1.8.11, using the bottom-up parser (loadNonSeq) on a document
>>> that has security ContentCopying: NotAllowed results in:
>>>
>>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser - PDF file
>>> 'some_temp_file.pdf' does not allow extracting content
>>>
>>> And the output pages are all blank.
>>>
>>> The top-down parser (load) has no such issue.
>>>
>>> Is there a workaround?
>>>
>>>
>> I looked in the source code; this warning is emitted only by the
>> non-sequential parser. There's a similar error message in the ExtractText
>> command-line utility ("You do not have permission to extract text").
>>
>> It would be best to upload the file somewhere.
>>
>> Tilman
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>



