pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: (pdffile) does not allow extracting content
Date Tue, 23 Feb 2016 20:40:41 GMT
On 23.02.2016 at 21:33, Brzrk One wrote:
> loadNonSeq() seems to respect it too?
> Sneaking in a bogus AccessPermissions.canExtractContent() did not alter
> this.
> What does load() do that loadNonSeq() does not? (Or vice versa.)

They use different parsing strategies. Another difference is that
loadNonSeq decrypts immediately, which is when the warning is logged.

Tilman
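
For readers unfamiliar with what "protected against text extraction" means at the file level: the restriction is a single bit in the integer permissions value (/P) of the PDF's encryption dictionary. The PDF specification numbers these bits from 1, and bit 5 governs copying or extracting text and graphics; this is the flag that canExtractContent() reports. The sketch below is standalone Java and does not use PDFBox. The class name, the method names, and the sample value -4 are hypothetical and chosen only for illustration.

```java
/**
 * Decodes the "extract content" flag from a PDF /P permissions value.
 * Bit positions are 1-based, as in the PDF specification.
 * Illustrative only: this is not PDFBox code.
 */
final class PermissionBits {
    // Bit 5 of /P: copy or otherwise extract text and graphics.
    private static final int EXTRACT_BIT = 5;

    private PermissionBits() {
    }

    /** Returns true if the given 1-based bit is set in the permissions value. */
    static boolean isBitSet(int permissions, int bit) {
        return (permissions & (1 << (bit - 1))) != 0;
    }

    /** Returns true if the permissions value allows content extraction. */
    static boolean canExtractContent(int permissions) {
        return isBitSet(permissions, EXTRACT_BIT);
    }

    /** Returns a copy of the permissions value with the extract bit cleared. */
    static int denyExtraction(int permissions) {
        return permissions & ~(1 << (EXTRACT_BIT - 1));
    }

    public static void main(String[] args) {
        // Hypothetical /P value with every permission bit set.
        int allAllowed = -4;
        int noCopy = denyExtraction(allAllowed);
        System.out.println("all allowed: " + canExtractContent(allAllowed));
        System.out.println("copy denied: " + canExtractContent(noCopy));
    }
}
```

A caller that wants the same behaviour as the command line utilities would check this flag before extracting text and stop if it is not set.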

>
> On Tue, Feb 23, 2016 at 2:54 PM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 23.02.2016 at 20:44, Brzrk One wrote:
>>
>>> The file is:
>>> http://www.bmv.com.mx/docs-pub/infoifrs/infoifrs_588674_2015-01_1.pdf
>>>
>> The file is indeed protected against text extraction. Our command line
>> utilities respect this. The PDFTextStripper methods ignore it; they
>> expect you to handle it yourself. See the examples source code for how
>> to extract text.
>>
>> Tilman
>>
>>
>>
>>> On Tue, Feb 23, 2016 at 12:05 PM, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> On 23.02.2016 at 17:53, Brzrk One wrote:
>>>>> With pdfbox-1.8.11, using the bottom-up parser (loadNonSeq) on a document
>>>>> that has security ContentCopying: NotAllowed results in:
>>>>>
>>>>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser - PDF file
>>>>> 'some_temp_file.pdf' does not allow extracting content
>>>>>
>>>>> And the output pages are all blank.
>>>>>
>>>>> The top-down parser (load) has no such issue.
>>>>>
>>>>> Is there a workaround?
>>>>>
>>>>
>>>> I looked in the source code; this warning comes only from the
>>>> non-sequential parser. There's a similar error message in the
>>>> ExtractText command line utility ("You do not have permission to
>>>> extract text").
>>>>
>>>> It would be best to upload the file somewhere.
>>>>
>>>> Tilman
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>
>>>>
>>>>

