pdfbox-users mailing list archives

From Brzrk One <brz...@gmail.com>
Subject Re: (pdffile) does not allow extracting content
Date Tue, 23 Feb 2016 22:30:05 GMT
Well, I'm looking at the output of our application code, which is
unfortunately both proprietary and much more complicated than the command
line utilities.

After loading this document with loadNonSeq(), getting the warning, and
continuing to do our processing (mostly adding annotations), the output
contains all blank pages.

The same processing code with load() produces output with content on all
the pages.

On Tue, Feb 23, 2016 at 4:22 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> On 23.02.2016 at 22:19, Brzrk One wrote:
>
>> I get all that. I just don't see where in loadNonSeq() it is refusing to
>> copy content.
>>
>
> Not at all. It is really just a warning when opening.
>
> The refusal is in the command line utility source.
>
>
> Tilman
>
>
>> On Tue, Feb 23, 2016 at 3:40 PM, Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>>
>>> On 23.02.2016 at 21:33, Brzrk One wrote:
>>>
>>>> loadNonSeq() seems to respect it too?
>>>> Sneaking in a bogus AccessPermissions.canExtractContent() did not alter
>>>> this.
>>>> What does load() do that loadNonSeq() does not? (Or vice versa.)
>>>>
>>> They use different parsing strategies. Additionally, a difference is that
>>> loadNonSeq immediately decrypts, and brings up the warning.
>>>
>>>
>>> Tilman
>>>
>>>
>>>> On Tue, Feb 23, 2016 at 2:54 PM, Tilman Hausherr <THausherr@t-online.de>
>>>> wrote:
>>>>
>>>>> On 23.02.2016 at 20:44, Brzrk One wrote:
>>>>>
>>>>>> The file is:
>>>>>>
>>>>>> http://www.bmv.com.mx/docs-pub/infoifrs/infoifrs_588674_2015-01_1.pdf
>>>>>>
>>>>> The file is indeed protected against text extraction. Our command line
>>>>> utilities respect this. The methods (of PDFTextStripper) ignore it;
>>>>> they expect you to handle it. See the examples source code for how to
>>>>> extract text.
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>>
>>>>>> On Tue, Feb 23, 2016 at 12:05 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>> wrote:
>>>>>>
>>>>>>> On 23.02.2016 at 17:53, Brzrk One wrote:
>>>>>>>
>>>>>>>> With pdfbox-1.8.11, using the bottom-up parser (loadNonSeq) on a
>>>>>>>> document that has security ContentCopying: NotAllowed results in:
>>>>>>>>
>>>>>>>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser - PDF file
>>>>>>>> 'some_temp_file.pdf' does not allow extracting content
>>>>>>>>
>>>>>>>> And the output pages are all blank.
>>>>>>>>
>>>>>>>> The top-down parser (load) has no such issue.
>>>>>>>>
>>>>>>>> Is there a workaround?
>>>>>>>>
>>>>>>> I looked in the source code; this warning comes only in the
>>>>>>> non-sequential parser. There's a similar error message in the
>>>>>>> ExtractText command line utility ("You do not have permission to
>>>>>>> extract text").
>>>>>>>
>>>>>>> The best would be to upload the file somewhere.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org

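The point made in the quoted thread, that the command line utilities refuse to extract when the document forbids it while the library methods leave that decision to the caller, can be illustrated with a minimal, library-free sketch. It assumes the standard PDF permission flags layout, in which bit 5 (value 16) is the copy/extract permission. The class and method names below are hypothetical and are not PDFBox APIs; in PDFBox itself the corresponding check is the canExtractContent() call mentioned in the thread. The sketch shows only the caller-side guard and does not address the blank-page output, which the thread leaves unresolved.

```java
// Minimal sketch, NOT PDFBox code: all names here are hypothetical.
class ExtractPermissionSketch {
    // Bit 5 of the PDF permission flags (1 << 4 == 16): copy or extract content.
    static final int EXTRACT_BIT = 1 << 4;

    /** Returns true if the permission flags allow copying or extracting content. */
    static boolean canExtractContent(int permissionFlags) {
        return (permissionFlags & EXTRACT_BIT) != 0;
    }

    /**
     * Caller-side guard in the spirit of the command line utilities described
     * in the thread: refuse to hand out text when the flags forbid it.
     */
    static String extractIfAllowed(int permissionFlags, String pageText) {
        if (!canExtractContent(permissionFlags)) {
            throw new SecurityException("You do not have permission to extract text");
        }
        return pageText;
    }

    public static void main(String[] args) {
        int allowAll = -4;                      // every permission bit set, low two bits clear
        int noCopy = allowAll & ~EXTRACT_BIT;   // same flags with copy/extract cleared

        System.out.println("allowAll permits extraction: " + canExtractContent(allowAll));
        System.out.println("noCopy permits extraction: " + canExtractContent(noCopy));
    }
}
```

Application code that calls a text extraction method directly would place a guard like extractIfAllowed() in front of that call, since, per the thread, the library will not enforce the restriction on its own.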