pdfbox-users mailing list archives

From Brzrk One <brz...@gmail.com>
Subject Re: (pdffile) does not allow extracting content
Date Tue, 23 Feb 2016 23:27:05 GMT
Yea, I think that's it.
Comparing the input pdf to the loadNonSeq() output, I see objects that have
the same content.
This means that the loadNonSeq() output is encrypted - like the input -
while the load() output is not. However, the loadNonSeq() output has no
/Encrypt dictionary.
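
A minimal sketch, using only java.util.zip rather than PDFBox, of why stream data that is still encrypted would fail in a Flate decoder with a DataFormatException like the one quoted below: bytes that are not a valid zlib stream cannot be inflated. The class name InflateCheck, the sample content, and the XOR step (a stand-in for real encryption) are hypothetical and only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflateCheck {

    /** Compresses data into a zlib (Flate) stream. */
    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns true if the bytes inflate cleanly to the end of a zlib stream. */
    public static boolean canInflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false; // truncated stream or preset dictionary required
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // not a valid zlib stream
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] plain = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] flate = deflate(plain);

        // Stand-in for encryption: XOR every byte so the zlib header is destroyed.
        byte[] scrambled = flate.clone();
        for (int i = 0; i < scrambled.length; i++) {
            scrambled[i] ^= 0x5A;
        }

        System.out.println("valid zlib stream inflates: " + canInflate(flate));
        System.out.println("scrambled stream inflates:  " + canInflate(scrambled));
    }
}
```

If the loadNonSeq() output really does carry encrypted stream bytes without an /Encrypt dictionary, a reader has no way to know it must decrypt before inflating, which would be consistent with the blank pages.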

I am using this on both paths:
    StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
    document.openProtection(sdm);

without error.
Is this a feature of loadNonSeq() in the face of
AccessPermission.canExtractContent() == true?
Or did I do something wrong here?
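
For reference, canExtractContent() corresponds to bit 5 of the /P entry in the encryption dictionary, with bit positions counted from 1 at the low-order end. A minimal sketch of that bit test without PDFBox; the class name PermissionBits and the sample /P values are hypothetical:

```java
public class PermissionBits {

    // Bit 5 of /P (1-based, low-order first): copy or extract text and graphics.
    static final int EXTRACT_BIT = 5;

    /** Tests one 1-based permission bit in a /P value. */
    public static boolean isBitSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    /** True if the /P value allows copying or extracting content. */
    public static boolean canExtractContent(int p) {
        return isBitSet(p, EXTRACT_BIT);
    }

    public static void main(String[] args) {
        int allAllowed = -1;             // every permission bit set
        int noExtract = -1 & ~(1 << 4);  // same value with bit 5 cleared
        System.out.println("all allowed: " + canExtractContent(allAllowed));
        System.out.println("no extract:  " + canExtractContent(noExtract));
    }
}
```

This only decodes the permission flag; it says nothing about whether the streams themselves have been decrypted.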


On Tue, Feb 23, 2016 at 5:54 PM, Brzrk One <brzrk1@gmail.com> wrote:

> Ah! A clue!
>
> Opening the load() output with PDFDebugger from 2.0RC3 complains of a
> Registry error:
>    Feb 23, 2016 5:36:20 PM java.util.prefs.WindowsPreferences <init>
> WARNING:
>    Could not open/create prefs root node Software\JavaSoft\Prefs at root
> 0x80000002.
>    Windows RegCreateKeyEx(...) returned error code 5.
>
> Which I promptly ignored.
>
> Opening the loadNonSeq() output with PDFDebugger from 2.0RC3 includes the
> same Registry error, plus this diagnostic:
>
> Feb 23, 2016 5:36:33 PM org.apache.pdfbox.filter.FlateFilter decode
>     SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
>
> and the stack trace shown below in a dialog box whenever it attempts to
> display a Page object,
> each due to:
>     java.util.zip.DataFormatException: unknown compression method
>
> The two output document structures are mostly the same; however, for the
> loadNonSeq() output, the Info dictionary looks like it is still encoded:
>
> [image: Inline image 1]
>
> whereas, for the load() output, the Info dictionary looks decoded:
> [image: Inline image 2]
>
> Is this a decryption/re-encryption issue because of the /P setting?
>
> Here's the stack trace:
>
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.io.IOException: java.util.zip.DataFormatException: unknown compression
> method
>
> org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.done(PagePane.java:175)
>     sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:95)
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
> java.util.zip.DataFormatException: unknown compression method
>
> org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.done(PagePane.java:164)
>     sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:95)
> Caused by: java.io.IOException: java.util.zip.DataFormatException: unknown
> compression method
>     org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
>     org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>     org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:148)
>
> org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
>
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:461)
>
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
>
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>     org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
>
> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
>
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
>
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:68)
>
> org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:155)
>
> org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:138)
>     java.lang.Thread.run(Thread.java:619)
> Caused by: java.util.zip.DataFormatException: unknown compression method
>     java.util.zip.Inflater.inflateBytes(Native Method)
>     java.util.zip.Inflater.inflate(Inflater.java:238)
>     java.util.zip.Inflater.inflate(Inflater.java:256)
>     org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:104)
>     org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
>     org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>     org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:148)
>
> org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
>
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:461)
>
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
>
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>     org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
>
> org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
>
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
>
> org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:68)
>
> org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:155)
>
> org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:138)
>     java.lang.Thread.run(Thread.java:619)
>
> On Tue, Feb 23, 2016 at 5:30 PM, Brzrk One <brzrk1@gmail.com> wrote:
>
>> Well, I'm looking at the output of our application code, which is
>> unfortunately both proprietary and much more complicated than the command
>> line utilities.
>>
>> After loading this document with loadNonSeq(), getting the warning, and
>> continuing to do our processing (mostly adding annotations), the output
>> contains all blank pages.
>>
>> The same processing code with load() produces output with content on all
>> the pages.
>>
>> On Tue, Feb 23, 2016 at 4:22 PM, Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>>
>>> On 23.02.2016 at 22:19, Brzrk One wrote:
>>>
>>>> I get all that. I just don't see where in loadNonSeq() it is refusing to
>>>> copy content.
>>>>
>>>
>>> Not at all. It is really just a warning when opening.
>>>
>>> The refusal is in the command line utility source.
>>>
>>>
>>> Tilman
>>>
>>>
>>>> On Tue, Feb 23, 2016 at 3:40 PM, Tilman Hausherr <THausherr@t-online.de>
>>>> wrote:
>>>>
>>>>> On 23.02.2016 at 21:33, Brzrk One wrote:
>>>>>
>>>>>> loadNonSeq() seems to respect it too?
>>>>>> Sneaking in a bogus AccessPermissions.canExtractContent() did not
>>>>>> alter this.
>>>>>> What does load() do that loadNonSeq() does not? (Or vice versa.)
>>>>>
>>>>> They use different parsing strategies. Additionally, a difference is
>>>>> that loadNonSeq immediately decrypts, and brings up the warning.
>>>>>
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>>> On Tue, Feb 23, 2016 at 2:54 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>> wrote:
>>>>>>
>>>>>>> On 23.02.2016 at 20:44, Brzrk One wrote:
>>>>>>>
>>>>>>>> The file is:
>>>>>>>>
>>>>>>>> http://www.bmv.com.mx/docs-pub/infoifrs/infoifrs_588674_2015-01_1.pdf
>>>>>>>
>>>>>>> The file is indeed protected against text extraction. Our command
>>>>>>> line utilities respect this. The methods (of PDFTextStripper) ignore
>>>>>>> it; they expect you to handle it. See the examples source code for
>>>>>>> how to extract text.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, Feb 23, 2016 at 12:05 PM, Tilman Hausherr
>>>>>>>> <THausherr@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> On 23.02.2016 at 17:53, Brzrk One wrote:
>>>>>>>>>
>>>>>>>>>> With pdfbox-1.8.11, using the bottom-up parser (loadNonSeq) on a
>>>>>>>>>> document that has security ContentCopying: NotAllowed results in:
>>>>>>>>>>
>>>>>>>>>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser - PDF file
>>>>>>>>>> 'some_temp_file.pdf' does not allow extracting content
>>>>>>>>>>
>>>>>>>>>> And the output pages are all blank.
>>>>>>>>>>
>>>>>>>>>> The top-down parser (load) has no such issue.
>>>>>>>>>>
>>>>>>>>>> Is there a workaround?
>>>>>>>>>
>>>>>>>>> I looked in the source code; this warning comes only in the
>>>>>>>>> non-sequential parser. There's a similar error message in the
>>>>>>>>> ExtractText command line utility ("You do not have permission to
>>>>>>>>> extract text").
>>>>>>>>>
>>>>>>>>> The best would be to upload the file somewhere.
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
