pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: (pdffile) does not allow extracting content
Date Wed, 24 Feb 2016 07:02:42 GMT
On 24.02.2016 at 00:27, Brzrk One wrote:
> Yea, I think that's it.
> Comparing the input pdf to the loadNonSeq() output, I see objects that 
> have the same content.
> This means that the loadNonSeq() output is encrypted - like the input 
> - while the load() output is not. However, the loadNonSeq() output has 
> no /Encrypt dictionary.
>
> I am using this on both paths:
>     StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
>     document.openProtection(sdm);

You shouldn't use this on loadNonSeq, or in 2.0 (it isn't available 
there anyway).

You only need it with load() in 1.8.

>
> without error.
> Is this a feature of loadNonSeq() in the face of 
> AccessPermission.canExtractContent() == true?
> Or did I do something wrong here?

You need openProtection() only with load() in 1.8 and only if the file 
is encrypted. (Yours is)
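
Editorial illustration of the FlateFilter failure quoted further down: if a stream's bytes are still encrypted when they reach the inflater, a DataFormatException is the expected symptom. The sketch below uses only java.util.zip and does not call PDFBox. The class name, the helper names, and the XOR "encryption" are assumptions made for this example; XOR merely stands in for the real RC4/AES handling, and the exact exception message can differ from the "unknown compression method" seen in the trace.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class StillEncryptedStreamDemo {

    /** Compresses data with zlib/deflate, the format behind /FlateDecode. */
    public static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        // Generous buffer: fine for the short inputs used in this sketch.
        byte[] buffer = new byte[plain.length + 64];
        int length = deflater.deflate(buffer);
        deflater.end();
        return Arrays.copyOf(buffer, length);
    }

    /** Hypothetical stand-in for encryption: XOR each byte (NOT real RC4/AES). */
    public static byte[] scramble(byte[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ 0x5A);
        }
        return out;
    }

    /** Returns true if the bytes inflate cleanly, false if zlib rejects them. */
    public static boolean inflates(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        try {
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false; // truncated or otherwise unusable input
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // the inflater refused the byte sequence
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(content);
        System.out.println("decrypted stream inflates: " + inflates(compressed));
        System.out.println("still-encrypted stream inflates: " + inflates(scramble(compressed)));
    }
}
```

This matches the symptom described in the thread: a saved file whose streams are still encrypted but which has no /Encrypt dictionary gives a reader nothing to decrypt with, so the raw bytes go straight to the Flate decoder.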

Tilman
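
Editorial note on the /P setting raised in the quoted messages below: the permission flags are a bit field in the encryption dictionary, and in the PDF specification (PDF 32000-1, Table 22) bit 5, counting from 1, governs copying or extracting text and graphics. The sketch below decodes such a value without PDFBox. The class name, method names, and the two sample /P values are hypothetical and were chosen only for illustration; they are not taken from the file discussed in this thread. As noted elsewhere in the thread, the parser only warns about this flag, and it is up to the calling code to honor it.

```java
public class PermissionBitsDemo {

    // Bit positions are 1-based, as in the PDF specification's permission table.
    static final int PRINT_BIT = 3;
    static final int MODIFY_BIT = 4;
    static final int EXTRACT_BIT = 5;

    /** Tests a 1-based bit position in the signed 32-bit /P value. */
    public static boolean isBitSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    public static boolean canPrint(int p) {
        return isBitSet(p, PRINT_BIT);
    }

    public static boolean canModify(int p) {
        return isBitSet(p, MODIFY_BIT);
    }

    public static boolean canExtractContent(int p) {
        return isBitSet(p, EXTRACT_BIT);
    }

    public static void main(String[] args) {
        int restricted = -60; // hypothetical /P: printing allowed, extraction not
        int permissive = -44; // hypothetical /P: printing and extraction allowed
        for (int p : new int[] {restricted, permissive}) {
            System.out.println("/P " + p
                    + " print=" + canPrint(p)
                    + " modify=" + canModify(p)
                    + " extract=" + canExtractContent(p));
        }
    }
}
```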

>
>
> On Tue, Feb 23, 2016 at 5:54 PM, Brzrk One <brzrk1@gmail.com 
> <mailto:brzrk1@gmail.com>> wrote:
>
>     Ah! A clue!
>
>     Opening the load() output with PDFDebugger from 2.0RC3 complains
>     of a Registry error:
>        Feb 23, 2016 5:36:20 PM java.util.prefs.WindowsPreferences
>     <init> WARNING:
>        Could not open/create prefs root node Software\JavaSoft\Prefs
>     at root 0x80000002.
>        Windows RegCreateKeyEx(...) returned error code 5.
>
>     Which I promptly ignored.
>
>     Opening the loadNonSeq() output with PDFDebugger from 2.0RC3
>     includes the same Registry error, plus this diagnostic:
>
>     Feb 23, 2016 5:36:33 PM org.apache.pdfbox.filter.FlateFilter decode
>         SEVERE: FlateFilter: stop reading corrupt stream due to a
>     DataFormatException
>
>     and the stack trace shown below in a dialog box whenever it
>     attempts to display a Page object,
>     each due to:
>         java.util.zip.DataFormatException: unknown compression method
>
>     The two output document structures are mostly the same; however,
>     for the loadNonSeq() output, the Info dictionary looks like it is
>     still encoded:
>
>     Inline image 1
>
>     whereas, for the load() output, the Info dictionary looks decoded:
>     Inline image 2
>
>     Is this a decryption/re-encryption issue because of the /P setting?
>
>     Here's the stack trace:
>
>     java.lang.RuntimeException:
>     java.util.concurrent.ExecutionException: java.io.IOException:
>     java.util.zip.DataFormatException: unknown compression method
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.done(PagePane.java:175)
>     sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:95)
>     Caused by: java.util.concurrent.ExecutionException:
>     java.io.IOException: java.util.zip.DataFormatException: unknown
>     compression method
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.done(PagePane.java:164)
>     sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:95)
>     Caused by: java.io.IOException: java.util.zip.DataFormatException:
>     unknown compression method
>     org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
>     org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>     org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:148)
>     org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:461)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>     org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
>     org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
>     org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
>     org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:68)
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:155)
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:138)
>         java.lang.Thread.run(Thread.java:619)
>     Caused by: java.util.zip.DataFormatException: unknown compression
>     method
>         java.util.zip.Inflater.inflateBytes(Native Method)
>         java.util.zip.Inflater.inflate(Inflater.java:238)
>         java.util.zip.Inflater.inflate(Inflater.java:256)
>     org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:104)
>     org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
>     org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>     org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:148)
>     org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:461)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>     org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
>     org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
>     org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
>     org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:68)
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:155)
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:138)
>         java.lang.Thread.run(Thread.java:619)
>
>     On Tue, Feb 23, 2016 at 5:30 PM, Brzrk One <brzrk1@gmail.com
>     <mailto:brzrk1@gmail.com>> wrote:
>
>         Well, I'm looking at the output of our application code, which
>         is unfortunately both proprietary and much more complicated
>         than the command line utilities.
>
>         After loading this document with loadNonSeq(), getting the
>         warning, and continuing to do our processing (mostly adding
>         annotations), the output contains all blank pages.
>
>         The same processing code with load() produces output with
>         content on all the pages.
>
>         On Tue, Feb 23, 2016 at 4:22 PM, Tilman Hausherr
>         <THausherr@t-online.de <mailto:THausherr@t-online.de>> wrote:
>
>             On 23.02.2016 at 22:19, Brzrk One wrote:
>
>                 I get all that. I just don't see where in loadNonSeq()
>                 it is refusing to
>                 copy content.
>
>
>             Not at all. It is really just a warning when opening.
>
>             The refusal is in the command line utility source.
>
>
>             Tilman
>
>
>                 On Tue, Feb 23, 2016 at 3:40 PM, Tilman Hausherr
>                 <THausherr@t-online.de <mailto:THausherr@t-online.de>>
>                 wrote:
>
>                     On 23.02.2016 at 21:33, Brzrk One wrote:
>
>                         loadNonSeq() seems to respect it too?
>                         sneaking in a bogus
>                         AccessPermissions.canExtractContent() did not
>                         alter
>                         this.
>                         What does load() do that loadNonSeq() does
>                         not? (Or vice versa.)
>
>                     They use different parsing strategies.
>                     Additionally, a difference is that
>                     loadNonSeq immediately decrypts, and brings up the
>                     warning.
>
>
>                     Tilman
>
>
>                         On Tue, Feb 23, 2016 at 2:54 PM, Tilman
>                         Hausherr <THausherr@t-online.de
>                         <mailto:THausherr@t-online.de>>
>                         wrote:
>
>                         On 23.02.2016 at 20:44, Brzrk One wrote:
>
>                             The file is:
>
>                                 http://www.bmv.com.mx/docs-pub/infoifrs/infoifrs_588674_2015-01_1.pdf
>
>                             The file is indeed protected against
>                             text extraction. Our command line
>                             utilities respect this. The methods (of
>                             PDFTextStripper) ignore it; they
>                             expect you to handle it. See the
>                             examples source code for how to extract
>                             text.
>
>                             Tilman
>
>
>
>                             On Tue, Feb 23, 2016 at 12:05 PM, Tilman
>                             Hausherr <THausherr@t-online.de
>                             <mailto:THausherr@t-online.de>
>
>                                 wrote:
>
>                                 On 23.02.2016 at 17:53, Brzrk One wrote:
>
>                                         With pdfbox-1.8.11, using the
>                                         bottom-up parser (loadNonSeq) on a
>                                         document that has security
>                                         ContentCopying: NotAllowed
>                                         results in:
>
>                                         org.apache.pdfbox.pdfparser.NonSequentialPDFParser
>                                         - PDF file
>                                         'some_temp_file.pdf' does not
>                                         allow extracting content
>
>                                         And the output pages are all
>                                         blank.
>
>                                         The top-down parser (load) has
>                                         no such issue.
>
>                                         Is there a workaround?
>
>
>                                     I looked in the source code,
>                                     this warning comes only in the non
>                                     sequential parser. There's a similar
>                                     error message in the ExtractText
>                                     command line utility ("You do not
>                                     have permission to extract text").
>
>                                     The best would be to upload the
>                                     file somewhere.
>
>                                     Tilman
>
>
>                                     ---------------------------------------------------------------------
>                                     To unsubscribe, e-mail:
>                                     users-unsubscribe@pdfbox.apache.org <mailto:users-unsubscribe@pdfbox.apache.org>
>                                     For additional commands, e-mail:
>                                     users-help@pdfbox.apache.org
>                                     <mailto:users-help@pdfbox.apache.org>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

