pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: How to flatedecode and find all acroform fields in a compressed PDF
Date Fri, 22 May 2015 19:54:29 GMT
On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
> Hello,
>
> I used PdfDebugger to make the internal PDF structure of the two files (1)
> interview.pdf and (2) interview_compressed.pdf  visually available and I
> have uploaded my images to imageshack. Here are the five links:
>
> http://imageshack.com/a/img538/8277/JghCpG.jpg
> http://imageshack.com/a/img909/6140/KsYNGR.jpg
> http://imageshack.com/a/img903/8644/mk15As.jpg
> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>
> The first two links are from the internal structure of interview.pdf
> (original uncompressed file)
> The third and fourth links are from the internal structure of
> interview_compressed.pdf (compressed file)
> The fifth link compares the file sizes of the two files and as you can also
> see, the difference is huge.
>
> As you might notice, the file interview_compressed.pdf has no acroform

Indeed... but this is needed - from the spec:

"The contents and properties of a document’s interactive form shall be 
defined by an interactive form dictionary that shall be referenced from 
the AcroForm entry in the document catalogue (see 7.7.2, “Document 
Catalog”). Table 218 shows the contents of this dictionary."

> fields listed even though opening the PDF in pdf reader allows me to enter
> values in places which look like AcroForm fields and also save them. Are
> there any other PDF 'types' similar to Acroform fields which would enable
> users to fill data and which can be accessed in PdfBox APIs without having
> to go through PDAcrofield?

Yes, annotations... there are some common parts, but this is just a 
vague observation from me, I'm not the acroform specialist.

What you should do: use NOTEPAD++ to check whether there's "/AcroForm" in
the "compressed" file.
- if it is missing, tell the client (or your boss) just that
- if it isn't missing, then there's some problem in PDFBox (try also the 
loadNonSeq I mentioned earlier)
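
The NOTEPAD++ check above can also be scripted. Here is a minimal sketch in
plain Java (no PDFBox), assuming the file is not encrypted. The class name
AcroFormScan and its methods are made up for illustration. It looks for the
bytes "/AcroForm" in the raw file. Because a PDF 1.5+ file may keep the
catalog inside a Flate-compressed object stream, where a text editor cannot
see it, the sketch also tries to inflate every stream...endstream block and
looks there too. It is a rough scan, not a PDF parser, so treat a "not
found" as a hint rather than proof.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox.
class AcroFormScan {
    static final byte[] KEY = "/AcroForm".getBytes(StandardCharsets.ISO_8859_1);
    static final byte[] OPEN = "stream".getBytes(StandardCharsets.ISO_8859_1);
    static final byte[] CLOSE = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    // Naive byte search: first index of needle at or after 'from', or -1.
    public static int indexOf(byte[] hay, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= hay.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (hay[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Best-effort zlib inflate; returns whatever could be decoded.
    public static byte[] inflate(byte[] data, int start, int len) {
        Inflater inf = new Inflater();
        inf.setInput(data, start, len);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0) break;          // needs more input or a dictionary
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Not Flate data (or damaged): keep what was decoded so far.
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    // True if "/AcroForm" shows up in the raw bytes or in any inflated stream.
    public static boolean mentionsAcroForm(byte[] pdf) {
        if (indexOf(pdf, KEY, 0) >= 0) return true;
        int pos = 0;
        while (true) {
            int s = indexOf(pdf, OPEN, pos);
            if (s < 0) return false;
            int start = s + OPEN.length;
            // Skip the end-of-line marker that follows the "stream" keyword.
            if (start < pdf.length && pdf[start] == '\r') start++;
            if (start < pdf.length && pdf[start] == '\n') start++;
            int e = indexOf(pdf, CLOSE, start);
            if (e < 0) return false;
            if (indexOf(inflate(pdf, start, e - start), KEY, 0) >= 0) return true;
            pos = e + CLOSE.length;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsAcroForm(pdf) ? "/AcroForm found" : "/AcroForm not found");
    }
}
```

On Java 11 or later this can be run directly, e.g.
java AcroFormScan.java interview_compressed.pdf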

Tilman

>
> You can use qpdf, then use these options:
>
> I will now try using this link to compress the original file.
>
> Another strategy to think about - can your client generate a
> non-confidential file, so that you can share it, and the "compressed" file?
>
> I wish I had direct communication with the clients but due to bureaucracy,
> I am having to go through multiple layers to get my message across to them.
> I will share more information as soon as I have them.
>
> PS: I sent these image links to my personal email first to make sure that I
> can open them. I could and so I am hoping you all could too. If you are
> unable to open them, please let me know.
>
> Thanks,
> Balaji
>
>
> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>
>>> Hi,
>>>
>>>   Balaji Venkatamohan <bvenkata@tibco.com> wrote on 20 May 2015 at 03:24:
>>>>
>>>>
>>>> Thank you for your pointers and sorry about the image. I am attaching it
>>>> with this email.
>>>>
>>>> The point I am trying to make is that the PDF, which was decompressed
>>>> using WriteDecodedDoc, is smaller in size than the original PDF given
>>>> to us by our customers.
>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>>> not have any PDAcroform fields whereas the decompressed PDF given to us
>>>> by the customers does contain Acroform fields. Hence I wanted to know
>>>> how to properly decompress the PDF using pdfbox APIs. The reason why I
>>>> was analyzing COSStream was to check if the decompression of the
>>>> compressed PDF was happening correctly while using PDFBox APIs.
>>>> I know it would have been difficult for you to help me without the actual
>>>> PDFs. For that, I would like to thank you for your time and pointers.
>>>>
>>> Maybe it's worth trying to share the file "visually" with us. Open both
>>> files (compressed and decompressed) with PDFDebugger [1] and post a
>>> screenshot of both somewhere (dropbox etc.) and share the link with us.
>>> Maybe that could shed some light on your issue.
>>>
>> @Balaji: here's an example of what such a screenshot would look like:
>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>
>> Tilman
>>
>>
>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>
>>>   On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>>> wrote:
>>>>
>>>>   Hi,
>>>>> The image doesn't appear in the mailing list.
>>>>>
>>>>> This is all very confusing... /acroform is in the document catalog. I
>>>>> don't see how the page content stream is related to it. The best is that
>>>>> you either go through the source code, or read the spec and then look at
>>>>> the pdf.
>>>>>
>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>> entry and then compare the two files.
>>>>>
>>>>> It is really difficult to help you without the files. The cause could
>>>>> be a bug in pdfbox, or a malformed pdf...
>>>>>
>>>>> Some more ideas:
>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>> - try the unreleased 2.0 version, that one has some improvements in the
>>>>> acroform stuff. Note that the API is different.
>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>
>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>> possible code that fails, and 2) post a small part of the raw PDF, i.e.
>>>>> the objects relevant to the field in your code.
>>>>>
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>
>>>>>>   Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>>> tried getting the COSStream for each page:
>>>>>>
>>>>>> // PDFBox 1.8 API: fetch the first page and its content stream
>>>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>>>> PDStream pdStream = firstPage.getContents();
>>>>>> COSStream stream = pdStream.getStream();
>>>>>>
>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>> mode, has the following:
>>>>>>
>>>>>>
>>>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>>>
>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>
>>>>>>   From this point on, using the COSStream object for every page, how
>>>>>> can I decompress and find out the acroform fields given that the
>>>>>> unFilteredStream object is null for COSStream?
>>>>>>
>>>>>>
>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>> <bvenkata@tibco.com> wrote:
>>>>>>
>>>>>>       Thank you for your response Tilman.
>>>>>>
>>>>>>       I had previously tried using the WriteDecodedDoc for my compressed
>>>>>>       PDF and I tried to get the number of acro form fields present in
>>>>>>       the output file generated by WriteDecodedDoc. The API still could
>>>>>>       not find the acro form fields in the generated decompressed file.
>>>>>>       Also the decompressed file generated is 75 KB which is far less
>>>>>>       than the original decompressed file which I have (1.6 MB) though I
>>>>>>       could edit the acro form fields using acrobat reader.
>>>>>>
>>>>>>       Thanks,
>>>>>>       Balaji
>>>>>>
>>>>>>
>>>>>>
>>>>>>       On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>       <THausherr@t-online.de> wrote:
>>>>>>
>>>>>>           Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>>>>>>
>>>>>>               My question is: how do I flatedecode a PDF so that I can
>>>>>>               find all the acroform fields within it? Any help or
>>>>>>               pointers would be highly appreciated.
>>>>>>
>>>>>>
>>>>>>           You could try the WriteDecodedDoc option of the command line app
>>>>>>           https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>
>>>>>>           Maybe you can have further ideas by comparing the two files
>>>>>>           with NOTEPAD++.... however the two files might have their
>>>>>>           objects in different order.
>>>>>>
>>>>>>           Tilman
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>>           To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>           <mailto:users-unsubscribe@pdfbox.apache.org>
>>>>>>           For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>           <mailto:users-help@pdfbox.apache.org>
>>>>>>
>>>>>>
>>>>>>
>>>>>>

