pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: How to flatedecode and find all acroform fields in a compressed PDF
Date Sat, 23 May 2015 06:58:22 GMT
Hi,

> On 22.05.2015 at 23:00, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
> 
> I opened the interview_compressed in notepad++ and did not see any
> 'Acroform' text anywhere.
> However, as Maruan suggested, I entered some data into what looks like form
> fields of interview_compressed.pdf and saved it. When I opened this file in
> notepad++, I did see 'Acroform' text in it. I also noticed an increase in
> file size from 21 KB to ~530 KB.
> 
> I then opened this filled-in, saved copy of the compressed PDF in
> PDFDebugger and saw that the field values are stored, though not under the
> AcroForm fields but under Annotations.



So AcroForm/Fields is an empty array?

> Please refer to this image:
> 
> http://imageshack.com/a/img540/9951/QGLDtS.jpg
> 
> So, whatever the compression technique was, it simply removed all the
> AcroForm fields from the original PDF but retained all the annotations,
> which also contain the interactive forms, and this is what reduced the file
> size so much? If this is the case, can the PDFBox API use a similar
> technique to compress such a huge file into a smaller one?
> 
> 
> 
> 
> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> wrote:
> 
>> Hi,
>> 
>>> On 22.05.2015 at 21:54, Tilman Hausherr <THausherr@t-online.de> wrote:
>>> 
>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>> Hello,
>>>> 
>>>> I used PdfDebugger to make the internal PDF structure of the two files,
>>>> (1) interview.pdf and (2) interview_compressed.pdf, visible, and I have
>>>> uploaded my images to imageshack. Here are the five links:
>>>> 
>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>> 
>>>> The first two links are from the internal structure of interview.pdf
>>>> (original uncompressed file)
>>>> The third and fourth links are from the internal structure of
>>>> interview_compressed.pdf (compressed file)
>>>> The fifth link compares the file sizes of the two files and as you can
>>>> also see, the difference is huge.
>>>> 
>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>> 
>>> Indeed... but this is needed - from the spec:
>>> 
>>> "The contents and properties of a document’s interactive form shall be
>>> defined by an interactive form dictionary that shall be referenced from the
>>> AcroForm entry in the document catalogue (see 7.7.2, “Document Catalog”).
>>> Table 218 shows the contents of this dictionary."
>>> 
>> 
>> correct
>> 
>>>> fields listed even though opening the PDF in pdf reader allows me to
>>>> enter values in places which look like AcroForm fields and also save
>>>> them. Are there any other PDF 'types' similar to Acroform fields which
>>>> would enable users to fill data and which can be accessed in PdfBox APIs
>>>> without having to go through PDAcrofield?
>>> 
>>> Yes, annotations... there are some common parts, but this is just a
>>> vague observation from me, I'm not the acroform specialist.
>> 
>> At first glance it looks like all the entries needed to (re-)generate the
>> form fields are there. That is likely what Adobe Reader does for this
>> document. It would be interesting to see what is saved after the form has
>> been filled out and saved using Acrobat. We'd need a test form to come up
>> with an enhancement like this.
>> 
>> BR
>> Maruan
>> 
>> 
>>> 
>>> What you should do: use NOTEPAD++ to check whether there's "/AcroForm" in
>>> the "compressed" file.
>>> - if it is missing, tell the client (or your boss) just that
>>> - if it isn't missing, then there's some problem in PDFBox (try also the
>>> loadNonSeq I mentioned earlier)
>>> 
>>> Tilman
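
The Notepad++ check described above can also be scripted. Here is a minimal sketch in plain Java (no PDFBox calls) that counts the raw tokens "/AcroForm" and "/ObjStm" in a file. The class name AcroFormTokenScan and the method countToken are made up for illustration. One caveat, stated as an assumption about this particular file: if its objects are stored in compressed object streams (which a "/ObjStm" hit would suggest), the catalog entry may not be visible as plain text, so a count of zero for "/AcroForm" does not prove the entry is absent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: a scripted version of the
// "search the raw file for /AcroForm" check.
public class AcroFormTokenScan {

    /** Counts non-overlapping occurrences of an ASCII token in raw bytes. */
    public static int countToken(byte[] data, String token) {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        if (needle.length == 0) {
            return 0; // nothing to search for
        }
        int count = 0;
        int i = 0;
        while (i <= data.length - needle.length) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                count++;
                i += needle.length; // skip past the match
            } else {
                i++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect.
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/AcroForm hits: " + countToken(pdf, "/AcroForm"));
        // Object streams can hide dictionaries from a plain-text search.
        System.out.println("/ObjStm hits:   " + countToken(pdf, "/ObjStm"));
    }
}
```

Running it on both the original and the "compressed" file and comparing the two counts gives the same information as the manual search, plus a hint whether object streams are in play.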
>>> 
>>>> 
>>>> You can use qpdf , then use these options:
>>>> 
>>>> I will now try using this link to compress the original file.
>>>> 
>>>> Another strategy to think about - can your client generate a
>>>> non-confidential file, so that you can share it, and the "compressed"
>>>> file?
>>>> 
>>>> I wish I had direct communication with the clients but due to
>>>> bureaucracy, I am having to go through multiple layers to get my message
>>>> across to them.
>>>> I will share more information as soon as I have it.
>>>> 
>>>> PS: I sent these image links to my personal email first to make sure
>>>> that I can open them. I could and so I am hoping you all could too. If
>>>> you are unable to open them, please let me know.
>>>> 
>>>> Thanks,
>>>> Balaji
>>>> 
>>>> 
>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de>
>>>> wrote:
>>>> 
>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <bvenkata@tibco.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you for your pointers and sorry about the image. I am attaching it
>>>>>>> with this email.
>>>>>>> 
>>>>>>> The point I am trying to make is that the PDF, which was decompressed
>>>>>>> using WriteDecodedDoc, is smaller in size than the original PDF given to
>>>>>>> us by our customers.
>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did not
>>>>>>> have any PDAcroForm fields, whereas the decompressed PDF given to us by
>>>>>>> the customers does contain AcroForm fields. Hence I wanted to know how to
>>>>>>> properly decompress the PDF using PDFBox APIs. The reason why I was
>>>>>>> analyzing COSStream was to check whether the decompression of the
>>>>>>> compressed PDF was happening correctly while using PDFBox APIs.
>>>>>>> I know it would have been difficult for you to help me without the actual
>>>>>>> PDFs. For that, I would like to thank you for your time and pointers.
>>>>>>> 
>>>>>> Maybe it's worth trying to share the files "visually" with us. Open both
>>>>>> files (compressed and decompressed) with PDFDebugger [1], post a
>>>>>> screenshot of both somewhere (dropbox etc.) and share the link with us.
>>>>>> Maybe that could shed some light on your issue.
>>>>>> 
>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> 
>>>>> 
>>>>>> BR
>>>>>> Andreas Lehmkühler
>>>>>> 
>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>> 
>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>> 
>>>>>>>> This is all very confusing... /acroform is in the document catalog. I
>>>>>>>> don't see how the page content stream is related to it. The best is that
>>>>>>>> you either go through the source code, or read the spec and then look at
>>>>>>>> the pdf.
>>>>>>>> 
>>>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>>>> entry and then compare the two files.
>>>>>>>> 
>>>>>>>> It is really difficult to help you without the files. The cause could
>>>>>>>> be a bug in pdfbox, or a malformed pdf...
>>>>>>>> 
>>>>>>>> Some more ideas:
>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in the
>>>>>>>> acroform stuff. Note that the API is different.
>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>> 
>>>>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>>>>> possible code that fails, and 2) post a small part of the raw PDF, i.e.
>>>>>>>> the objects relevant to the field in your code.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Tilman
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>> 
>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>>>>>> tried getting the COSStream of the page:
>>>>>>>>> 
>>>>>>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>> PDStream pdStream = firstPage.getContents();
>>>>>>>>> COSStream stream = pdStream.getStream();
>>>>>>>>> 
>>>>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>>>>> mode, has the following:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>> 
>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>> 
>>>>>>>>> From this point on, using the COSStream object for every page, how
>>>>>>>>> can I decompress and find out the acroform fields given that the
>>>>>>>>> unFilteredStream object is null for COSStream?
>>>>>>>>> 
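
For reference, FlateDecode data is zlib/deflate data, so the bytes between "stream" and "endstream" can be inflated with java.util.zip alone. The following is a minimal sketch under stated assumptions: the class and method names (FlateDecodeSketch, inflate, deflate) are hypothetical, the stream has no /DecodeParms predictor, and the input is exactly the /Length bytes of the stream body. It is probably also worth noting, as an assumption about PDFBox 1.8 internals rather than something verified against this file, that COSStream decodes lazily, so a null unFilteredStream field in the debugger would be expected until getUnfilteredStream() is called. Either way, inflating a page content stream yields drawing operators, not form fields; those are reached through the catalog's /AcroForm entry and the page's /Annots.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox: shows what the FlateDecode
// filter does to a stream body, using only java.util.zip.
public class FlateDecodeSketch {

    /** Inflates zlib-wrapped deflate data (the encoding FlateDecode uses). */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // No progress is possible: input ended early or needs a preset dictionary.
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Compresses bytes the same way, only so the sketch can be tried without a PDF. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for a page content stream; a real one would be cut out of the PDF.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(content);
        System.out.println(new String(inflate(packed), StandardCharsets.US_ASCII));
    }
}
```

With a real file, the input to inflate would be the raw stream body; the output of a page content stream will look like the operator sequence in main, which is why the field search has to start from /AcroForm or /Annots instead.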
>>>>>>>>> 
>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <bvenkata@tibco.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>     Thank you for your response Tilman.
>>>>>>>>> 
>>>>>>>>>     I had previously tried using the WriteDecodedDoc for my compressed
>>>>>>>>>     PDF and I tried to get the number of acro form fields present in
>>>>>>>>>     the output file generated by WriteDecodedDoc. The API still could
>>>>>>>>>     not find the acro form fields in the generated decompressed file.
>>>>>>>>>     Also the decompressed file generated is 75 KB which is far less
>>>>>>>>>     than the original decompressed file which I have (1.6 MB) though I
>>>>>>>>>     could edit the acro form fields using acrobat reader.
>>>>>>>>> 
>>>>>>>>>     Thanks,
>>>>>>>>>     Balaji
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>     <THausherr@t-online.de> wrote:
>>>>>>>>> 
>>>>>>>>>         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>> 
>>>>>>>>>             My question is: how do I flatedecode a PDF so that I can
>>>>>>>>>             find all the acroform fields within it? Any help or
>>>>>>>>>             pointers would be highly appreciated.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>         You could try the WriteDecodedDoc option of the command line
>>>>>>>>>         app
>>>>>>>>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>> 
>>>>>>>>>         Maybe you can get further ideas by comparing the two files
>>>>>>>>>         with NOTEPAD++... however the two files might have their
>>>>>>>>>         objects in different order.
>>>>>>>>> 
>>>>>>>>>         Tilman
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>         ---------------------------------------------------------------------
>>>>>>>>>         To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>         For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>> 



