pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: How to flatedecode and find all acroform fields in a compressed PDF
Date Sun, 24 May 2015 09:38:07 GMT
Hi,

> On 23.05.2015 at 16:37, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
> 
> Hi,
> 
> So AcroForms/Fields is an empty Array?
> 
> Yes, in the filled interview_compressed.pdf, the acroforms are not null but
> empty. Size of array is zero.
> 
> Also, I tried the qpdf command line tool to compress the file interview.pdf,
> and the resulting compressed file size of 1.6 MB was nowhere near the file
> size of interview_compressed.pdf (21 KB).

Do you think it would be possible to get a similar PDF file, or permission to use this one
internally, so that we have a sample to work from for a potential fix?

Although the PDF is not in line with the spec, Acrobat is able to handle it, so we could
look into getting a similar result.

BR
Maruan


> 
> Thanks,
> Balaji
> 
> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> wrote:
> 
>> Hi,
>> 
>>> On 22.05.2015 at 23:00, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>> 
>>> I opened the interview_compressed in notepad++ and did not see any
>>> 'Acroform' text anywhere.
>>> However, as Maruan suggested, I entered some data into what looks like
>>> form fields of interview_compressed.pdf and saved it. When I opened this
>>> file in notepad++, I did see 'Acroform' text in it. I also noticed an
>>> increase in file size from 21 KB to ~530 KB.
>>>
>>> I then ran this filled saved compressed PDF in pdfdebugger.java and saw
>>> that the field values were getting stored but not under Acroform fields
>>> but under Annotations.
>> 
>> 
>> 
>> So AcroForms/Fields is an empty Array?
>> 
>>> Please refer to this image:
>>> 
>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>> 
>>> So, whatever the compression technique was, it simply made all the
>>> Acroform fields disappear from the original PDF but retained all the
>>> annotations, which also contain the interactive forms, and this helped
>>> reduce the file size so much? If this is the case, can the pdfbox API
>>> also use a similar compression technique to compress such a huge file
>>> into a smaller one?
>>> 
>>> 
>>> 
>>> 
>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>>> On 22.05.2015 at 21:54, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>
>>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I used PdfDebugger to make the internal PDF structure of the two files,
>>>>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available,
>>>>>> and I have uploaded my images to imageshack. Here are the five links:
>>>>>> 
>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>> 
>>>>>> The first two links are from the internal structure of interview.pdf
>>>>>> (original uncompressed file)
>>>>>> The third and fourth links are from the internal structure of
>>>>>> interview_compressed.pdf (compressed file)
>>>>>> The fifth link compares the file sizes of the two files and, as you can
>>>>>> also see, the difference is huge.
>>>>>> 
>>>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>>> 
>>>>> Indeed... but this is needed - from the spec:
>>>>> 
>>>>> "The contents and properties of a document’s interactive form shall be
>>>>> defined by an interactive form dictionary that shall be referenced from
>>>>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>> Catalog”). Table 218 shows the contents of this dictionary."
>>>>> 
>>>> 
>>>> correct
>>>> 
>>>>>> fields listed even though opening the PDF in a PDF reader allows me to
>>>>>> enter values in places which look like AcroForm fields and also save
>>>>>> them. Are there any other PDF 'types' similar to Acroform fields which
>>>>>> would enable users to fill in data and which can be accessed in PdfBox
>>>>>> APIs without having to go through PDAcrofield?
>>>>> 
>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>> vague observation from me, I'm not the acroform specialist.
>>>> 
>>>> At first glance it looks like all the entries necessary to (re-)generate
>>>> the form fields are there. That's likely what is happening for this
>>>> document in Adobe Reader. It would be interesting to see what is being
>>>> saved after the form has been filled out and saved using Acrobat. We'd
>>>> need a test form to come up with an enhancement like this.
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> 
>>>>> 
>>>>> What you should do: use NOTEPAD++ to look whether there's "/AcroForm" in
>>>>> the "compressed" file.
>>>>> - if it is missing, tell the client (or your boss) just that
>>>>> - if it isn't missing, then there's some problem in PDFBox (try also the
>>>>> loadNonSeq I mentioned earlier)
>>>>> 
>>>>> Tilman
>>>>> 
>>>>>> 
>>>>>> You can use qpdf , then use these options:
>>>>>> 
>>>>>> I will now try using this link to compress the original file.
>>>>>> 
>>>>>> Another strategy to think about - can your client generate a
>>>>>> non-confidential file, so that you can share it, and the "compressed"
>>>>>> file?
>>>>>>
>>>>>> I wish I had direct communication with the clients but due to
>>>>>> bureaucracy, I am having to go through multiple layers to get my message
>>>>>> across to them.
>>>>>> I will share more information as soon as I have it.
>>>>>> 
>>>>>> PS: I sent these image links to my personal email first to make sure
>>>>>> that I can open them. I could, and so I am hoping you all can too. If
>>>>>> you are unable to open them, please let me know.
>>>>>> 
>>>>>> Thanks,
>>>>>> Balaji
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de>
>>>>>> wrote:
>>>>>>
>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <bvenkata@tibco.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thank you for your pointers and sorry about the image. I am attaching
>>>>>>>>> it with this email.
>>>>>>>>>
>>>>>>>>> The point I am trying to make is that the PDF, which was decompressed
>>>>>>>>> using WriteDecodedDoc, is smaller in size than the original PDF given
>>>>>>>>> to us by our customers.
>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>>>>>>>> not have any PDAcroform fields whereas the decompressed PDF given to
>>>>>>>>> us by the customers does contain Acroform fields. Hence I wanted to
>>>>>>>>> know how to properly decompress the PDF using pdfbox APIs. The reason
>>>>>>>>> why I was analyzing COSStream was to check if the decompression of
>>>>>>>>> the compressed PDF was happening correctly while using PDFBox APIs.
>>>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>>>> pointers.
>>>>>>>>> 
>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
>>>>>>>> both files (compressed and decompressed) with PDFDebugger [1] and post
>>>>>>>> a screenshot of both somewhere (dropbox etc.) and share the link with
>>>>>>>> us. Maybe that could shed some light on your issue.
>>>>>>>> 
>>>>>>> @Balaji: here's an example of how such a screenshot would look:
>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>> 
>>>>>>> Tilman
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> BR
>>>>>>>> Andreas Lehmkühler
>>>>>>>> 
>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>> 
>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr
>>>>>>>>> <THausherr@t-online.de> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>>
>>>>>>>>>> This is all very confusing... /acroform is in the document catalog.
>>>>>>>>>> I don't see how the page content stream is related to it. The best
>>>>>>>>>> is that you either go through the source code, or read the spec and
>>>>>>>>>> then look at the pdf.
>>>>>>>>>>
>>>>>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>>>>>> entry and then compare the two files.
>>>>>>>>>>
>>>>>>>>>> It is really difficult to help you without the files. The cause
>>>>>>>>>> could be a bug in pdfbox, or a malformed pdf...
>>>>>>>>>>
>>>>>>>>>> Some more ideas:
>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
>>>>>>>>>> the acroform stuff. Note that the API is different.
>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>>
>>>>>>>>>> If you still need help, one possibility would be 1) post the
>>>>>>>>>> smallest possible code that fails, and 2) post a small part of the
>>>>>>>>>> raw PDF, i.e. the objects relevant to the field in your code.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Tilman
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>>> 
>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3 pages),
>>>>>>>>>>> I tried getting the COSStream for each page:
>>>>>>>>>>>
>>>>>>>>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>> pdStream = firstPage.getContents();
>>>>>>>>>>> COSStream stream = pdStream.getStream();
>>>>>>>>>>>
>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>>>>>>> mode, has the following:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>>>>
>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>>
>>>>>>>>>>> From this point on, using the COSStream object for every page, how
>>>>>>>>>>> can I decompress and find out the acroform fields given that the
>>>>>>>>>>> unFilteredStream object is null for COSStream?
>>>>>>>>>>>
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>>>> <bvenkata@tibco.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>    Thank you for your response Tilman.
>>>>>>>>>>> 
>>>>>>>>>>>    I had previously tried using WriteDecodedDoc on my compressed
>>>>>>>>>>>    PDF and I tried to get the number of acro form fields present in
>>>>>>>>>>>    the output file generated by WriteDecodedDoc. The API still could
>>>>>>>>>>>    not find the acro form fields in the generated decompressed file.
>>>>>>>>>>>    Also, the decompressed file generated is 75 KB, which is far less
>>>>>>>>>>>    than the original decompressed file which I have (1.6 MB), though
>>>>>>>>>>>    I could edit the acro form fields using Acrobat Reader.
>>>>>>>>>>> 
>>>>>>>>>>>    Thanks,
>>>>>>>>>>>    Balaji
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>    On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>>    <THausherr@t-online.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>>        On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>>
>>>>>>>>>>>            My question is: how do I flatedecode a PDF so that I can
>>>>>>>>>>>            find all the acroform fields within it. Any help or
>>>>>>>>>>>            pointers would be highly appreciated.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>        You could try the WriteDecodedDoc option of the command line
>>>>>>>>>>>        app
>>>>>>>>>>>        https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>>
>>>>>>>>>>>        Maybe you can have further ideas by comparing the two files
>>>>>>>>>>>        with NOTEPAD++.... however the two files might have their
>>>>>>>>>>>        objects in different order.
>>>>>>>>>>> 
>>>>>>>>>>>        Tilman
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>        ---------------------------------------------------------------------
>>>>>>>>>>>        To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>>>        For additional commands, e-mail: users-help@pdfbox.apache.org