pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: How to flatedecode and find all acroform fields in a compressed PDF
Date Wed, 27 May 2015 05:45:21 GMT
I just tested it. It also removes /Outlines, /Metadata and other
important data from PDF files.
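
For anyone who wants to repeat that check on their own files, here is a minimal sketch in plain Java (no PDFBox needed). The class name `CatalogKeyScan` and the list of names are only chosen for illustration. It does nothing more than a text search on the raw bytes, so it can miss entries that sit inside compressed object streams and it can match the same text elsewhere in the file; treat the result as a hint, not as proof.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class CatalogKeyScan {
    // Names to look for; these are the entries mentioned in this thread.
    static final String[] KEYS = {"/AcroForm", "/Outlines", "/Metadata"};

    /** Reports, per name, whether it appears anywhere in the raw bytes. */
    static Map<String, Boolean> scan(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Boolean> found = new LinkedHashMap<>();
        for (String key : KEYS) {
            found.put(key, raw.contains(key));
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java CatalogKeyScan some.pdf
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        scan(bytes).forEach((k, v) ->
                System.out.println(k + ": " + (v ? "present" : "not found")));
    }
}
```

Running it on the original and on the "compressed" file and comparing the two outputs should show which of these names disappeared.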

So your client can't share the PDF with us, but he shared it with some website.

A little research shows that this website is owned by Lauri Lehtinen
from Tallinn, Estonia.
http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
https://www.linkedin.com/in/laurilehtinen
https://twitter.com/laurii

I also tweeted him.

Tilman

On 27.05.2015 at 03:06, Balaji Venkatamohan wrote:
> Okay, I found out the online tool used by the customer to compress their
> PDF.
>
> It is : https://www.pdfcompress.com/
>
> I don't need to rely on the PDF sent by the customer because all PDFs that
> are available on the web, are compressed in the same manner by this tool,
> that is, it gets rid of all acro form fields during compression.
>
> For example, the f941 govt form available at this site:
> http://www.irs.gov/pub/irs-pdf/f941.pdf
> If we compress this using the online tool, the resultant file size is very
> low, which is good. However, there are no acro form fields in the
> compressed PDF.
>
> Thanks,
> Balaji
>
>
>
> On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> wrote:
>
>> Hi,
>>
>>> On 23.05.2015 at 16:37, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>
>>> Hi,
>>>
>>> So AcroForms/Fields is an empty Array?
>>>
>>> Yes, in the filled interview_compressed.pdf, the AcroForm fields are not
>>> null but empty. The size of the array is zero.
>>>
>>> Also, I tried the qpdf command line tool to compress the file interview.pdf,
>>> and the resulting compressed file size of 1.6 MB was nowhere near the file
>>> size of interview_compressed.pdf (21 KB).
>> Do you think it's possible to get a similar PDF file, or permission to
>> use it internally, so we have a sample to look at for a potential fix?
>>
>> Although the PDF is not in line with the spec, Acrobat is able to handle
>> it, so we could look into getting a similar result.
>>
>> BR
>> Maruan
>>
>>
>>> Thanks,
>>> Balaji
>>>
>>> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 22.05.2015 at 23:00, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>>> I opened interview_compressed.pdf in Notepad++ and did not see any
>>>>> 'Acroform' text anywhere.
>>>>> However, as Maruan suggested, I entered some data into what look like
>>>>> form fields of interview_compressed.pdf and saved it. When I opened this
>>>>> file in Notepad++, I did see 'Acroform' text in it. I also noticed an
>>>>> increase in file size from 21 KB to ~530 KB.
>>>>>
>>>>> I then ran this filled, saved, compressed PDF in pdfdebugger.java and saw
>>>>> that the field values were getting stored, not under the Acroform fields
>>>>> but under Annotations.
>>>>
>>>>
>>>> So AcroForms/Fields is an empty Array?
>>>>
>>>>> Please refer to this image:
>>>>>
>>>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>>>>
>>>>> So, whatever the compression technique was, it simply made all the
>>>>> Acroform fields disappear from the original PDF but retained all
>>>>> annotations, which also contain the interactive forms, and this helped
>>>>> reduce the file size so much? If this is the case, can the pdfbox API
>>>>> also use a similar compression technique to compress such a huge file
>>>>> into a smaller one?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> On 22.05.2015 at 21:54, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I used PdfDebugger to make the internal PDF structure of the two files,
>>>>>>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available,
>>>>>>>> and I have uploaded my images to imageshack. Here are the five links:
>>>>>>>>
>>>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>>>>
>>>>>>>> The first two links are from the internal structure of interview.pdf
>>>>>>>> (original uncompressed file).
>>>>>>>> The third and fourth links are from the internal structure of
>>>>>>>> interview_compressed.pdf (compressed file).
>>>>>>>> The fifth link compares the file sizes of the two files and, as you can
>>>>>>>> also see, the difference is huge.
>>>>>>>>
>>>>>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>>>>> Indeed... but this is needed - from the spec:
>>>>>>>
>>>>>>> "The contents and properties of a document’s interactive form shall be
>>>>>>> defined by an interactive form dictionary that shall be referenced from
>>>>>>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>>>> Catalog”). Table 218 shows the contents of this dictionary."
>>>>>> correct
>>>>>> correct
>>>>>>
>>>>>>>> fields listed even though opening the PDF in pdf reader allows me to
>>>>>>>> enter values in places which look like AcroForm fields and also save
>>>>>>>> them. Are there any other PDF 'types' similar to Acroform fields which
>>>>>>>> would enable users to fill data and which can be accessed in PdfBox APIs
>>>>>>>> without having to go through PDAcrofield?
>>>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>>>> vague observation from me, I'm not the acroform specialist.
>>>>>>
>>>>>> At first glance it looks like all the entries necessary to (re-)generate
>>>>>> the form fields are there. That's what's likely happening for this
>>>>>> document in Adobe Reader. It would be interesting to see what's being
>>>>>> saved after the form has been filled out and saved using Acrobat. We'd
>>>>>> need a test form to come up with an enhancement like this.
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>
>>>>>>> What you should do: use NOTEPAD++ to look whether there's "/AcroForm" in
>>>>>>> the "compressed" file.
>>>>>>> - if it is missing, tell the client (or your boss) just that
>>>>>>> - if it isn't missing, then there's some problem in PDFBox (try also the
>>>>>>> loadNonSeq I mentioned earlier)
>>>>>>> Tilman
>>>>>>>
>>>>>>>> You can use qpdf , then use these options:
>>>>>>>>
>>>>>>>> I will now try using this link to compress the original file.
>>>>>>>>
>>>>>>>> Another strategy to think about - can your client generate a
>>>>>>>> non-confidential file, so that you can share it, and the "compressed"
>>>>>>>> file?
>>>>>>>> I wish I had direct communication with the clients but due to
>>>>>>>> bureaucracy, I am having to go through multiple layers to get my message
>>>>>>>> across to them.
>>>>>>>> I will share more information as soon as I have it.
>>>>>>>>
>>>>>>>> PS: I sent these image links to my personal email first to make sure
>>>>>>>> that I can open them. I could and so I am hoping you all can too. If you
>>>>>>>> are unable to open them, please let me know.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Balaji
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your pointers and sorry about the image. I am attaching
>>>>>>>>>>> it with this email.
>>>>>>>>>>>
>>>>>>>>>>> The point I am trying to make is that the PDF, which was decompressed
>>>>>>>>>>> using WriteDecodedDoc, is smaller in size than the original PDF given
>>>>>>>>>>> to us by our customers.
>>>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>>>>>>>>>> not have any PDAcroform fields whereas the decompressed PDF given to us
>>>>>>>>>>> by the customers does contain Acroform fields. Hence I wanted to know
>>>>>>>>>>> how to properly decompress the PDF using pdfbox APIs. The reason why I
>>>>>>>>>>> was analyzing COSStream was to check if the decompression of the
>>>>>>>>>>> compressed PDF was happening correctly while using PDFBox APIs.
>>>>>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>>>>>> pointers.
>>>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open both
>>>>>>>>>> files (compressed and decompressed) with PDFDebugger [1] and post a
>>>>>>>>>> screenshot of both somewhere (dropbox etc.) and share the link with us.
>>>>>>>>>> Maybe that could shed some light on your issue.
>>>>>>>>>>
>>>>>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> BR
>>>>>>>>>> Andreas Lehmkühler
>>>>>>>>>>
>>>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>>>>
>>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>>>>
>>>>>>>>>>>> This is all very confusing... /acroform is in the document catalog. I
>>>>>>>>>>>> don't see how the page content stream is related to it. The best is
>>>>>>>>>>>> that you either go through the source code, or read the spec and then
>>>>>>>>>>>> look at the pdf.
>>>>>>>>>>>>
>>>>>>>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>>>>>>>> entry and then compare the two files.
>>>>>>>>>>>>
>>>>>>>>>>>> It is really difficult to help you without the files. The cause could
>>>>>>>>>>>> be a bug in pdfbox, or a malformed pdf...
>>>>>>>>>>>>
>>>>>>>>>>>> Some more ideas:
>>>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
>>>>>>>>>>>> the acroform stuff. Note that the API is different.
>>>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>>>>
>>>>>>>>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>>>>>>>>> possible code that fails, and 2) post a small part of the raw PDF,
>>>>>>>>>>>> i.e. the objects relevant to the field in your code.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Tilman
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>>>>>>>>>> tried getting the COSStream for each page:
>>>>>>>>>>>>>
>>>>>>>>>>>>> PDPage firstPage=(PDPage)
>>>>>>>>>>>>> document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>>>>             pdStream=firstPage.getContents();
>>>>>>>>>>>>>             COSStream stream=pdStream.getStream();
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>>>>>>>>> mode, has the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>>>>
>>>>>>>>>>>>> From this point on, using the COSStream object for every page, how
>>>>>>>>>>>>> can I decompress and find out the acroform fields given that the
>>>>>>>>>>>>> unFilteredStream object is null for COSStream?
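
As a rough illustration of what FlateDecode means at the byte level, here is a minimal sketch that uses only `java.util.zip` (FlateDecode data is in zlib/deflate format). The class and method names are made up for this example. It assumes the bytes between `stream` and `endstream` have already been cut out of the file and that no /DecodeParms predictor is involved. In PDFBox 1.8 the usual route is, as far as I know, `COSStream.getUnfilteredStream()`, which decodes on demand, so a null `unFilteredStream` field in the debugger may only mean that nothing has asked for the decoded data yet. Note also that a decoded page content stream holds drawing operators, not field definitions, so it is not the place to look for the AcroForm.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class FlateSketch {
    /** Inflates zlib-wrapped deflate data, the format used by /FlateDecode. */
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // Truncated input or a preset dictionary we do not have:
                    // stop and return whatever was decoded so far.
                    break;
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end(); // release native resources
        }
        return out.toByteArray();
    }
}
```

The result can be turned into text with `new String(bytes, StandardCharsets.ISO_8859_1)` for a quick look at the operators.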
>>>>>>>>>>>>> ​
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>>>>>> <bvenkata@tibco.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Thank you for your response Tilman.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     I had previously tried using the WriteDecodedDoc for my compressed
>>>>>>>>>>>>>     PDF and I tried to get the number of acro form fields present in
>>>>>>>>>>>>>     the output file generated by WriteDecodedDoc. The API still could
>>>>>>>>>>>>>     not find the acro form fields in the generated decompressed file.
>>>>>>>>>>>>>     Also the decompressed file generated is 75 KB which is far less
>>>>>>>>>>>>>     than the original decompressed file which I have (1.6 MB) though I
>>>>>>>>>>>>>     could edit the acro form fields using acrobat reader.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Thanks,
>>>>>>>>>>>>>     Balaji
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>>>>     <THausherr@t-online.de> wrote:
>>>>>>>>>>>>>         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>             My question is: how do I flatedecode a PDF so that I can
>>>>>>>>>>>>>             find all the acroform fields within it? Any help or
>>>>>>>>>>>>>             pointers would be highly appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>         You could try the WriteDecodedDoc option of the command line
>>>>>>>>>>>>>         app
>>>>>>>>>>>>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>>>>         Maybe you can get further ideas by comparing the two files
>>>>>>>>>>>>>         with NOTEPAD++.... however the two files might have their
>>>>>>>>>>>>>         objects in different order.
>>>>>>>>>>>>>
>>>>>>>>>>>>>         Tilman
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>         To unsubscribe, e-mail:
>>>>>> users-unsubscribe@pdfbox.apache.org
>>>>>>>>>>>>>         <mailto:users-unsubscribe@pdfbox.apache.org>
>>>>>>>>>>>>>         For additional commands, e-mail:
>>>>>> users-help@pdfbox.apache.org
>>>>>>>>>>>>>         <mailto:users-help@pdfbox.apache.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

