pdfbox-users mailing list archives

From Balaji Venkatamohan <bvenk...@tibco.com>
Subject Re: How to flatedecode and find all acroform fields in a compressed PDF
Date Wed, 27 May 2015 01:06:00 GMT
Okay, I found out the online tool used by the customer to compress their
PDF.

It is: https://www.pdfcompress.com/

I don't need to rely on the PDF sent by the customer, because any PDF
available on the web is compressed in the same manner by this tool:
it removes all AcroForm fields during compression.
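
The Notepad++ check suggested earlier in this thread (searching the file for
"/AcroForm") can also be scripted. Here is a minimal sketch using only the
Java standard library. The names AcroFormScan and containsAcroFormKey are
made up for illustration and are not part of PDFBox. It only looks at the raw
bytes, so it would miss a key that sits inside a compressed object stream.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class AcroFormScan {
    /** Returns true if the raw bytes contain the literal token "/AcroForm". */
    static boolean containsAcroFormKey(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/AcroForm");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java AcroFormScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(containsAcroFormKey(bytes) ? "/AcroForm found" : "/AcroForm not found");
    }
}
```

Running it on the original and on the compressed file should mirror what the
text-editor search shows for each.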

For example, take the f941 government form available at this site:
http://www.irs.gov/pub/irs-pdf/f941.pdf
If we compress it using the online tool, the resulting file is much
smaller, which is good. However, the compressed PDF contains no AcroForm
fields.
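
On the FlateDecode part of the subject line: FlateDecode is zlib/deflate
compression, so the body of a single stream can be inflated with java.util.zip
alone. Below is a minimal sketch, assuming the bytes between "stream" and
"endstream" have already been cut out of the file and that no predictor is in
use. FlateSketch and inflate are hypothetical names, and the sample data in
main is made up. This is not how PDFBox does it internally, and decoding
streams does not bring back an /AcroForm entry that is missing from the
catalog.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateSketch {
    /** Inflates zlib-wrapped deflate data, the format used by /FlateDecode stream bodies. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input or preset dictionary: return what was decoded so far
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib/deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Round trip on made-up sample data, standing in for a real stream body.
        byte[] sample = "<< /Type /Catalog /AcroForm 5 0 R >>".getBytes(StandardCharsets.ISO_8859_1);
        Deflater deflater = new Deflater();
        deflater.setInput(sample);
        deflater.finish();
        byte[] buf = new byte[256];
        int len = deflater.deflate(buf);
        deflater.end();
        byte[] decoded = inflate(Arrays.copyOf(buf, len));
        System.out.println(new String(decoded, StandardCharsets.ISO_8859_1));
    }
}
```

The decoded bytes can then be searched for "/AcroForm" in the same way as the
raw file.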

Thanks,
Balaji



On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

> Hi,
>
> > On 23.05.2015 at 16:37, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
> >
> > Hi,
> >
> > So AcroForms/Fields is an empty Array?
> >
> > Yes, in the filled interview_compressed.pdf, the AcroForm fields array
> > is not null but empty. The size of the array is zero.
> >
> > Also, I tried the qpdf command-line tool to compress the file
> > interview.pdf, and the resulting file size of 1.6 MB was nowhere near the
> > file size of interview_compressed.pdf (21 KB).
>
> Do you think it's possible to get a similar PDF file, or permission to
> use this one internally, so that we have a sample to look at for a
> potential fix?
>
> Although the PDF is not in line with the spec, Acrobat is able to handle
> it, so we could look into getting a similar result.
>
> BR
> Maruan
>
>
> >
> > Thanks,
> > Balaji
> >
> > On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> > wrote:
> >
> >> Hi,
> >>
> >>> On 22.05.2015 at 23:00, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
> >>>
> >>> I opened interview_compressed.pdf in Notepad++ and did not see any
> >>> 'AcroForm' text anywhere.
> >>> However, as Maruan suggested, I entered some data into what look like
> >>> form fields in interview_compressed.pdf and saved it. When I opened this
> >>> file in Notepad++, I did see 'AcroForm' text in it. I also noticed an
> >>> increase in file size from 21 KB to ~530 KB.
> >>>
> >>> I then opened this filled and saved compressed PDF in PDFDebugger and saw
> >>> that the field values were being stored, not under the AcroForm fields
> >>> but under Annotations.
> >>
> >>
> >>
> >> So AcroForms/Fields is an empty Array?
> >>
> >>> Please refer to this image:
> >>>
> >>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
> >>>
> >>> So, whatever the compression technique was, it simply made all the
> >>> AcroForm fields disappear from the original PDF but retained all the
> >>> annotations, which also contain the interactive forms, and this helped
> >>> reduce the file size so much? If this is the case, can the PDFBox API
> >>> use a similar compression technique to compress such a huge file into a
> >>> smaller one?
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>> On 22.05.2015 at 21:54, Tilman Hausherr <THausherr@t-online.de> wrote:
> >>>>>
> >>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I used PDFDebugger to make the internal PDF structure of the two files,
> >>>>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available,
> >>>>>> and I have uploaded my images to imageshack. Here are the five links:
> >>>>>>
> >>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
> >>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
> >>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
> >>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
> >>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
> >>>>>>
> >>>>>> The first two links show the internal structure of interview.pdf
> >>>>>> (original uncompressed file).
> >>>>>> The third and fourth links show the internal structure of
> >>>>>> interview_compressed.pdf (compressed file).
> >>>>>> The fifth link compares the sizes of the two files; as you can
> >>>>>> see, the difference is huge.
> >>>>>>
> >>>>>> As you might notice, the file interview_compressed.pdf has no AcroForm
> >>>>>
> >>>>> Indeed... but this is needed - from the spec:
> >>>>>
> >>>>> "The contents and properties of a document’s interactive form shall be
> >>>>> defined by an interactive form dictionary that shall be referenced from
> >>>>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
> >>>>> Catalog”). Table 218 shows the contents of this dictionary."
> >>>>>
> >>>>
> >>>> correct
> >>>>
> >>>>>> fields listed, even though opening the PDF in a PDF reader allows me to
> >>>>>> enter values in places which look like AcroForm fields and also save
> >>>>>> them. Are there any other PDF 'types' similar to AcroForm fields which
> >>>>>> would enable users to fill in data and which can be accessed through
> >>>>>> the PDFBox APIs without having to go through PDAcroForm?
> >>>>>
> >>>>> Yes, annotations... there are some common parts, but this is just a
> >>>>> vague observation from me, I'm not the AcroForm specialist.
> >>>>
> >>>> At first glance it looks like all the entries necessary to (re-)generate
> >>>> the form fields are there. That's what's likely happening for this
> >>>> document in Adobe Reader. It would be interesting to see what's being
> >>>> saved after the form has been filled out and saved using Acrobat. We'd
> >>>> need a test form to come up with an enhancement like this.
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>
> >>>>>
> >>>>> What you should do: use Notepad++ to check whether there's "/AcroForm"
> >>>>> in the "compressed" file.
> >>>>> - if it is missing, tell the client (or your boss) just that
> >>>>> - if it isn't missing, then there's some problem in PDFBox (also try
> >>>>> the loadNonSeq I mentioned earlier)
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>>>
> >>>>>> You can use qpdf , then use these options:
> >>>>>>
> >>>>>> I will now try using this link to compress the original file.
> >>>>>>
> >>>>>> Another strategy to think about - can your client generate a
> >>>>>> non-confidential file, so that you can share it, and the "compressed"
> >>>>>> file?
> >>>>>>
> >>>>>> I wish I had direct communication with the clients, but due to
> >>>>>> bureaucracy I have to go through multiple layers to get my message
> >>>>>> across to them.
> >>>>>> I will share more information as soon as I have it.
> >>>>>>
> >>>>>> PS: I sent these image links to my personal email first to make sure
> >>>>>> that I can open them. I could, so I am hoping you all can too. If you
> >>>>>> are unable to open them, please let me know.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Balaji
> >>>>>>
> >>>>>>
> >>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <bvenkata@tibco.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thank you for your pointers and sorry about the image. I am
> >>>>>>>>> attaching it with this email.
> >>>>>>>>>
> >>>>>>>>> The point I am trying to make is that the PDF which was decompressed
> >>>>>>>>> using WriteDecodedDoc is smaller in size than the original PDF given
> >>>>>>>>> to us by our customers.
> >>>>>>>>> Also, the decompressed PDF generated by PDFBox's WriteDecodedDoc did
> >>>>>>>>> not have any AcroForm fields, whereas the decompressed PDF given to
> >>>>>>>>> us by the customers does contain AcroForm fields. Hence I wanted to
> >>>>>>>>> know how to properly decompress the PDF using the PDFBox APIs. The
> >>>>>>>>> reason I was analyzing COSStream was to check whether the
> >>>>>>>>> decompression of the compressed PDF was happening correctly when
> >>>>>>>>> using the PDFBox APIs.
> >>>>>>>>> I know it would have been difficult for you to help me without the
> >>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
> >>>>>>>>> pointers.
> >>>>>>>>>
> >>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
> >>>>>>>> both files (compressed and decompressed) with PDFDebugger [1], post
> >>>>>>>> a screenshot of both somewhere (Dropbox etc.) and share the link with
> >>>>>>>> us. Maybe that could shed some light on your issue.
> >>>>>>>>
> >>>>>>> @Balaji: here's an example of what such a screenshot would look like:
> >>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
> >>>>>>>
> >>>>>>> Tilman
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> BR
> >>>>>>>> Andreas Lehmkühler
> >>>>>>>>
> >>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
> >>>>>>>>
> >>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>> The image doesn't appear in the mailing list.
> >>>>>>>>>>
> >>>>>>>>>> This is all very confusing... /AcroForm is in the document catalog. I
> >>>>>>>>>> don't see how the page content stream is related to it. The best is
> >>>>>>>>>> that you either go through the source code, or read the spec and then
> >>>>>>>>>> look at the PDF.
> >>>>>>>>>>
> >>>>>>>>>> To find out what's going on, you'd have to start from that /AcroForm
> >>>>>>>>>> entry and then compare the two files.
> >>>>>>>>>>
> >>>>>>>>>> It is really difficult to help you without the files. The cause could
> >>>>>>>>>> be a bug in PDFBox, or a malformed PDF...
> >>>>>>>>>>
> >>>>>>>>>> Some more ideas:
> >>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
> >>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
> >>>>>>>>>> the AcroForm stuff. Note that the API is different.
> >>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
> >>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
> >>>>>>>>>>
> >>>>>>>>>> If you still need help, one possibility would be to 1) post the
> >>>>>>>>>> smallest possible code that fails, and 2) post a small part of the raw
> >>>>>>>>>> PDF, i.e. the objects relevant to the field in your code.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Tilman
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3 pages), I
> >>>>>>>>>>> tried getting the COSStream for each page:
> >>>>>>>>>>>
> >>>>>>>>>>> // PDFBox 1.8 API: content stream of the first page
> >>>>>>>>>>> PDPage firstPage = (PDPage)
> >>>>>>>>>>>     document.getDocumentCatalog().getAllPages().get(0);
> >>>>>>>>>>> PDStream pdStream = firstPage.getContents();
> >>>>>>>>>>> COSStream stream = pdStream.getStream();
> >>>>>>>>>>>
> >>>>>>>>>>> In the above code snippet, the object stream, when analyzed in debug
> >>>>>>>>>>> mode, has the following:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
> >>>>>>>>>>>
> >>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
> >>>>>>>>>>>
> >>>>>>>>>>> From this point on, using the COSStream object for every page, how
> >>>>>>>>>>> can I decompress and find the AcroForm fields, given that the
> >>>>>>>>>>> unFilteredStream object is null for the COSStream?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
> >>>>>>>>>>> <bvenkata@tibco.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>    Thank you for your response, Tilman.
> >>>>>>>>>>>
> >>>>>>>>>>>    I had previously tried using WriteDecodedDoc on my compressed
> >>>>>>>>>>>    PDF, and I tried to get the number of AcroForm fields present in
> >>>>>>>>>>>    the output file generated by WriteDecodedDoc. The API still could
> >>>>>>>>>>>    not find the AcroForm fields in the generated decompressed file.
> >>>>>>>>>>>    Also, the decompressed file generated is 75 KB, which is far less
> >>>>>>>>>>>    than the original decompressed file which I have (1.6 MB), though I
> >>>>>>>>>>>    could edit the AcroForm fields using Acrobat Reader.
> >>>>>>>>>>>
> >>>>>>>>>>>    Thanks,
> >>>>>>>>>>>    Balaji
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>    On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
> >>>>>>>>>>>    <THausherr@t-online.de> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>        On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>            My question is: how do I FlateDecode a PDF so that I can
> >>>>>>>>>>>            find all the AcroForm fields within it? Any help or
> >>>>>>>>>>>            pointers would be highly appreciated.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>        You could try the WriteDecodedDoc option of the command line app
> >>>>>>>>>>>        https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
> >>>>>>>>>>>
> >>>>>>>>>>>        Maybe you can get further ideas by comparing the two files
> >>>>>>>>>>>        with Notepad++... however, the two files might have their
> >>>>>>>>>>>        objects in a different order.
> >>>>>>>>>>>
> >>>>>>>>>>>        Tilman
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>> ---------------------------------------------------------------------
> >>>>>>>>>>>        To unsubscribe, e-mail:
> >>>> users-unsubscribe@pdfbox.apache.org
> >>>>>>>>>>>        <mailto:users-unsubscribe@pdfbox.apache.org>
> >>>>>>>>>>>        For additional commands, e-mail:
> >>>> users-help@pdfbox.apache.org
> >>>>>>>>>>>        <mailto:users-help@pdfbox.apache.org>
