From: Tilman Hausherr
Date: Wed, 27 May 2015 07:45:21 +0200
To: users@pdfbox.apache.org
Subject: Re: How to flatedecode and find all acroform fields in a compressed PDF
Message-ID: <556559F1.9060609@t-online.de>

I just tested it. It also removes /Outlines, /Metadata and other important
data from PDF files.

So your client can't share the PDF with us, but he shared it with some
website.

A little research shows that this website is owned by Lauri Lehtinen from
Tallinn, Estonia.

http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
https://www.linkedin.com/in/laurilehtinen
https://twitter.com/laurii

I also tweeted him.

Tilman

On 27.05.2015 at 03:06, Balaji Venkatamohan wrote:
> Okay, I found out the online tool used by the customer to compress their
> PDF.
>
> It is: https://www.pdfcompress.com/
>
> I don't need to rely on the PDF sent by the customer because all PDFs that
> are available on the web are compressed in the same manner by this tool,
> that is, it gets rid of all acro form fields during compression.
>
> For example, the f941 govt form available at this site:
> http://www.irs.gov/pub/irs-pdf/f941.pdf
> If we compress this using the online tool, the resultant file size is very
> low, which is good. However, there are no acro form fields in the
> compressed PDF.
>
> Thanks,
> Balaji
>
> On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun wrote:
>
>> Hi,
>>
>>> On 23.05.2015 at 16:37, Balaji Venkatamohan wrote:
>>>
>>> Hi,
>>>
>>> So AcroForms/Fields is an empty Array?
>>>
>>> Yes, in the filled interview_compressed.pdf, the acroforms are not null
>>> but empty. Size of array is zero.
>>>
>>> Also, I tried the qpdf command line tool to compress the file
>>> interview.pdf and the resultant compressed file size of 1.6 MB was
>>> nowhere near the file size of interview_compressed.pdf (21 KB).
>> Do you think it's possible to get a similar PDF file, or permission to
>> use it internally, so we have a sample to look at for a potential fix?
>>
>> Although the PDF is not in line with the spec, as Acrobat is able to
>> handle it we could look into getting a similar result.
>>
>> BR
>> Maruan
>>
>>> Thanks,
>>> Balaji
>>>
>>> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 22.05.2015 at 23:00, Balaji Venkatamohan wrote:
>>>>> I opened the interview_compressed in notepad++ and did not see any
>>>>> 'Acroform' text anywhere.
>>>>> However, as Maruan suggested, I entered some data into what looks
>>>>> like form fields of interview_compressed.pdf and saved it. When I
>>>>> opened this file in notepad++, I did see 'Acroform' text in it. I
>>>>> also noticed an increase in file size from 21 KB to ~530 KB.
>>>>>
>>>>> I then ran this filled saved compressed PDF in pdfdebugger.java and
>>>>> saw that the field values were getting stored, not under Acroform
>>>>> fields but under Annotations.
>>>>
>>>> So AcroForms/Fields is an empty Array?
>>>>
>>>>> Please refer to this image:
>>>>>
>>>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>>>>
>>>>> So, whatever the compression technique was, it simply made all the
>>>>> Acroform fields disappear from the original PDF but retained all
>>>>> annotations which also contain the interactive forms, and this helped
>>>>> reduce the file size so much? If this is the case, can the pdfbox API
>>>>> also use a similar compression technique to compress such a huge file
>>>>> into a smaller one?
>>>>>
>>>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun
>>>>> <sahyoun@fileaffairs.de> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> On 22.05.2015 at 21:54, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I used PdfDebugger to make the internal PDF structure of the two
>>>>>>>> files (1) interview.pdf and (2) interview_compressed.pdf visually
>>>>>>>> available and I have uploaded my images to imageshack. Here are
>>>>>>>> the five links:
>>>>>>>>
>>>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>>>>
>>>>>>>> The first two links are from the internal structure of
>>>>>>>> interview.pdf (original uncompressed file).
>>>>>>>> The third and fourth links are from the internal structure of
>>>>>>>> interview_compressed.pdf (compressed file).
>>>>>>>> The fifth link compares the file sizes of the two files and, as
>>>>>>>> you can also see, the difference is huge.
>>>>>>>>
>>>>>>>> As you might notice, the file interview_compressed.pdf has no
>>>>>>>> acroform
>>>>>>> Indeed...
>>>>>>> but this is needed - from the spec:
>>>>>>>
>>>>>>> "The contents and properties of a document’s interactive form shall
>>>>>>> be defined by an interactive form dictionary that shall be
>>>>>>> referenced from the AcroForm entry in the document catalogue (see
>>>>>>> 7.7.2, “Document Catalog”). Table 218 shows the contents of this
>>>>>>> dictionary."
>>>>>> correct
>>>>>>
>>>>>>>> fields listed even though opening the PDF in pdf reader allows me
>>>>>>>> to enter values in places which look like AcroForm fields and also
>>>>>>>> save them. Are there any other PDF 'types' similar to Acroform
>>>>>>>> fields which would enable users to fill data and which can be
>>>>>>>> accessed in PdfBox APIs without having to go through PDAcrofield?
>>>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>>>> vague observation from me, I'm not the acroform specialist.
>>>>>>
>>>>>> From a first glance it looks like all the entries necessary to
>>>>>> (re-)generate the form fields are there. That's what's likely
>>>>>> happening for this document in Adobe Reader. It would be interesting
>>>>>> to see what's being saved after the form has been filled out and
>>>>>> saved using Acrobat. We'd need a test form to come up with an
>>>>>> enhancement like this.
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>> What you should do: use NOTEPAD++ to check whether there's
>>>>>>> "/AcroForm" in the "compressed" file.
>>>>>>> - if it is missing, tell the client (or your boss) just that
>>>>>>> - if it isn't missing, then there's some problem in PDFBox (try
>>>>>>>   also the loadNonSeq I mentioned earlier)
>>>>>>> Tilman
>>>>>>>
>>>>>>>> You can use qpdf, then use these options:
>>>>>>>>
>>>>>>>> I will now try using this link to compress the original file.
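The Notepad++ check suggested above (look for "/AcroForm" in the raw file) can also be scripted. What follows is only a minimal sketch in plain Java without PDFBox; the class name AcroFormScan and the helper containsToken are made up for illustration. One caveat, which is an assumption about the file and not something established in this thread: if the document catalog is stored inside a compressed object stream (/ObjStm), the name will not be visible in the raw bytes, so a negative result is only a hint and not proof.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AcroFormScan {

    /** Returns true if the ASCII token occurs anywhere in the raw bytes. */
    public static boolean containsToken(byte[] data, String token) {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer; // mismatch, try the next start offset
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java AcroFormScan <file.pdf>");
            return;
        }
        // args[0] is a placeholder path, e.g. the "compressed" PDF.
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/AcroForm visible in raw bytes: "
                + containsToken(pdf, "/AcroForm"));
        // If object streams are present, /AcroForm may be hidden inside one.
        System.out.println("/ObjStm visible in raw bytes:   "
                + containsToken(pdf, "/ObjStm"));
    }
}
```

Running it on both the original and the "compressed" file gives the same yes/no answer as the manual search, which is enough to decide between the two cases listed above.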
>>>>>>>>
>>>>>>>> Another strategy to think about - can your client generate a
>>>>>>>> non-confidential file, so that you can share it, and the
>>>>>>>> "compressed" file?
>>>>>>>> I wish I had direct communication with the clients but due to
>>>>>>>> bureaucracy, I am having to go through multiple layers to get my
>>>>>>>> message across to them. I will share more information as soon as
>>>>>>>> I have it.
>>>>>>>>
>>>>>>>> PS: I sent these image links to my personal email first to make
>>>>>>>> sure that I can open them. I could, and so I am hoping you all
>>>>>>>> can too. If you are unable to open them, please let me know.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Balaji
>>>>>>>>
>>>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr
>>>>>>>> <THausherr@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Balaji Venkatamohan wrote on 20 May 2015 at 03:24:
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your pointers and sorry about the image. I am
>>>>>>>>>>> attaching it with this email.
>>>>>>>>>>>
>>>>>>>>>>> The point I am trying to make is that the PDF, which was
>>>>>>>>>>> decompressed using WriteDecodedDoc, is smaller in size than
>>>>>>>>>>> the original PDF given to us by our customers.
>>>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of
>>>>>>>>>>> PDFBox did not have any PDAcroform fields whereas the
>>>>>>>>>>> decompressed PDF given to us by the customers does contain
>>>>>>>>>>> Acroform fields. Hence I wanted to know how to properly
>>>>>>>>>>> decompress the PDF using pdfbox APIs. The reason why I was
>>>>>>>>>>> analyzing COSStream was to check if the decompression of the
>>>>>>>>>>> compressed PDF was happening correctly while using PDFBox
>>>>>>>>>>> APIs.
>>>>>>>>>>> I know it would have been difficult for you to help me
>>>>>>>>>>> without the actual PDFs. For that, I would like to thank you
>>>>>>>>>>> for your time and pointers.
>>>>>>>>>> Maybe it's worth trying to share the file "visually" with us.
>>>>>>>>>> Open both files (compressed and decompressed) with PDFDebugger
>>>>>>>>>> [1] and post a screenshot of both somewhere (dropbox etc.) and
>>>>>>>>>> share the link with us. Maybe that could shed some light on
>>>>>>>>>> your issue.
>>>>>>>>>>
>>>>>>>>> @Balaji: here's an example of what such a screenshot would look
>>>>>>>>> like:
>>>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>> BR
>>>>>>>>>> Andreas Lehmkühler
>>>>>>>>>>
>>>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>>>>
>>>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr
>>>>>>>>>>> <THausherr@t-online.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>>>>
>>>>>>>>>>>> This is all very confusing... /acroform is in the document
>>>>>>>>>>>> catalog. I don't see how the page content stream is related
>>>>>>>>>>>> to it. The best is that you either go through the source
>>>>>>>>>>>> code, or read the spec and then look at the pdf.
>>>>>>>>>>>>
>>>>>>>>>>>> To find out what's going on, you'd have to start from that
>>>>>>>>>>>> /acroform entry and then compare the two files.
>>>>>>>>>>>>
>>>>>>>>>>>> It is really difficult to help you without the files. The
>>>>>>>>>>>> cause could be a bug in pdfbox, or a malformed pdf...
>>>>>>>>>>>>
>>>>>>>>>>>> Some more ideas:
>>>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>>>> - try the unreleased 2.0 version, that one has some
>>>>>>>>>>>>   improvements in the acroform stuff.
>>>>>>>>>>>>   Note that the API is different.
>>>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>>>>
>>>>>>>>>>>> If you still need help, one possibility would be 1) post the
>>>>>>>>>>>> smallest possible code that fails, and 2) post a small part
>>>>>>>>>>>> of the raw PDF, i.e. the objects relevant to the field in
>>>>>>>>>>>> your code.
>>>>>>>>>>>>
>>>>>>>>>>>> Tilman
>>>>>>>>>>>>
>>>>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
>>>>>>>>>>>>> pages), I tried getting the COSStream for each of the pages:
>>>>>>>>>>>>>
>>>>>>>>>>>>> PDPage firstPage = (PDPage)
>>>>>>>>>>>>>     document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>>>> PDStream pdStream = firstPage.getContents();
>>>>>>>>>>>>> COSStream stream = pdStream.getStream();
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the above code snippet, the object stream, when analyzed
>>>>>>>>>>>>> in debug mode, has the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <>stream
>>>>>>>>>>>>>
>>>>>>>>>>>>> From this point on, using the COSStream object for every
>>>>>>>>>>>>> page, how can I decompress and find out the acroform fields
>>>>>>>>>>>>> given that the unFilteredStream object is null for
>>>>>>>>>>>>> COSStream?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>>>>>> <bvenkata@tibco.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Thank you for your response Tilman.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     I had previously tried using the WriteDecodedDoc for my
>>>>>>>>>>>>>     compressed PDF and I tried to get the number of acro form
>>>>>>>>>>>>>     fields present in the output file generated by
>>>>>>>>>>>>>     WriteDecodedDoc.
>>>>>>>>>>>>>     The API still could not find the acro form fields in the
>>>>>>>>>>>>>     generated decompressed file. Also, the decompressed file
>>>>>>>>>>>>>     generated is 75 KB, which is far less than the original
>>>>>>>>>>>>>     decompressed file which I have (1.6 MB), though I could
>>>>>>>>>>>>>     edit the acro form fields using acrobat reader.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Thanks,
>>>>>>>>>>>>>     Balaji
>>>>>>>>>>>>>
>>>>>>>>>>>>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>             My question is: how do I flatedecode a PDF so
>>>>>>>>>>>>>             that I can find all the acroform fields within
>>>>>>>>>>>>>             it. Any help or pointers would be highly
>>>>>>>>>>>>>             appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>>         You could try the WriteDecodedDoc option of the
>>>>>>>>>>>>>         command line app
>>>>>>>>>>>>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>>>>         Maybe you can have further ideas by comparing the two
>>>>>>>>>>>>>         files with NOTEPAD++.... however the two files might
>>>>>>>>>>>>>         have their objects in different order.
>>>>>>>>>>>>>
>>>>>>>>>>>>>         Tilman
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
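On the original question of how to "flatedecode": a /FlateDecode stream is zlib-compressed data, so the JDK alone can decode the bytes between "stream" and "endstream". The sketch below is only an illustration under that assumption; the class name FlateSketch and both helpers are made up, and it round-trips a sample content stream instead of parsing a real PDF. As noted earlier in the thread, decoding a page content stream only yields drawing operators. The form fields live in dictionaries reachable from the /AcroForm entry of the document catalog, so decoding page streams will not reveal them.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateSketch {

    /** Compresses bytes in zlib format, as a /FlateDecode stream would be. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses zlib data; throws if the input is truncated or invalid. */
    public static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
                // No progress and not finished: the input ended too early.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported zlib data");
                }
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // Sample page content: text operators, nothing about form fields.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.US_ASCII));
    }
}
```

For a whole document, the WriteDecodedDoc tool mentioned above remains the simpler route, since it decodes the streams of a file in one pass.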