Date: Mon, 23 Feb 2015 12:43:19 +0100 (CET)
From: Andreas Lehmkühler
To: users@pdfbox.apache.org
Message-ID: <1814202922.432605.1424691800067.JavaMail.open-xchange@ptangptang.store>
In-Reply-To: <1424127935975.63733@Yuzu.com>
References: <1423866885478.74448@Yuzu.com>,<421622107.2616801.1424086476194.JavaMail.open-xchange@omgreatgod.store> <1424127935975.63733@Yuzu.com>
Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

Hi,

I've improved the self-repair mechanism of the trunk based on Steve's report.

@Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue
still persist?

BR
Andreas Lehmkühler

> On 17 February 2015 at 00:05, Steve Antoch wrote:
>
> Andreas-
> Thanks for the response.
> Sorry for sending directly.
>
> Yes, it tries to read from offset 112085940, but does not find the xrefstm
> there, so that's when it goes searching. It seems to be landing in the middle
> of something else (perhaps an image?)
>
> I tried running the preflight command on the file, and this is what it found
> there.
> This is in the middle of a whole series of repetitive byte patterns like
> these, which is interspersed with other sections of content that is also
> binary only.
>
> 2646
> false
>
> 1.0
Syntax error, Error: Expected a long type at offset 112085= 940, > instead got > '6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3= =8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2= =B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3= =99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›= ;6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3= =8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2= =B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3= =99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›= ;6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3=8D›6l=C3=99=C2=B3f=C3= =8D›6l=C3=99=C2=B1=C2=AF=C3=93"z=C2=B7Cœ3=C3=8D}y =C3=B3= =C2=A3g‚?1=C2=BA=C2=B7=C3=93ž-=C3=B3V=C3=8F:=C3=AB=C2= =BDNs=C3=8BŽ=C2=B86l=C3=99=C2=B3f=C3=85#=C3=AB“=C2= =A8=C3=8E=C3=B7=C3=A5.=C2=A3=3D‰=C3=B9}=C3=95s=C3=9E=C3=BF'
>
>
>
>
> The patterns seem to be:
>
> lots of these: 6lÙ³fÍ›
> interspersed between blocks that are similar to this:
> ±¯Ó"z·Cœ3Í}y ó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#ë“¨Î÷å.£=‰ù}ÕsÞÿ'
>
> It just so happens that the offset 112085940 falls right in the middle of a
> big block of those 6lÙ³fÍ› repetitive blocks.
>
> Not sure if that's any help.
>
> Steve
>
> ________________________________________
> From: Andreas Lehmkühler
> Sent: Monday, February 16, 2015 3:34 AM
> To: users@pdfbox.apache.org
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
>
> Hi,
>
> > On 13 February 2015 at 23:34, Steve Antoch wrote:
> >
> > Hi Tilman and Andreas--
> Please don't contact developers directly, use our mailing lists instead. I've
> put the users list back into the boat...
>
> > I am working with Krasimir on this issue.
> >
> > Although we asked, we were denied permission to send the document out.
> :-(
>
> > The failure is being triggered when we attempt to use the Encrypt() class to
> > password protect the pdf.
> > We end up with the "Expected a long type at offset 113884174, instead got
> > 'xref'" failure.
> >
> > I have debugged into the PDFBox code and found the offending parts.
> >
> > PdfBox is trying to parse an xref table located at 113884174.
> >
> > The problem we are seeing is that inside the trailer it finds the /XRefStm
> > label, and its offset value is returned as 112085940 (which is what is given
> > in the file).
> > However, the checkXRefOffset() call made to verify it doesn't find the xref
> > stream there, so it goes searching and ends up returning the closest xref
> > offset it can find, which happens to be its own offset at 113884174.
> >
> >
> > I believe that there is an error in PdfBox with respect to this fixup logic,
> > even if it had found the 'correct' xref stream.
> > That is because the fixup offset can NEVER work. Every time it fixes up the
> > location, it lands on a section which begins with "xref".
> > The next call is to skip the whitespace, but since there is never any there
> > (it's already proven to be 'xref'), it does not advance the input stream.
> > Then, the first call to parse that xrefstm always calls readObjectID(), which
> > always will throw the exception because the bytes are always 'xref'.
> >
> > So, my questions are:
> >
> > 1) Are these docs fixable or are they truly corrupt?
> Without having a hand on the pdf itself it's hard to give a 100% answer. But I
> guess there has to be a fix, as Adobe is able to open that pdf. I'll try to find
> one, following your description of the pdf
>
> > 2) Is this xref issue a known issue with PdfBox? I would try to create a
> > document that displays the error but I honestly don't know how to do so
> > (beyond sending the ones that we have that DO display it).
> Not until now
>
> > 3) Do you have any idea how these documents end up in this state if they are
> > being edited by tools such as InDesign, Acrobat, etc? Is there something I
> > can do to identify them?
> There are a lot of more or less corrupt files in the wild. Those are created
> using different tools.
>
> > 4) If this is a truly corrupted document, why would Acrobat be able to open
> > these files but pdfBox cannot? Are these streams somehow ignorable? I ask
> > this because I saw this statement on a web page
> > (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > when I was searching for answers on this:
> Adobe implements a lot of self healing mechanisms to repair broken files and we
> try to do so too, but it's complicated.
>
> > – /XrefStm [integer]: specifies the offset from the beginning of the file to
> > the cross-reference stream in the decoded stream. This is only present in
> > hybrid-reference files, which is specified if we would also like to open
> > documents even if the applications don’t support compressed reference
> > streams.
> >
> > Any light you can shed on this is appreciated.
> >
> > Thanks-
> > Steve
> >
> >
> > See below for the pertinent data and the code which is marked with the values
> > as I traced through.
> >
> > I have confirmed that the byte offset of the word xref below is exactly at
> > 113884174.
>
> Does the xref stream start at 112085940 (stream offset from the trailer
> dictionary) or what did you find at that offset?
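A question like "what did you find at that offset?" can be answered without a full parser. The following is a minimal sketch, assuming the file (or the relevant slice of it) is already in memory as a byte array. `XrefOffsetProbe`, `Kind` and `classify` are hypothetical names for illustration and are not part of PDFBox. It only distinguishes three cases: the `xref` keyword of a classic table, an `N G obj` header (which is how a cross-reference stream would begin), or anything else, such as the binary run reported at 112085940.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: reports what kind of bytes sit at an offset.
class XrefOffsetProbe {
    enum Kind { XREF_TABLE, OBJECT_HEADER, OTHER }

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE
    static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    static int skipWhitespace(byte[] data, int pos) {
        while (pos < data.length && isWhitespace(data[pos])) pos++;
        return pos;
    }

    static int skipDigits(byte[] data, int pos) {
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') pos++;
        return pos;
    }

    static boolean startsWith(byte[] data, int pos, String keyword) {
        byte[] k = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos < 0 || pos + k.length > data.length) return false;
        for (int i = 0; i < k.length; i++) {
            if (data[pos + i] != k[i]) return false;
        }
        return true;
    }

    static Kind classify(byte[] data, int offset) {
        if (offset < 0 || offset >= data.length) return Kind.OTHER;
        int pos = skipWhitespace(data, offset);
        if (startsWith(data, pos, "xref")) return Kind.XREF_TABLE;
        // An object header looks like: <object number> <generation> obj
        int afterNumber = skipDigits(data, pos);
        int genStart = skipWhitespace(data, afterNumber);
        int afterGen = skipDigits(data, genStart);
        int objStart = skipWhitespace(data, afterGen);
        boolean wellFormed = afterNumber > pos && genStart > afterNumber
                && afterGen > genStart && objStart > afterGen;
        return wellFormed && startsWith(data, objStart, "obj") ? Kind.OBJECT_HEADER : Kind.OTHER;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the file discussed in this thread.
        String text = "%PDF-1.5\n12 0 obj\n<< /Type /XRef >>\nstream\n...\nendstream\nendobj\n"
                + "xref\n0 1\n0000000000 65535 f \n";
        byte[] sample = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(classify(sample, text.indexOf("12 0 obj")));
        System.out.println(classify(sample, text.indexOf("xref\n")));
        System.out.println(classify(sample, text.indexOf("stream")));
    }
}
```

For a file of the size discussed here, one would read only a small window around the offset (for example with `RandomAccessFile.seek`) instead of loading the whole file.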
>
>
> > xref
> > 0 53641
> > 0000000000 65535 f
> > 0000000017 00000 n
> >
> >
> >
> >
> > trailer
> > <<
> > /Size 53641
> > /Root 1 0 R
> > /XRefStm 112085940
> > /Info 8 0 R
> > /ID [<19790A83488211E283B50017F203355C>
> > ]
> > >>
> > startxref
> > 113884174
> > %%EOF1 0 obj<<
> > R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > endobj
> >
> >
> > protected COSDictionary parseXref(long startXRefOffset) throws IOException
> > {
> >     pdfSource.seek(startXRefOffset);
> >     long startXrefOffset = parseStartXref();
> >     // check the startxref offset
> >     long fixedOffset = checkXRefOffset(startXrefOffset);
> >     if (fixedOffset > -1)
> >     {
> >         startXrefOffset = fixedOffset;
> >     }
> >     document.setStartXref(startXrefOffset);
> >     long prev = startXrefOffset;
> >     // ---- parse whole chain of xref tables/object streams using PREV reference
> >     while (prev > -1) // <== prev here is 113884174.
> >     {
> >         // seek to xref table
> >         pdfSource.seek(prev);
> >
> >         // skip white spaces
> >         skipSpaces();
> >         // -- parse xref
> >         if (pdfSource.peek() == X)
> >         {
> >             // xref table and trailer
> >             // use existing parser to parse xref table
> >             parseXrefTable(prev);
> >             // parse the last trailer.
> >             trailerOffset = pdfSource.getOffset();
> >             // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> >             while (isLenient && pdfSource.peek() != 't')
> >             {
> >                 if (pdfSource.getOffset() == trailerOffset)
> >                 {
> >                     // warn only the first time
> >                     LOG.warn("Expected trailer object at position " + trailerOffset
> >                             + ", keep trying");
> >                 }
> >                 readLine();
> >             }
> >             if (!parseTrailer())
> >             {
> >                 throw new IOException("Expected trailer object at position: "
> >                         + pdfSource.getOffset());
> >             }
> >             COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> >             // check for a XRef stream, it may contain some object ids of
> >             // compressed objects
> >             if (trailer.containsKey(COSName.XREF_STM)) // <== YES - but falue
> >             {
> >                 int streamOffset = trailer.getInt(COSName.XREF_STM);
> >                 // <== This returns 112085940, which is the value from the trailer
> >                 // check the xref stream reference
> >                 fixedOffset = checkXRefOffset(streamOffset);
> >                 // <== checks it and returns 113884174 instead
> >                 if (fixedOffset > -1 && fixedOffset != streamOffset)
> >                 {
> >                     streamOffset = (int)fixedOffset;
> >                     trailer.setInt(COSName.XREF_STM, streamOffset);
> >                 }
> >                 pdfSource.seek(streamOffset); // <== Seeks to 113884174
> >                 //readExpectedString(XREF_TABLE, false);
> >                 skipSpaces(); // <=== It's ON "xref", so it doesn't skip anything
> >                 parseXrefObjStream(prev, false);
> >                 // <== goes in here, first thing it tries to do is
> >                 // readObjectNumber(), which can't work because it's 'xref' -- BOOM
> >             }
> >             prev = trailer.getInt(COSName.PREV);
> >             if (prev > -1)
> >             {
> >                 // check the xref table reference
> >                 fixedOffset = checkXRefOffset(prev);
> >                 if (fixedOffset > -1 && fixedOffset != prev)
> >                 {
> >                     prev = fixedOffset;
> >                     trailer.setLong(COSName.PREV, prev);
> >                 }
> >             }
> >         }
> >         else
> >         {
> >             // parse xref stream
> >             prev = parseXrefObjStream(prev, true);
> >             if (prev > -1)
> >             {
> >                 // check the xref table reference
> >                 fixedOffset = checkXRefOffset(prev);
> >                 if (fixedOffset > -1 && fixedOffset != prev)
> >                 {
> >                     prev = fixedOffset;
> >                     COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> >                     trailer.setLong(COSName.PREV, prev);
> >                 }
> >             }
> >         }
> >     }
> >     // ---- build valid xrefs out of the xref chain
> >     xrefTrailerResolver.setStartxref(startXrefOffset);
> >     COSDictionary trailer = xrefTrailerResolver.getTrailer();
> >     document.setTrailer(trailer);
> >     document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> >     // check the offsets of all referenced objects
> >     checkXrefOffsets();
> >     // copy xref table
> >     document.addXRefTable(xrefTrailerResolver.getXrefTable());
> >     return trailer;
> > }
>
>
> BR
> Andreas Lehmkühler
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
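The trailer quoted in this thread declares `/XRefStm 112085940` while `startxref` is 113884174, and the trace reports the repair step replacing the former with the latter. The following is a minimal sketch, assuming the tail of the file is available as text, of how those two numbers could be pulled out and the "repaired offset points back at the xref table itself" case flagged before any attempt to parse it as a stream. `HybridTrailerCheck`, `find` and `pointsBackAtTable` are hypothetical names, and this is not how PDFBox implements its repair; a regular expression is used only to keep the illustration short.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class HybridTrailerCheck {
    static final Pattern XREF_STM = Pattern.compile("/XRefStm\\s+(\\d+)");
    static final Pattern STARTXREF = Pattern.compile("startxref\\s+(\\d+)");

    // Returns the first number captured by the pattern, if any.
    static OptionalLong find(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    // A corrected /XRefStm offset equal to startxref cannot be a cross-reference
    // stream: that position holds the "xref" keyword of the table being parsed.
    static boolean pointsBackAtTable(long correctedXRefStm, long startXref) {
        return correctedXRefStm == startXref;
    }

    public static void main(String[] args) {
        // Values as quoted in the thread.
        String tail = "trailer\n<<\n/Size 53641\n/Root 1 0 R\n/XRefStm 112085940\n"
                + "/Info 8 0 R\n>>\nstartxref\n113884174\n%%EOF";
        long declared = find(XREF_STM, tail).orElse(-1L);
        long startXref = find(STARTXREF, tail).orElse(-1L);
        long corrected = 113884174L; // the value the trace reports checkXRefOffset() returning
        System.out.println("declared /XRefStm: " + declared);
        System.out.println("startxref:         " + startXref);
        if (pointsBackAtTable(corrected, startXref)) {
            System.out.println("corrected offset is the xref table itself; skip the stream");
        }
    }
}
```

Whether skipping the `/XRefStm` entry is the right recovery for a given file is a separate question; the sketch only shows that the self-reference is cheap to detect.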