pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: PDFBox 1.8.4 and pdf's generated by MS Word
Date Mon, 31 Mar 2014 16:54:00 GMT
Hi Tim,

that’s a bug. 

Explanation: The original file is what's called a hybrid-reference file. That's for compatibility
with readers which do not support cross-reference streams. The file generated by PDFBox
no longer uses hybrid references but still contains the XRefStm entry in the trailer
dictionary, so that entry now points at a stale offset.
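
To see whether a saved file is affected, the leftover entry can be checked roughly as follows. This is only a minimal sketch using the plain JDK, not PDFBox code: the class and method names (XRefStmCheck, findXRefStmOffset, pointsAtObject) are made up for illustration, and it assumes the trailer is stored uncompressed so that "/XRefStm" can be found with a plain text search. It reads the last /XRefStm offset and tests whether an indirect object header ("<number> <generation> obj") really starts at that byte offset, which is what a parser following the entry would expect to find.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class XRefStmCheck {

    // "/XRefStm <offset>" as written in a trailer dictionary.
    private static final Pattern XREF_STM = Pattern.compile("/XRefStm\\s+(\\d+)");

    // Start of an indirect object: "<objNum> <genNum> obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    /** Returns the offset of the last /XRefStm entry in the file, or -1 if there is none. */
    static long findXRefStmOffset(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = XREF_STM.matcher(text);
        long offset = -1;
        while (m.find()) {
            offset = Long.parseLong(m.group(1));
        }
        return offset;
    }

    /** Returns true if an indirect object header starts exactly at the given byte offset. */
    static boolean pointsAtObject(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return false;
        }
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = OBJ_HEADER.matcher(text);
        m.region((int) offset, text.length());
        return m.lookingAt();
    }

    public static void main(String[] args) throws Exception {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = findXRefStmOffset(pdf);
        if (offset < 0) {
            System.out.println("No /XRefStm entry found.");
        } else if (pointsAtObject(pdf, offset)) {
            System.out.println("/XRefStm " + offset + " points at an object header.");
        } else {
            System.out.println("/XRefStm " + offset + " does not point at an object header (stale?).");
        }
    }
}
```

Run it with the path of the saved file as the only argument. A "stale" result only says that no object header starts at that offset; it does not check that the object found there is actually a cross-reference stream.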

Could you file an issue at https://issues.apache.org/jira/browse/PDFBOX?

BR
Maruan

On 31.03.2014 at 17:00, Tim Costermans <tim.costermans@unifiedpost.com> wrote:

> Hi Maruan,
> 
> Thanks for pointing out that the attachments didn't get through.
> Two PDF files and one patch file (containing a test case to reproduce the issue) are available here: https://www.dropbox.com/sh/291b24dstixowgt/aQTZl5j_pP
> 
> Kind regards,
> Tim
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
> Sent: Monday, 31 March 2014 16:47
> To: users@pdfbox.apache.org
> Subject: Re: PDFBox 1.8.4 and pdf's generated by MS Word
> 
> Hi Tim,
> 
> the attachment didn't make it through - could you upload it to a public location?
> 
> BR
> 
> Maruan
> 
> On 31.03.2014 at 12:56, Tim Costermans <tim.costermans@unifiedpost.com> wrote:
> 
>> Hello,
>> 
>> I've written a test case to reproduce the issue. (see patch)
>> 
>> Could someone have a look at it and give me some pointers on how to solve this issue? I applied this patch on the 1.8.4 tag I checked out locally.
>> The problem is that I don't know the PDF spec, so I don't know how to fix this in the PDFBox source code.
>> 
>> Word2010.pdf is the input PDF. I open the document with PDFBox and add a string to it, in this case 'Hello world!'.
>> Afterwards I save the PDF.
>> 
>> If I look at the content of the PDF before and after I modified it (using Notepad++), I see this:
>> 
>> Word2010.pdf:
>> Line 647: <</Size 18/Root 1 0 R/Info 7 0 R/ID[<AE9AF29D5A22AE47B47C4DA29170BE64><AE9AF29D5A22AE47B47C4DA29170BE64>] /Prev 81972/XRefStm 81702>>
>> 
>> modified_Word2010.pdf:
>> Line 791: /XRefStm 81702
>> 
>> XRefStm is not updated, although the original PDF had multiple revisions that were merged into a new PDF document.
>> 
>> A third-party library we use depends on this XRefStm value and cannot open the PDF after it was modified (for the stack trace, see my previous message).
>> Any help would be much appreciated.
>> 
>> Kind regards,
>> 
>> Tim Costermans
>> 
>> From: Tim Costermans
>> Sent: Wednesday, 26 March 2014 14:31
>> To: 'users@pdfbox.apache.org'
>> Subject: PDFBox 1.8.4 and pdf's generated by MS Word
>> 
>> Hello,
>> 
>> It seems that PDFs generated by MS Word 2010 or 2013 are a recipe for trouble in combination with PDFBox version 1.8.0 or 1.8.4.
>> I upgraded to PDFBox 1.8.4 and one issue remains:
>> 
>> Caused by: **thirdparty.pdf.exceptions.PDFParsingException: [offset=91308]Expected numeric object for object number
>>                        at **thirdparty.pdf.exceptions.PDFParsingException.newInstance(PDFParsingException.java:58)
>>                        at **thirdparty.pdf.io.PDFParser.throwEx(PDFParser.java:1215)
>>                        at **thirdparty.pdf.io.PDFParser.readCompressedCrossRefTable(PDFParser.java:805)
>>                        at **thirdparty.pdf.io.PDFParser.readCrossRefTable(PDFParser.java:1175)
>>                        at **thirdparty.pdf.PDFDocument.open(PDFDocument.java:154)
>>                        at **thirdparty.PDFDocument.open(PDFDocument.java:124)
>>                        at com.*****.sign.pdf.PDFPresigner.presign(PDFPresigner.java:24)
>>                        ... 26 more
>> 
>> How to reproduce:
>> 1) Fire up MS Word 2010, type some text, save as PDF.
>> 2) Open this PDF file with Notepad++. You will notice the following at the bottom of the file:
>> ...
>> trailer
>> <</Size 18/Root 1 0 R/Info 7 0 R/ID[<7AE435CBC968B94F8B050F40F6D5CE5F><7AE435CBC968B94F8B050F40F6D5CE5F>] >>
>> startxref
>> 82089
>> %%EOF
>> xref
>> 0 0
>> trailer
>> <</Size 18/Root 1 0 R/Info 7 0 R/ID[<7AE435CBC968B94F8B050F40F6D5CE5F><7AE435CBC968B94F8B050F40F6D5CE5F>] /Prev 82089/XRefStm 81819>>
>> startxref
>> 82605
>> %%EOF
>> 
>> Our application is trying to add an image to this PDF using PDFBox. When calling PDFDocument.save(), the "revisions" are merged and a new PDF is created.
>> The newly created PDF is passed to a third party that tries to open it, but it fails because XRefStm is not correctly updated during save.
>> Probably related to https://issues.apache.org/jira/browse/PDFBOX-1822
>> 
>> I also tried PDFDocument.incrementalSave(), but then I run into a NullPointerException caused by PDFXRefStream: List<Integer> indexEntry = getIndexEntry(); contains two null objects (first and last are still null when they are added to the list).
>> How do I solve this issue?
>> What's the real issue here?
>> I'm not in control of the pdf's that the application can receive.
>> 
>> I also ran into the following bug but worked around it: https://issues.apache.org/jira/browse/PDFBOX-1838
> 

