pdfbox-users mailing list archives

From Malcolm Vincent <malcolmvinc...@gmail.com>
Subject Re: Adobe InDesign PDF
Date Fri, 10 Nov 2017 09:04:32 GMT
Hi Tilman,

Thanks for replying. I'll see if I can get permission from the client
to upload the file.

The same behaviour occurs in PDFBox 2.0.x - I am on 2.0.8 currently.

I'm now fairly sure what is happening.

The "issue" (if it is an issue) is that I was treating the streams the
same way that PDFReader does, and loading them one at a time.

It appears that this is not safe, because the individual streams are
not complete, independently parseable entities. Higher-level
constructs, such as a COSDictionary, can be split across stream
boundaries by the Adobe PDF generators. So although atomic tokens like
an int may generally be OK, more complex things are not.

This is why my code that uses PDFBox throws the warnings, and also why
it happens with the PDFReader / debug function in the app. Every time
I click on a stream containing a dictionary that is partly in one
stream and partly in the next, the parser logs a warning on the
console.

I am unclear exactly how this fits with the specification - a quick
"find" has not cleared it up - but I suppose in theory since PDF is a
binary format the stream could break at any byte and any token could
be split right in the middle.

Following on from that analysis this appears to be the way to get the
tokens on a page and process them ... at least it has resolved my
problem on the PDF files I am currently processing ...

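    // Parse the page as a whole, so all its content streams are read together
    PDPage page = my_pdf.getPage(i);
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    // Replace the page contents with the processed tokens
    page.setContents(processTokens(parser.getTokens()));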

where processTokens() is my worker function.

Of course, this assumes that the generator has not split an atomic
token across two streams, since the PDFBox documentation says that
streams parsed this way are concatenated with a whitespace character
between them.
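
To make that caveat concrete, here is a minimal sketch in plain Java (no
PDFBox involved). The class and method names are hypothetical, and the
whitespace-only tokenizer is a deliberate simplification that is only
adequate for plain numbers and operators. It joins two fragments with a
newline, as the concatenation described above would, and shows that a
split between tokens is harmless while a split inside a number changes
the token count.

```java
import java.util.Arrays;
import java.util.List;

final class SplitTokenDemo {

    // Joins stream fragments with a single newline between them
    // (assumption: this mimics the whitespace-separated concatenation).
    static String joinWithNewline(List<String> parts) {
        return String.join("\n", parts);
    }

    // Naive tokenizer: splits on whitespace only. Not a real PDF lexer.
    static List<String> tokens(String content) {
        return Arrays.asList(content.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Split falls between two tokens: 7 tokens, same as the original line.
        System.out.println(tokens(joinWithNewline(
                Arrays.asList("9 0 0 9 99.3376", "555.6879 Tm"))));

        // Split falls inside the number 99.3376: it becomes two operands,
        // so Tm would see 8 tokens instead of 7.
        System.out.println(tokens(joinWithNewline(
                Arrays.asList("9 0 0 9 99.33", "76 555.6879 Tm"))));
    }
}
```

In the second case the inserted whitespace turns one operand into two,
which is exactly the situation the page-level approach cannot repair.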

For completeness here is a fragment of one of my PDFs which shows the
dictionary split across the end of one stream and the start of the
next ...

    /Span <</Lang (en-GB)/MCID 8 >>BDC
    BT
    9 0 0 9 99.3376 555.6879 Tm
    (Some text)Tj
    ET
    EMC
    /Span <</Lang
    endstream
    endobj
    19 0 obj
    <<
    /Length 2852
    >>
    stream
    (en-GB)/MCID 9 >>BDC
    BT
    9 0 0 9 145.7323 555.6879 Tm
    (Some more text)Tj
    ET
    EMC
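
The effect of the split in the fragment above can be reproduced without
PDFBox. The following is a rough sketch in plain Java, with hypothetical
class and method names. The balance check is crude: it only counts `<<`
and `>>` pairs and ignores the possibility of those characters appearing
inside string literals. It shows that each stream on its own has an
unbalanced dictionary, while the two streams joined with a newline are
balanced.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class ContentStreamJoin {

    // Concatenates decoded content streams in order, one newline between parts.
    static String join(List<byte[]> streams) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < streams.size(); i++) {
            if (i > 0) {
                sb.append('\n');
            }
            sb.append(new String(streams.get(i), StandardCharsets.ISO_8859_1));
        }
        return sb.toString();
    }

    // Crude check: every "<<" must be closed by a later ">>".
    // Assumption: no "<<" or ">>" occurs inside a string literal.
    static boolean dictionariesBalanced(String content) {
        int depth = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            char c = content.charAt(i);
            char next = content.charAt(i + 1);
            if (c == '<' && next == '<') {
                depth++;
                i++; // skip the second '<'
            } else if (c == '>' && next == '>') {
                depth--;
                i++; // skip the second '>'
                if (depth < 0) {
                    return false; // closed a dictionary that was never opened
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        byte[] first = "/Span <</Lang (en-GB)/MCID 8 >>BDC\nEMC\n/Span <</Lang"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] second = "(en-GB)/MCID 9 >>BDC\nEMC"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(dictionariesBalanced(join(Arrays.asList(first))));
        System.out.println(dictionariesBalanced(join(Arrays.asList(second))));
        System.out.println(dictionariesBalanced(join(Arrays.asList(first, second))));
    }
}
```

Only the last call, on the joined content, reports a balanced result,
which matches the warnings appearing when a single stream is parsed alone.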


Best Wishes,
Malcolm

On 9 November 2017 at 17:58, Tilman Hausherr <THausherr@t-online.de> wrote:
> Hi,
>
> What PDFBox version are you using and can you upload the PDF to a
> sharehoster? Splits between tokens shouldn't be a problem.
>
> Tilman
>
> PS: please don't start a new thread like you did today, this is confusing.
> Answer to yourself on the list instead.
>
>
> Am 09.11.2017 um 09:53 schrieb Malcolm Vincent:
>>
>> Hi,
>>
>> I've been using PDFBox to read and write PDFs successfully for a while
>> and have started running into a few issues recently.
>>
>> I seem to be getting the following errors when loading PDFs generated
>> in Adobe InDesign / Acrobat Distiller (the PDFs render fine in Acrobat
>> Reader, pdf.js and Chrome).
>>
>> The first one seems to be a UI thing for the PDFReader function so I'm
>> ignoring it.
>>
>> The second and third are the problem. They are both related. I get
>> them when I use PDFBox in my own code as well as in the app, but since
>> they are warnings they do not flag up as runtime errors I can catch.
>>
>> #1
>> Nov 09, 2017 8:31:45 AM java.util.prefs.WindowsPreferences <init>
>> WARNING: Could not open/create prefs root node Software\JavaSoft\Prefs
>> at root 0x80000002. Windows RegCreateKeyEx(...) returned error code 5.
>>
>> #2
>> Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
>> parseCOSDictionaryNameValuePair
>> WARNING: Bad Dictionary Declaration
>> org.apache.pdfbox.pdfparser.InputStreamSource@498fa7e0
>>
>> #3
>> Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
>> parseCOSDictionary
>> WARNING: Invalid dictionary, found: '?' but expected: '/' at offset 2861
>>
>> I have traced the problem to the following PDF content at the end of
>> Page 1 Stream 1.
>>
>> /Span <</Lang (en-GB)/MCID 8 >>BDC
>> BT
>> 9 0 0 9 99.3376 555.6879 Tm
>> (text string here)Tj
>> ET
>> EMC
>> /Span <</Lang
>> endstream
>> endobj
>>
>> The last dictionary entry seems to be incomplete.
>>
>> When I go on to process the files in my own code (I iterate over the
>> content stream, perform my function and replace the stream content),
>> the stream ends up incorrect and the resulting PDFs will not load in
>> Acrobat Reader (although they do in Chrome).
>>
>> My options appear to be
>>
>> (a) grep the file for this and remove or overwrite it with a string
>> operation before using PDFBox
>>
>> (b) update the source to cope with this condition
>>
>> (c) kick the PDF back as invalid - difficult since the file is a
>> "valid" PDF that is generated in Adobe and reads ok in Adobe
>>
>> I have verified this by manually overtyping <</Lang with spaces and
>> then everything works perfectly in my own code and in PDFReader.
>>
>> Any thoughts?
>>
>> Best wishes,
>> Malcolm.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>
>

