pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Adobe InDesign PDF
Date Thu, 16 Nov 2017 17:10:35 GMT
Now I got it. Yes, I forgot that PDFDebugger needs to parse individual 
streams when showing streams in the stream window, and then errors can 
happen because streams are incomplete.

How to reproduce the problem:

         PDDocument doc = new PDDocument();
         PDPage page = new PDPage();
         PDPageContentStream cs = new PDPageContentStream(doc, page,
                 AppendMode.APPEND, false);
         OutputStream os = cs.getOutput();
         os.write(("0 g\n"
                 + "/Span <</Lang (en-GB)/MCID 8 >>BDC\n"
                 + "    BT\n"
                 + "    9 0 0 9 99.3376 555.6879 Tm\n"
                 + "    (Some text)Tj\n"
                 + "    ET\n"
                 + "    EMC\n"
                 + "    /Span <</Lang").getBytes());
         os.flush();
         cs.close();
         cs = new PDPageContentStream(doc, page, AppendMode.APPEND, false);
         os = cs.getOutput();
         os.write(("(en-GB)/MCID 9 >>BDC\n"
                 + "    BT\n"
                 + "    9 0 0 9 145.7323 555.6879 Tm\n"
                 + "    (Some more text)Tj\n"
                 + "    ET\n"
                 + "    EMC").getBytes());
         os.flush();
         cs.close();
         doc.addPage(page);
         doc.save(new File("2streams.pdf"));

However, there is no solution; this is by design. To do syntax
highlighting, the debugger has to parse each stream on its own. It
happens in StreamPane.java in the debugger subproject.
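
To illustrate why a stream that parses cleanly as part of the whole page can still fail on its own, here is a minimal stand-alone sketch. It is not PDFBox code and does not reproduce PDFBox's parser: the class SplitStreamDemo and its methods are hypothetical names, and the check is deliberately crude (it only counts "<<" and ">>" and ignores the fact that string literals could contain those characters). It joins the fragments with a single whitespace character, as discussed further down in this thread.

```java
// Hypothetical illustration only; none of these names exist in PDFBox.
class SplitStreamDemo {

    // Returns true when every "<<" is closed by a later ">>".
    // Simplification: string literals and comments are not skipped.
    static boolean dictionariesBalanced(String content) {
        int depth = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            char c = content.charAt(i);
            char next = content.charAt(i + 1);
            if (c == '<' && next == '<') {
                depth++;
                i++; // skip the second '<'
            } else if (c == '>' && next == '>') {
                depth--;
                i++; // skip the second '>'
                if (depth < 0) {
                    return false; // ">>" seen before any "<<"
                }
            }
        }
        return depth == 0;
    }

    // Joins stream fragments with one whitespace character between them.
    static String joinStreams(String... streams) {
        return String.join("\n", streams);
    }

    public static void main(String[] args) {
        String first = "/Span <</Lang (en-GB)/MCID 8 >>BDC\nEMC\n/Span <</Lang";
        String second = "(en-GB)/MCID 9 >>BDC\nEMC";

        System.out.println(dictionariesBalanced(first));   // false
        System.out.println(dictionariesBalanced(second));  // false
        System.out.println(dictionariesBalanced(joinStreams(first, second))); // true
    }
}
```

Each fragment looks broken in isolation, while the joined content is balanced, which matches what the debugger shows when a single stream is selected.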

Tilman



On 16.11.2017 at 11:25, Malcolm Vincent wrote:
> Hi Tilman,
>
> They fail because they parse the stream every time you click on it -
> rather than the whole page.
>
> I haven't checked the code yet to find it, but you can tell because
> every time you click on (an incomplete) stream in PDFReader it throws
> the same exceptions again on the console. From a debugging perspective
> this is a great feature to have and most of the time streams seem to
> be generated as complete entities.
>
> As I said, now that I know you can't parse streams individually in
> normal use, I don't have a problem. It was my logic that was at fault.
>
> Best Wishes
> Malcolm.
>
>
>
>
>
>
> On 10 November 2017 at 16:46, Tilman Hausherr <THausherr@t-online.de> wrote:
>> Hi,
>>
>> Yes it is true that page content streams can be split. But
>> PDFReader/PDFDebugger should be able to handle that because deep inside,
>> PDFStreamParser() is called when a page is rendered. PDFReader/PDFDebugger
>> do show the individual streams for debugging purposes but they're not
>> rendering them individually. So I'm wondering how it is possible that they
>> fail but you succeed.
>>
>> Re the java warning, I get it too and didn't even bother to fix it. See
>> https://stackoverflow.com/questions/5354838/java-java-util-preferences-failing
>> https://stackoverflow.com/questions/16428098/groovy-shell-warning-could-not-open-create-prefs-root-node
>>
>> Tilman
>>
>>
>>
>> On 10.11.2017 at 10:04, Malcolm Vincent wrote:
>>> Hi Tilman,
>>>
>>> Thanks for replying. I'll see if I can get permission from the client
>>> to upload the file.
>>>
>>> The same behaviour occurs in PDFBox 2.0.x - I am on 2.0.8 currently.
>>>
>>> I'm pretty definite now about what's happening.
>>>
>>> The "issue" (if it is an issue) is that I was treating the streams the
>>> same way that PDFReader does, and loading them one at a time.
>>>
>>> It appears that this is not safe, because the streams are not
>>> complete, parseable entities in their own right, and some
>>> higher-level constructs (a COSDictionary, for example) can
>>> be split across stream boundaries by the Adobe PDF generators. So
>>> although atomic tokens like integers may generally be OK, more
>>> complex things are not.
>>>
>>> This is why my code that uses PDFbox is throwing the warnings and also
>>> why it happens with the PDFReader / debug function in the app. Every
>>> time I click on a stream with a dictionary that is partly in one
>>> stream and partly in another the parser throws a warning on the
>>> console.
>>>
>>> I am unclear exactly how this fits with the specification - a quick
>>> "find" has not cleared it up - but I suppose in theory since PDF is a
>>> binary format the stream could break at any byte and any token could
>>> be split right in the middle.
>>>
>>> Following on from that analysis this appears to be the way to get the
>>> tokens on a page and process them ... at least it has resolved my
>>> problem on the PDF files I am currently processing ...
>>>
>>>       PDPage page = my_pdf.getPage(i);
>>>       PDFStreamParser parser = new PDFStreamParser(page);
>>>       parser.parse();
>>>       page.setContents(processTokens(parser.getTokens()));
>>>
>>> where processTokens() is my worker function.
>>>
>>> Of course this assumes that the generator has not broken atomic tokens
>>> in the middle of the content since the PDFBox doc says streams parsed
>>> this way are concatenated with a whitespace character between them.
>>>
>>> For completeness here is a fragment of one of my PDFs which shows the
>>> dictionary split across the end of one stream and the start of the
>>> next ...
>>>
>>>       /Span <</Lang (en-GB)/MCID 8 >>BDC
>>>       BT
>>>       9 0 0 9 99.3376 555.6879 Tm
>>>       (Some text)Tj
>>>       ET
>>>       EMC
>>>       /Span <</Lang
>>>       endstream
>>>       endobj
>>>       19 0 obj
>>>       <<
>>>       /Length 2852
>>>       >>
>>>       stream
>>>       (en-GB)/MCID 9 >>BDC
>>>       BT
>>>       9 0 0 9 145.7323 555.6879 Tm
>>>       (Some more text)Tj
>>>       ET
>>>       EMC
>>>
>>>
>>> Best Wishes,
>>> Malcolm
>>>
>>> On 9 November 2017 at 17:58, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>> Hi,
>>>>
>>>> What PDFBox version are you using and can you upload the PDF to a
>>>> sharehoster? Splits between tokens shouldn't be a problem.
>>>>
>>>> Tilman
>>>>
>>>> PS: please don't start a new thread like you did today; it is
>>>> confusing. Reply to your own message on the list instead.
>>>>
>>>>
>>>> On 09.11.2017 at 09:53, Malcolm Vincent wrote:
>>>>> Hi,
>>>>>
>>>>> I've been using PDFBox to read and write PDFs successfully for a while
>>>>> and have started running into a few issues recently.
>>>>>
>>>>> I seem to be getting the following errors when loading PDFs generated
>>>>> in Adobe InDesign / Acrobat Distiller (the PDFs render fine in Acrobat
>>>>> Reader, pdf.js and Chrome).
>>>>>
>>>>> The first one seems to be a UI thing for the PDFReader function so I'm
>>>>> ignoring it.
>>>>>
>>>>> The second and third are the problem. They are both related. I get
>>>>> them when I use PDFBox in my own code as well as in the app, but since
>>>>> they are warnings they do not flag up as runtime errors I can catch.
>>>>>
>>>>> #1
>>>>> Nov 09, 2017 8:31:45 AM java.util.prefs.WindowsPreferences <init>
>>>>> WARNING: Could not open/create prefs root node Software\JavaSoft\Prefs
>>>>> at root 0x80000002. Windows RegCreateKeyEx(...) returned error code 5.
>>>>>
>>>>> #2
>>>>> Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
>>>>> parseCOSDictionaryNameValuePair
>>>>> WARNING: Bad Dictionary Declaration
>>>>> org.apache.pdfbox.pdfparser.InputStreamSource@498fa7e0
>>>>>
>>>>> #3
>>>>> Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
>>>>> parseCOSDictionary
>>>>> WARNING: Invalid dictionary, found: '?' but expected: '/' at offset 2861
>>>>>
>>>>> I have traced the problem to the following PDF content at the end of
>>>>> Page 1 Stream 1.
>>>>>
>>>>> /Span <</Lang (en-GB)/MCID 8 >>BDC
>>>>> BT
>>>>> 9 0 0 9 99.3376 555.6879 Tm
>>>>> (text string here)Tj
>>>>> ET
>>>>> EMC
>>>>> /Span <</Lang
>>>>> endstream
>>>>> endobj
>>>>>
>>>>> The last dictionary entry seems to be incomplete.
>>>>>
>>>>> When I process the files in my own code (iterating over the
>>>>> content stream, performing my function and replacing the stream
>>>>> content), the stream ends up incorrect and the resulting PDFs will
>>>>> not load in Acrobat Reader (although they do in Chrome).
>>>>>
>>>>> My options appear to be
>>>>>
>>>>> (a) grep the file for this and remove or overwrite it with a string
>>>>> operation before using PDFBox
>>>>>
>>>>> (b) update the source to cope with this condition
>>>>>
>>>>> (c) kick the PDF back as invalid - difficult since the file is a
>>>>> "valid" PDF that is generated in Adobe and reads ok in Adobe
>>>>>
>>>>> I have verified this by manually overtyping <</Lang with spaces and
>>>>> then everything works perfectly in my own code and in PDFReader.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Best wishes,
>>>>> Malcolm.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>



