pdfbox-users mailing list archives

From Malcolm Vincent <malcolmvinc...@gmail.com>
Subject Dictionary Issue
Date Thu, 09 Nov 2017 14:05:26 GMT
Hi,

After more testing I can confirm that the issue occurs when PDFBox
parses a stream in which a token is split across the boundary between
that stream and the next one.

That is, the whole token is not contained in the stream being parsed.

Perhaps there is a way to get all the tokens of the page content, with
PDFBox reading the streams as necessary, rather than parsing the
individual streams one by one as I am doing at the moment.

In this excerpt you can clearly see where the COSDictionary is split
across the stream boundary:

/Span <</Lang (en-GB)/MCID 8 >>BDC
BT
9 0 0 9 99.3376 555.6879 Tm
(Some text)Tj
ET
EMC
/Span <</Lang
endstream
endobj
19 0 obj
<<
/Length 2852
>>
stream
(en-GB)/MCID 9 >>BDC
BT
9 0 0 9 145.7323 555.6879 Tm
(Some more text)Tj
ET
EMC
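
To illustrate why parsing the streams one at a time fails, here is a
minimal sketch in plain Java. It makes no PDFBox calls; the class name
StreamBoundaryDemo and both method names are made up for this example.
It joins the two halves from the excerpt above into one logical stream
(separated by a newline) and checks whether every "<<" has a matching
">>". The check is deliberately simplistic and is only an assumption
about how one might detect the problem, not how PDFBox parses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of PDFBox.
public class StreamBoundaryDemo {

    /** Joins several content streams into one logical stream, newline-separated. */
    public static InputStream joinStreams(List<byte[]> streams) {
        List<InputStream> parts = new ArrayList<>();
        for (byte[] s : streams) {
            parts.add(new ByteArrayInputStream(s));
            parts.add(new ByteArrayInputStream(new byte[] {'\n'}));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /**
     * Returns true if every "<<" is closed by a later ">>".
     * Simplification: "<<" or ">>" inside literal strings is not skipped.
     */
    public static boolean dictionariesBalanced(InputStream in) throws IOException {
        int depth = 0;
        int prev = -1;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<' && prev == '<') {
                depth++;
                prev = -1; // the pair has been consumed
            } else if (c == '>' && prev == '>') {
                depth--;
                if (depth < 0) {
                    return false; // a ">>" with no opening "<<"
                }
                prev = -1;
            } else {
                prev = c;
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) throws IOException {
        // The two halves of the dictionary from the excerpt.
        byte[] first = "/Span <</Lang".getBytes(StandardCharsets.US_ASCII);
        byte[] second = "(en-GB)/MCID 9 >>BDC".getBytes(StandardCharsets.US_ASCII);

        System.out.println("first alone:  "
                + dictionariesBalanced(new ByteArrayInputStream(first)));
        System.out.println("second alone: "
                + dictionariesBalanced(new ByteArrayInputStream(second)));
        System.out.println("joined:       "
                + dictionariesBalanced(joinStreams(Arrays.asList(first, second))));
    }
}
```

Each half is incomplete on its own, but the joined stream contains the
whole dictionary. So what I am looking for is the PDFBox equivalent of
joinStreams: a way to hand the parser the page content as one stream.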



Best Wishes,
Malcolm.

