pdfbox-users mailing list archives

From Malcolm Vincent <malcolmvinc...@gmail.com>
Subject Re: Dictionary Issue
Date Thu, 09 Nov 2017 16:33:07 GMT
Following on from that analysis, this appears to be the way to get the
tokens on a page and process them:

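        // processTokens(...) is my own helper (not shown); it returns the new page content.
        PDPage page = my_pdf.getPage(i);
        // Parsing the page, rather than one stream, covers all of its content streams.
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        page.setContents(processTokens(parser.getTokens()));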

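Note that even though you can access the page's content streams
individually, you should not parse them one at a time: as the excerpt in
my earlier message below shows, a dictionary can be split across a
stream boundary.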
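
To illustrate why per-stream parsing is unsafe, here is a minimal
stand-alone sketch. It is plain Java and does not use PDFBox; the class
and method names are made up for illustration, and joining the streams
with a newline is an assumption of the sketch. It counts unclosed "<<"
dictionary openers in a shortened version of the two streams from my
earlier message, first separately and then joined in order.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone demo; it does not use PDFBox. */
class SplitStreamDemo {

    /**
     * Returns how many "<<" dictionary openers are still unclosed at the
     * end of the text (negative if there are more ">>" than "<<").
     * Simplification: string literals and comments are not skipped.
     */
    static int unclosedDictionaries(String content) {
        int depth = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            String pair = content.substring(i, i + 2);
            if (pair.equals("<<")) {
                depth++;
                i++; // consume both characters of the delimiter
            } else if (pair.equals(">>")) {
                depth--;
                i++;
            }
        }
        return depth;
    }

    /** Joins the streams in order (newline separator is an assumption). */
    static String joinStreams(List<String> streams) {
        return String.join("\n", streams);
    }

    public static void main(String[] args) {
        // Shortened from the excerpt: the second dictionary opens in
        // stream 1 and closes in stream 2.
        List<String> streams = Arrays.asList(
            "/Span <</Lang (en-GB)/MCID 8 >>BDC\nEMC\n/Span <</Lang",
            "(en-GB)/MCID 9 >>BDC\nEMC");
        for (String s : streams) {
            System.out.println("separate: " + unclosedDictionaries(s));
        }
        System.out.println("joined:   "
            + unclosedDictionaries(joinStreams(streams)));
    }
}
```

Run as written, it reports 1 and -1 for the two streams parsed
separately and 0 for the joined content, i.e. the dictionary only
balances once the streams are read as one.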

Hope it helps someone else!

Best Wishes,
Malcolm



On 2017-11-09 14:05, Malcolm Vincent <m...@gmail.com> wrote:
> Hi,
>
> After more testing I can confirm that the issue occurs when PDFBox is
> parsing a stream and a token is split across that stream and the
> next one,
>
> i.e. the whole token does not occur in the stream being parsed.
>
> Perhaps there is a way to get all the tokens in the page content, so that
> PDFBox reads the streams as necessary, rather than using the individual
> streams the way I am doing at the moment.
>
> In this excerpt you can clearly see where the COSDictionary is split
> across the stream boundary:
>
> /Span <</Lang (en-GB)/MCID 8 >>BDC
> BT
> 9 0 0 9 99.3376 555.6879 Tm
> (Some text)Tj
> ET
> EMC
> /Span <</Lang
> endstream
> endobj
> 19 0 obj
> <<
> /Length 2852
> >>
> stream
> (en-GB)/MCID 9 >>BDC
> BT
> 9 0 0 9 145.7323 555.6879 Tm
> (Some more text)Tj
> ET
> EMC
>
>
>
> Best Wishes,
> Malcolm.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>


