pdfbox-users mailing list archives

From Malcolm Vincent <malcolmvinc...@gmail.com>
Subject Re: Stream parsing issue in multi-stream page
Date Mon, 05 Feb 2018 20:49:04 GMT
I had to do something similar recently myself. Don't. It doesn't work. You
have to read the page as one stream.

Cheers!
Malcolm.
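
For the detection question raised further down the thread, here is a minimal sketch, not a tested solution. It uses only the JDK and assumes you already have each content stream as a byte[] (for example from pdstream.toByteArray() in the quoted code). StreamBoundaryCheck, feed and mergeSpanning are hypothetical names, not PDFBox API. It tracks unclosed "[" and "(" across streams and merges only the adjacent streams that share an open array, so the remaining streams keep their original boundaries. It deliberately ignores dictionaries ("<<"), hex strings and inline image data (BI ... ID ... EI), which a real implementation would also have to handle.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): finds arrays that span content streams. */
final class StreamBoundaryCheck {
    private int arrayDepth;   // number of '[' not yet closed, carried across streams
    private int stringDepth;  // nesting of '(' inside a literal string
    private boolean escape;   // previous byte was a backslash inside a string

    /** Scans one stream; returns true if an array or string is still open at its end. */
    boolean feed(byte[] data) {
        boolean comment = false; // a '%' comment is assumed to end with the stream
        for (byte b : data) {
            char c = (char) (b & 0xFF);
            if (stringDepth > 0) {
                // Inside a literal string: brackets are plain data, parentheses nest.
                if (escape) {
                    escape = false;
                } else if (c == '\\') {
                    escape = true;
                } else if (c == '(') {
                    stringDepth++;
                } else if (c == ')') {
                    stringDepth--;
                }
            } else if (comment) {
                comment = (c != '\n' && c != '\r');
            } else if (c == '%') {
                comment = true;
            } else if (c == '(') {
                stringDepth = 1;
            } else if (c == '[') {
                arrayDepth++;
            } else if (c == ']' && arrayDepth > 0) {
                arrayDepth--;
            }
        }
        return arrayDepth > 0 || stringDepth > 0;
    }

    /** Concatenates adjacent streams (newline-separated) until nothing is left open. */
    static List<byte[]> mergeSpanning(List<byte[]> streams) {
        List<byte[]> groups = new ArrayList<>();
        StreamBoundaryCheck check = new StreamBoundaryCheck();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        for (byte[] s : streams) {
            current.write(s, 0, s.length);
            current.write('\n'); // whitespace separator between the original streams
            if (!check.feed(s)) {
                groups.add(current.toByteArray());
                current.reset();
            }
        }
        if (current.size() > 0) {
            groups.add(current.toByteArray()); // trailing streams with an unclosed object
        }
        return groups;
    }

    public static void main(String[] args) {
        // The two streams from the example in this thread.
        List<byte[]> streams = Arrays.asList(
            "/F1 10 Tf\nBT\n40 764.138 Td\n0 -12.138 Td\n[\n".getBytes(StandardCharsets.ISO_8859_1),
            "(CD) ] TJ\nET\n".getBytes(StandardCharsets.ISO_8859_1));
        List<byte[]> groups = mergeSpanning(streams);
        System.out.println(streams.size() + " streams -> " + groups.size() + " group(s)");
    }
}
```

Each merged group could then be passed to PDFStreamParser as a byte[], the same way the quoted code passes a single stream. Operands and their operator landing in different streams are not merged here, because the individual tokens still parse correctly in that case; only composite objects such as the array in the example are affected.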


On 5 February 2018 at 18:30, Esteban R <eruiz0@hotmail.com> wrote:

> I need to analyze the distribution of contents in the different streams (I
> cannot provide additional details due to a confidentiality agreement).
> Then I may need to change some content in the streams and rewrite them. I
> also wanted to preserve the original structure of (many) streams, but it is
> not a hard requirement.
>
> Esteban
> ________________________________
> From: Maruan Sahyoun <sahyoun@fileaffairs.de>
> Sent: Monday, 5 February 2018 04:19 p.m.
> To: Esteban R
> Subject: Re: Stream parsing issue in multi-stream page
>
> Hi,
> > On 05.02.2018 at 17:14, Esteban R <eruiz0@hotmail.com> wrote:
> >
> > Thanks for your answer. But I really need to process the streams one by
> one (a special requirement in my project).
>
> Could you explain why this is the case? It is possible that tokens are
> spanning streams - so if you process one by one the parser wouldn't know
> about the continuation. So the result you posted initially is fine from
> that perspective.
>
> BR
> Maruan
>
> >
> > Anyways, your answer gave me an idea for detecting the issue: I can
> compare the tokens for the individual streams with the tokens from
> pdPage.getContents().... double processing, but still useful.
> >
> > Any other ideas are welcome.
> >
> > Esteban
> > From: Maruan Sahyoun <sahyoun@fileaffairs.de>
> > Sent: Monday, 5 February 2018 03:25 p.m.
> > To: users@pdfbox.apache.org
> > Subject: Re: Stream parsing issue in multi-stream page
> >
> > Hi,
> >
> >
> >
> > > On 05.02.2018 at 15:43, Esteban R <eruiz0@hotmail.com> wrote:
> > >
> > > Hello. I need to rewrite a PDPage with many streams, one by one
> (making some transformations, and there is a special need to do it one
> stream at a time). Parsing (and pdfdebug) returns "wrong" tokens if one
> command begins at the end of the first stream and ends at the beginning of
> the next one. I'm using pdfbox-2.0.8.
> > >
> > > Rewriting the stream with those tokens produces a corrupted page.
> > > How could we re-write the page without getting a corrupted page?
> > > Or, at least, how can we detect this kind of failure (or this one)?
> > >
> > > Please find a simplified example here:
> > > http://www.filedropper.com/out3unc
> > >
> > > The first stream is:
> > > /F1 10 Tf
> > > BT
> > > 40 764.138 Td
> > > 0 -12.138 Td
> > > [
> > >
> > > and the second one is:
> > > (CD) ] TJ
> > > ET
> > >
> > > In this case, running the following code:
> > >        Iterator<PDStream> itStreams = pdPage.getContentStreams();
> > >        while (itStreams.hasNext()) {
> > >            PDStream pdstream = itStreams.next();
> > >            PDFStreamParser parser = new PDFStreamParser(pdstream.toByteArray());
> > >            parser.parse();
> > >            List<Object> tokens = parser.getTokens();
> > >            for (Object token: tokens){
> > >                System.out.println("Token: "+token);
> > >            }
> > >        }
> > >
> >
> > Instead of using pdPage.getContentStreams() and parsing each stream
> individually, use pdPage.getContents() and read all content into a byte[].
> You can then pass that to PDFStreamParser.
> >
> > That will give you this output
> >
> > Token: COSName{F1}
> > Token: COSInt{10}
> > Token: PDFOperator{Tf}
> > Token: PDFOperator{BT}
> > Token: COSInt{40}
> > Token: COSFloat{764.138}
> > Token: PDFOperator{Td}
> > Token: COSInt{0}
> > Token: COSFloat{-12.138}
> > Token: PDFOperator{Td}
> > Token: COSArray{[COSString{CD}]}
> > Token: PDFOperator{TJ}
> > Token: PDFOperator{ET}
> >
> > BR
> > Maruan
> >
> >
> > > shows:
> > > Token: COSName{F1}
> > > Token: COSInt{10}
> > > Token: PDFOperator{Tf}
> > > Token: PDFOperator{BT}
> > > Token: COSInt{40}
> > > Token: COSFloat{764.138}
> > > Token: PDFOperator{Td}
> > > Token: COSInt{0}
> > > Token: COSFloat{-12.138}
> > > Token: PDFOperator{Td}
> > > Token: COSArray{[]}                    !!!!! empty array detected, end of first stream
> > > Token: COSString{CD}                 !!!!! beginning of second stream
> > > Token: COSNull{}                         !!!!! closing "]"
> > > Token: PDFOperator{TJ}
> > > Token: PDFOperator{ET}
> > >
> > >
> > > Esteban
> >
> >
>
>
