pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Problems Using PDFBox To Manually Track TextPosition
Date Sun, 16 Aug 2015 03:29:29 GMT

> On 14 Aug 2015, at 17:06, John Walker <johnw@newconceptsdev.com> wrote:
> 
> Hello,
> 
> 
> 
> I'm using PDFBox to parse the content stream for a page in a PDF.  Based on
> the list of operations, there are two lines of text that I expect to be in
> very different places on the page vertically.  However, when the page is
> displayed in Sumatra or Acrobat, this text is vertically aligned.

I’d recommend subclassing PDFStreamEngine if you want to hook into the PDF operators, specifically
showTextString(s) and associated methods, such as showGlyph.

Parsing the stream yourself brings many challenges.

> 
> The method I'm using to predict text position has been accurate in the past.
> I'm not sure if the method is faulty, or if I'm mis-understanding the
> operation list I'm getting from PDFBox.
> 
> 
> 
> Here is the list of operations, with annotations explaining how I think they
> should impact vertical position of text cursor: 
> 
> 
> 
> http://pastebin.com/GUWWX3Kv
> 
> 
> 
> As you can see, I'm basically only moving my model of the cursor in reaction
> to Tm's and Td's.  (TJ's aren't relevant because text is horizontal and the
> y position is the one I'm tracking.)   I also ignored the cm, because
> there's a Tm right after it.

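You’re definitely misunderstanding the operators. Tm doesn’t simply set the x and y values on
the page: it replaces the text matrix (and the text line matrix) with the matrix you give it,
and a later Td translates relative to that matrix, so its offsets are scaled by it. In
addition, the graphics state itself can be saved/restored via the q and Q operators. You’ll
also need to take the CTM into account (that’s the cm operator), because text space is mapped
through the CTM to get the final position on the page.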
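
To give a feel for what manual tracking involves, here is a minimal sketch of the bookkeeping
for q, Q, cm, BT, Tm and Td only. TextCursorModel and its method names are hypothetical (they
are not PDFBox classes), and the sketch assumes the operands have already been parsed into
numbers. It ignores TD, T*, TL, the quote operators, text rise, font size and glyph advances,
so treat it as an illustration rather than a complete model.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper (not a PDFBox class): models only q, Q, cm, BT, Tm and Td.
final class TextCursorModel {
    private AffineTransform ctm = new AffineTransform();        // starts as identity
    private AffineTransform textMatrix = new AffineTransform(); // Tm
    private AffineTransform lineMatrix = new AffineTransform(); // Tlm
    private final Deque<AffineTransform> savedCtms = new ArrayDeque<>();

    // q: push a copy of the CTM (a real engine saves the whole graphics state).
    void saveGraphicsState() { savedCtms.push(new AffineTransform(ctm)); }

    // Q: pop the most recently saved CTM, if any.
    void restoreGraphicsState() { if (!savedCtms.isEmpty()) ctm = savedCtms.pop(); }

    // cm: the new matrix is applied first, then the existing CTM.
    void concatenateMatrix(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    // BT: both text matrices are reset to identity.
    void beginText() {
        textMatrix = new AffineTransform();
        lineMatrix = new AffineTransform();
    }

    // Tm: replaces the text matrix and the text line matrix.
    void setTextMatrix(double a, double b, double c, double d, double e, double f) {
        lineMatrix = new AffineTransform(a, b, c, d, e, f);
        textMatrix = new AffineTransform(lineMatrix);
    }

    // Td: translates the text line matrix in text space, then copies it to Tm.
    void moveText(double tx, double ty) {
        lineMatrix.translate(tx, ty);
        textMatrix = new AffineTransform(lineMatrix);
    }

    // Page-space position of the text origin: text space -> Tm -> CTM.
    double[] pageOrigin() {
        AffineTransform textToPage = new AffineTransform(ctm);
        textToPage.concatenate(textMatrix);
        Point2D p = textToPage.transform(new Point2D.Double(0, 0), null);
        return new double[] { p.getX(), p.getY() };
    }

    public static void main(String[] args) {
        TextCursorModel model = new TextCursorModel();
        model.saveGraphicsState();
        model.concatenateMatrix(1, 0, 0, 1, 0, 100); // cm: shift everything up by 100
        model.beginText();
        model.setTextMatrix(2, 0, 0, 2, 50, 600);    // Tm with a scale of 2
        model.moveText(0, -20);                      // Td: offset is scaled by Tm
        System.out.println("page y = " + model.pageOrigin()[1]);
    }
}
```

In this example the Tm alone puts the text origin at page y = 700 (600 plus the cm offset of
100), and the following Td of 0 -20 moves it to 660 rather than 680, because the Td offset is
scaled by the text matrix. A model that ignores cm or treats Td as page units will drift.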

Anyway, don’t do that, use PDFStreamEngine instead.

— John

> 
> Am I mis-interpreting the PDF Operators (as I suspect)?  Is there any
> potential that this is a PDFBox issue?  
> 
> 
> 
> Thanks in advance!
> 
> 
> 
> -John 
> 

