pdfbox-users mailing list archives

From Andreas Lehmkühler <andr...@lehmi.de>
Subject Re: Is this a bug in PDFStreamEngine?
Date Tue, 06 May 2014 10:10:47 GMT

> Joel Hirsh <joelehirsh@gmail.com> wrote on 4 May 2014 at 21:03:
>  I am using PDFTextStripper and getting odd results on some strings that I
> tracked down to something that I think may be a bug in PDFStreamEngine.
> The PDF file has some text that looks like "1234" in Acrobat, but comes
> through as "1 2 3 4" from PDFTextStripper.  The logic in PDFTextStripper is
> putting in spaces because of a large inter-character spacing.
> Tracing it down, the PDF file has a 'Tc' (spacing operator) followed by a
> 'Tm' (matrix operator) with a scale of 8.  Other PDF files that I could
> find with 'Tc' operators had the 'Tc' after the matrix operator.
Both parameters are optional, so their usage may be completely different from
one PDF to another.

> What strikes me as incorrect is that PDFStreamEngine does not distinguish
> between a 'Tc' followed by 'Tm' versus a 'Tm' followed by 'Tc' .  In either
> case the spacing in the 'Tc' is multiplied by the scale factor in the
> matrix.   There is nothing in the Adobe PDF spec that specifically
> addresses order of transforms, but in normal mathematics there is a big
> difference.  And in the case that looks incorrect, the spacing is being
> multiplied by the scale in the matrix, and the results would be more like
> Acrobat if it didn't.
I guess there is a misunderstanding. Neither operator does any calculation;
they just set/replace some values. Other operators like 'Tj' use those values
for calculations, so the order of 'Tc' and 'Tm' isn't relevant.
In your case it's a simple scaling by scalar values, which is a commutative
operation, so the order of the operands doesn't matter.
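
To illustrate, here is a minimal sketch of that behaviour. It is not PDFBox
code: TextStateSketch, TextState, tc, tm and spacingForTj are hypothetical
names, and the text matrix is reduced to a single scale factor. It only
shows that when 'Tc' and 'Tm' merely store values and the multiplication
happens later at 'Tj', both orderings give the same spacing.

```java
// Hypothetical model of the text state; NOT PDFBox's actual classes.
final class TextStateSketch {

    static final class TextState {
        double characterSpacing = 0.0; // value stored by 'Tc'
        double matrixScale = 1.0;      // scale taken from 'Tm' (simplified to a scalar)
    }

    // 'Tc' only replaces the stored character spacing.
    static void tc(TextState state, double spacing) {
        state.characterSpacing = spacing;
    }

    // 'Tm' only replaces the stored text matrix (here: its scale).
    static void tm(TextState state, double scale) {
        state.matrixScale = scale;
    }

    // The calculation happens when text is shown ('Tj'), using whatever
    // values are stored at that moment.
    static double spacingForTj(TextState state) {
        return state.characterSpacing * state.matrixScale;
    }

    static double tcThenTm(double spacing, double scale) {
        TextState state = new TextState();
        tc(state, spacing);
        tm(state, scale);
        return spacingForTj(state);
    }

    static double tmThenTc(double spacing, double scale) {
        TextState state = new TextState();
        tm(state, scale);
        tc(state, spacing);
        return spacingForTj(state);
    }

    public static void main(String[] args) {
        // Example values are assumptions: spacing 0.5, matrix scale 8.
        System.out.println(tcThenTm(0.5, 8.0));
        System.out.println(tmThenTc(0.5, 8.0));
    }
}
```

Both calls yield the same value, because the stored state is identical by
the time 'Tj' reads it.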

> Can someone who might have more knowledge about PDFStreamEngine/
> PDFTextStripper comment on this?  The code that does the multiply is in
> PDFStreamEngine.processEncodedText when it is operating on the value in
> characterSpacingText.
Can you share the pdf with us, so that we can have a look to see what might be

> Thanks

Andreas Lehmkühler
