pdfbox-users mailing list archives

From Joel Hirsh <joelehi...@gmail.com>
Subject Is this a bug in PDFStreamEngine?
Date Sun, 04 May 2014 19:03:19 GMT
I am using PDFTextStripper and getting odd results on some strings. I
tracked them down to something that I think may be a bug in PDFStreamEngine.

The PDF file has some text that displays as "1234" in Acrobat, but comes
through as "1 2 3 4" from PDFTextStripper.  PDFTextStripper inserts the
spaces because it sees a large inter-character spacing.
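
To illustrate the kind of effect I mean, here is a minimal sketch of a gap-based space heuristic. It is not PDFTextStripper's actual code; the class name, method name, and threshold are all hypothetical, and the real logic is more involved. It only shows how an inflated gap between glyphs turns "1234" into "1 2 3 4".

```java
public class SpaceHeuristicSketch {
    /**
     * Joins glyphs into a string, inserting a space wherever the gap
     * before a glyph exceeds the threshold. gaps[i] is the gap between
     * glyphs[i] and glyphs[i + 1]. All names here are hypothetical.
     */
    static String joinGlyphs(String[] glyphs, double[] gaps, double threshold) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0 && gaps[i - 1] > threshold) {
                sb.append(' ');
            }
            sb.append(glyphs[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] glyphs = {"1", "2", "3", "4"};
        // Small gaps: the digits stay together.
        System.out.println(joinGlyphs(glyphs, new double[] {0.5, 0.5, 0.5}, 1.0));
        // Large gaps: a space is inserted between every pair of digits.
        System.out.println(joinGlyphs(glyphs, new double[] {4.0, 4.0, 4.0}, 1.0));
    }
}
```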

Tracing it down, the PDF file has a 'Tc' (character spacing operator)
followed by a 'Tm' (text matrix operator) with a scale of 8.  The other PDF
files I could find with 'Tc' operators had the 'Tc' after the matrix operator.

What strikes me as incorrect is that PDFStreamEngine does not distinguish
between a 'Tc' followed by a 'Tm' and a 'Tm' followed by a 'Tc'.  In either
case the spacing from the 'Tc' is multiplied by the scale factor in the
matrix.  I could find nothing in the Adobe PDF spec that specifically
addresses the order of these operators, but in ordinary mathematics the
order in which transforms are applied makes a big difference.  In the case
that looks incorrect, the spacing is multiplied by the scale in the matrix,
and the results would be closer to Acrobat's if it were not.
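
To make the arithmetic concrete, here is a rough sketch of the two interpretations being contrasted. It is not the PDFStreamEngine code; the class and method names are hypothetical, and the 'Tc' value of 0.5 is made up for illustration (only the scale of 8 comes from the file in question).

```java
public class SpacingOrderSketch {
    /** Interpretation A: the Tc value is multiplied by the matrix scale. */
    static double scaledSpacing(double tc, double matrixScale) {
        return tc * matrixScale;
    }

    /** Interpretation B: the Tc value is used as-is, ignoring the matrix scale. */
    static double unscaledSpacing(double tc) {
        return tc;
    }

    public static void main(String[] args) {
        double tc = 0.5;          // hypothetical Tc operand
        double matrixScale = 8.0; // scale of the Tm matrix described above

        System.out.println("scaled:   " + scaledSpacing(tc, matrixScale));
        System.out.println("unscaled: " + unscaledSpacing(tc));
    }
}
```

With these numbers the scaled spacing is eight times the unscaled one, which is the sort of difference that could push a spacing value past a space-insertion threshold.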

Can someone with more knowledge of PDFStreamEngine/PDFTextStripper
comment on this?  The code that does the multiplication is in
PDFStreamEngine.processEncodedText, where it operates on the value in
characterSpacingText.

Thanks
