pdfbox-users mailing list archives

From kirillkh <kiril...@gmail.com>
Subject PDFTextStripper: space characters inside words
Date Tue, 07 Jun 2011 03:59:40 GMT
Hi,

I've run into two issues with PDFTextStripper and found (imperfect)
workarounds for both. Could one of the maintainers please take a look at
the issues and at my patch (which is admittedly pretty hackish)?
The patch is based off trunk, but I only tested it with PDFBox 1.5.0.
https://github.com/kirillkh/pdfbox/commit/9a23c3956a96c276dfc677a0862c6954661b6d6a

1. With the attached document (I hope the mailing list accepts it; if not,
contact me and I'll send it to you directly), I'm seeing spaces
interspersed inside certain words (e.g., in the title on the second page).
The document is in Hebrew (RTL), which may or may not matter.

While I don't know exactly what the code is doing, I got the impression that
the problem is caused by zero-width space characters. It looks like the
document was produced by software that incorrectly set the width of every
space character to 0 and also inserted such characters at random places
inside the document. (Does that make any sense? In any case, that was my
impression.) I assume that a real PDF renderer just ignores such characters,
but PDFTextStripper outputs every one of them as text. I've managed to
modify the code so that these space characters are ignored (see the patch),
but chances are it is not the best solution.
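
To make the idea concrete, here is a minimal standalone sketch of the
filtering step described above. It is not the code from the patch and it
does not use PDFBox: Glyph is a hypothetical stand-in for a positioned
character with a width, and the 1e-4 tolerance is an assumption chosen only
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned glyph; NOT a PDFBox class.
final class Glyph {
    final String text;
    final float width;

    Glyph(String text, float width) {
        this.text = text;
        this.width = width;
    }
}

final class ZeroWidthSpaceFilter {
    // Assumed tolerance below which a width is treated as zero.
    private static final float EPSILON = 1e-4f;

    // Returns a copy of the list without whitespace glyphs of (near) zero width.
    static List<Glyph> filter(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<Glyph>();
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean zeroWidth = Math.abs(g.width) < EPSILON;
            if (!(isSpace && zeroWidth)) {
                kept.add(g);
            }
        }
        return kept;
    }

    // Concatenates the glyph texts in list order.
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

With the glyphs "a", a zero-width " ", "b", a " " of width 2.5, and "c",
filter() followed by join() yields "ab c": the bogus space is dropped and
the real one survives.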

2. (RTL-specific) After working around the main issue, I ran into
another one. In some cases, the zero-width space characters coincided with
word boundaries; since I removed them, PDFTextStripper fell back to using the
average character width to determine word boundaries. This resulted in
special WordSeparator positions being inserted where the spaces used to be.
The problem is that the PDFTextStripper.normalize() method, for some reason,
splits the text on these word boundaries (instead of on line boundaries) to
perform visual-to-logical reordering. For some lines, this reverses the word
order: the characters inside each word are in the correct order, but the
words themselves come out in reverse.
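
The symptom can be reproduced with a simplified model. The sketch below is
not the PDFBox implementation: it assumes a purely RTL line stored in visual
order, uses plain string reversal as a stand-in for real bidi reordering,
and uses Latin letters as placeholders for Hebrew ones. ReorderSketch and
its methods are hypothetical names.

```java
final class ReorderSketch {
    // Plain reversal, standing in for visual-to-logical reordering of RTL text.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Reorders each word on its own and keeps the words in their visual order.
    static String perWord(String visualLine) {
        String[] words = visualLine.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(reverse(words[i]));
        }
        return out.toString();
    }

    // Reorders the whole line as a single unit.
    static String perLine(String visualLine) {
        return reverse(visualLine);
    }
}
```

For the visual-order line "CBA FED", whose logical order is "DEF ABC",
perLine() returns "DEF ABC", while perWord() returns "ABC DEF": each word
reads correctly, but the words are in reverse order, which matches what I
saw in the output.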

I solved this by outputting a space character for every WordSeparator
encountered by normalize(). Again, this worked for me with this document,
but I'm not sure that is the right way to go.


-Kirill
