pdfbox-users mailing list archives

From Joel Hirsh <joelehi...@gmail.com>
Subject Processing PDF file with underscores and identifying bold text
Date Fri, 06 Jan 2017 16:26:58 GMT
I have a PDF file that uses embedded underscores to identify headers.  It
also uses lots of zero-length spaces, which confuses things further.
So if a period represents a zero-length space, I might get back a string
from PDFBox text parsing that looks something like this:

n.orm.al. _H.E__A.D_E_.R

where there is 'normal' text and 'header' text in the same string. It is
pretty ugly, but that's what there is.

I can scan that correctly, but I would like to identify the header text as
such, and treat it as equivalent to bold text.  I was looking into a way to
do that with TextPosition, but since the class is final there is no way to
add a field to hold that piece of information. It is not a flag to apply to
the whole string, just to the characters that are underlined.  Could you
perhaps suggest an elegant way to do that?
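
To make the goal concrete, here is a rough sketch of the kind of per-character
marking I have in mind, written in plain Java with no PDFBox types. It keeps
the flags in a boolean array parallel to the cleaned text instead of adding a
field to TextPosition. The names HeaderMarker, Result and mark are made up for
illustration. It also makes two assumptions: that the zero-length space arrives
as U+200B, and that any whitespace-delimited word containing at least one
underscore is a header. The second is only a guess at the rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class HeaderMarker {
    // Assumption: the zero-length space shows up as U+200B in the extracted text.
    static final char ZERO_WIDTH_SPACE = '\u200B';

    // Cleaned text plus one flag per character of that text.
    static final class Result {
        final String text;
        final boolean[] header;

        Result(String text, boolean[] header) {
            this.text = text;
            this.header = header;
        }
    }

    private static boolean isSeparator(char c) {
        return Character.isWhitespace(c) && c != ZERO_WIDTH_SPACE;
    }

    // Assumed rule: a word that contains any underscore is header text.
    static Result mark(String raw) {
        StringBuilder text = new StringBuilder();
        List<Boolean> flags = new ArrayList<>();
        int i = 0;
        while (i < raw.length()) {
            // Find the end of the current whitespace-delimited word.
            int end = i;
            while (end < raw.length() && !isSeparator(raw.charAt(end))) {
                end++;
            }
            String word = raw.substring(i, end);
            boolean isHeader = word.indexOf('_') >= 0;
            // Copy the word, dropping underscores and zero-length spaces.
            for (int k = 0; k < word.length(); k++) {
                char c = word.charAt(k);
                if (c != '_' && c != ZERO_WIDTH_SPACE) {
                    text.append(c);
                    flags.add(isHeader);
                }
            }
            // Keep the separator itself, never flagged as header.
            if (end < raw.length()) {
                text.append(raw.charAt(end));
                flags.add(false);
                end++;
            }
            i = end;
        }
        boolean[] header = new boolean[flags.size()];
        for (int k = 0; k < header.length; k++) {
            header[k] = flags.get(k);
        }
        return new Result(text.toString(), header);
    }
}
```

In a real PDFBox program the same idea would presumably become a side table
keyed by the TextPosition objects themselves (for example an IdentityHashMap
filled in from an overridden PDFTextStripper.writeString), but I have not
tried that.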

Thanks
