pdfbox-users mailing list archives

From Michael Kuß <michael.k...@mtc-berlin.com>
Subject RE: PDFBox - Does the PDF file version matter
Date Tue, 04 Feb 2014 13:20:31 GMT
Hi Karen,

first, the PDF format is not designed for getting text back out. It is not an editable format
like plain text or Word; its focus is on displaying content.
Text in a PDF file is like a cloud of points scattered over a white page. You have to put
the characters (if available) into the correct order and insert spaces where needed. PDFBox
does this to some extent.
But what looks like text in, e.g., Acrobat Reader is not necessarily "text"; it can also be a
graphic.
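
To make the "cloud of points" idea concrete, here is a minimal sketch of how positioned
characters could be put back into reading order, with spaces inserted at larger gaps. It does
not use PDFBox and is not how PDFBox implements it; the Glyph class, the toText method, and
the lineTolerance and gapFactor parameters are all hypothetical names made up for this
illustration. It assumes that y grows downward and that each glyph has a known width.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphOrdering {

    /** Hypothetical glyph: one character plus its position and width on the page. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;
        final double width;

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from unordered glyphs.
     * lineTolerance: maximum y distance from the first glyph of a line to stay on that line.
     * gapFactor: a gap wider than gapFactor times the previous glyph width becomes a space.
     */
    static String toText(List<Glyph> glyphs, double lineTolerance, double gapFactor) {
        // Sort top to bottom (assumption: y grows downward).
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Group glyphs with similar y values into lines.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last == null || Math.abs(g.y - last.get(0).y) > lineTolerance) {
                last = new ArrayList<>();
                lines.add(last);
            }
            last.add(g);
        }

        // Within each line, sort left to right and insert spaces at large gaps.
        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x - (prev.x + prev.width) > gapFactor * prev.width) {
                    out.append(' ');
                }
                out.append(g.ch);
                prev = g;
            }
        }
        return out.toString();
    }
}
```

Both thresholds are guesses that depend on the font and on the converter that wrote the file,
which is one reason why the extracted text can differ from one PDF to the next.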

So, to your problem. Different PDF converters handle the positioning of text during
conversion in different ways.
Some produce just a graphic that represents the printed result of, e.g., a Word document
as a PDF file.
Some produce a PDF with text included. This text may or may not contain spaces, and it
may or may not be correctly positioned.
Converters mostly try to produce an accurate representation from a layout point of view.
Getting content back out of the PDF file is not the focus; PDF is not designed for that.
If you use two different PDF converters, the text extracted with PDFBox may differ.
So if you must extract text from a PDF file with specific positioning, you need to take more
intelligent steps:
parse for known words, or extend the framework to parse just a specific position.
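
As a rough illustration of the "parse for known words" route, the sketch below searches
already extracted text for a known phrase while ignoring how the converter happened to
space it. It uses only java.util.regex; the class KnownWordFinder and its methods are
hypothetical names for this example, not part of PDFBox. For the position-based route,
PDFBox ships a PDFTextStripperByArea class, which is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownWordFinder {

    /**
     * Builds a case-insensitive pattern that matches the phrase with any amount
     * of whitespace (including none) between its non-whitespace characters.
     */
    static Pattern spacingTolerant(String phrase) {
        StringBuilder regex = new StringBuilder();
        for (char c : phrase.toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue; // spacing in the phrase itself is covered by \s* below
            }
            if (regex.length() > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns the start index of the first match in the extracted text, or -1 if none. */
    static int indexOf(String extracted, String phrase) {
        Matcher m = spacingTolerant(phrase).matcher(extracted);
        return m.find() ? m.start() : -1;
    }
}
```

For example, KnownWordFinder.indexOf("Total:InvoiceNumber 4711", "Invoice Number") finds the
label even though the extracted text lost the space. The price is that such a loose pattern
can also match across word boundaries, so it suits distinctive labels better than short words.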

To get an idea of how the PDF format came about, have a look here:
http://en.wikipedia.org/wiki/Pdf

Hope this helps somehow.

Kind regards,
  Micha
