pdfbox-users mailing list archives

From Karen Lindholm <karen.lindh...@gmail.com>
Subject Re: PDFBox - Does the PDF file version matter
Date Tue, 04 Feb 2014 13:27:45 GMT
Thanks Micha for the explanation. I will try looking for the words
preceding the text I want to extract. I appreciate your assistance and will
let you know if I am successful.
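
A minimal sketch of that approach, assuming the text has already been extracted into a plain string (for example by PDFBox's text stripper). The class name AnchorExtractor, the method valueAfter, and the sample label "Invoice No:" are hypothetical names chosen for illustration; none of them come from PDFBox. Because different converters may drop or add spaces, the label is matched with optional whitespace between its words.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): finds the text that follows a known label.
final class AnchorExtractor {

    /**
     * Returns the rest of the line after the first occurrence of the label,
     * trimmed, or an empty Optional if the label is not found.
     */
    static Optional<String> valueAfter(String extractedText, String label) {
        // Quote each word of the label and allow any amount of whitespace
        // (including none) between the words.
        StringBuilder regex = new StringBuilder();
        for (String word : label.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(word));
        }
        // Capture everything up to the end of the line; '.' stops at line breaks.
        regex.append("[ \\t]*(.*)");

        Matcher m = Pattern.compile(regex.toString()).matcher(extractedText);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample text in which the converter dropped the space inside the label.
        String text = "ACME Corp\nInvoiceNo: 12345\nTotal: 99.00\n";
        System.out.println(valueAfter(text, "Invoice No:").orElse("(not found)"));
    }
}
```

This only works when the label survives extraction as real text. If the converter rendered the page as a graphic, there is nothing to match.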


On Tue, Feb 4, 2014 at 8:20 AM, Michael Kuß <michael.kuss@mtc-berlin.com> wrote:

> Hi Karen,
>
> First, the PDF format is not designed for getting text back out. It is not
> an editable format like plain text or Word; it is focused on displaying
> content.
> Text in a PDF file is like a cloud of points scattered over white space.
> You have to put the characters (if available) in the correct order and
> insert spaces where needed. PDFBox does this to some extent.
> But what you see as text in, e.g., Acrobat Reader is not necessarily
> "text"; it can also be a graphic.
>
> So, to your problem. Different PDF converters handle the positioning of
> text during conversion in different ways.
> Some produce just a graphic that represents the printed result of,
> e.g., a Word document as a PDF file.
> Some produce a PDF with text included. This text may come with or without
> spaces, and it may or may not be correctly positioned.
> Converters mostly try to produce an accurate representation from a layout
> point of view. The focus is not on getting content back out of the PDF
> file; PDF is not designed for that.
> If you use two different PDF converters, the text extracted with PDFBox
> may differ.
> Thus, if you must extract text from a PDF file with specific positioning,
> you have to take more intelligent steps:
> parse for known words, or extend the framework to parse just a specific
> position.
>
> To get an idea of how the PDF format came about, have a look here:
> http://en.wikipedia.org/wiki/Pdf
>
> Hope this helps somehow.
>
> Kind regards,
>   Micha
>
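
Micha's picture of text as a cloud of positioned characters can be sketched as follows. This is an illustration only, not how PDFBox is implemented: the Glyph class, the GlyphLineBuilder name, and the two thresholds are hypothetical, and y is assumed to grow down the page. Glyphs are grouped into lines by their y coordinate, ordered by x within each line, and a space is inserted wherever the horizontal gap between neighbours exceeds a threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration (not PDFBox code): rebuilds text from positioned glyphs.
final class GlyphLineBuilder {

    /** A single character with its position; a stand-in for what a PDF parser reports. */
    static final class Glyph {
        final char ch;
        final double x, y, width;

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Orders glyphs into lines (assumption: y grows downward) and inserts spaces at wide gaps. */
    static String toText(List<Glyph> glyphs, double lineTolerance, double spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.y));

        StringBuilder out = new StringBuilder();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : sorted) {
            // Start a new line when the glyph is too far below the current line's first glyph.
            if (!line.isEmpty() && Math.abs(g.y - line.get(0).y) > lineTolerance) {
                appendLine(out, line, spaceGap);
                line.clear();
            }
            line.add(g);
        }
        if (!line.isEmpty()) {
            appendLine(out, line, spaceGap);
        }
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Glyph> line, double spaceGap) {
        line.sort(Comparator.comparingDouble(g -> g.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        Glyph prev = null;
        for (Glyph g : line) {
            // A gap wider than spaceGap between neighbours is treated as a word break.
            if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                out.append(' ');
            }
            out.append(g.ch);
            prev = g;
        }
    }

    public static void main(String[] args) {
        // Glyphs arrive in arbitrary order, as they may in a PDF content stream.
        List<Glyph> cloud = Arrays.asList(
                new Glyph('P', 14, 10, 6), new Glyph('k', 5, 30, 5),
                new Glyph('H', 0, 10, 6), new Glyph('F', 26, 10, 6),
                new Glyph('i', 6, 10.4, 3), new Glyph('o', 0, 30, 5),
                new Glyph('D', 20, 10, 6));
        System.out.println(toText(cloud, 2.0, 2.0));
    }
}
```

The thresholds are the weak point: a converter that spaces characters differently needs different values, which is one way the same document can yield different extracted text.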
