pdfbox-users mailing list archives

From Karen Lindholm <karen.lindh...@gmail.com>
Subject Re: PDFBox - Does the PDF file version matter
Date Tue, 04 Feb 2014 14:11:02 GMT
Micha, Thank you, thank you, thank you! It is finally working. I am very
grateful.


On Tue, Feb 4, 2014 at 8:27 AM, Karen Lindholm <karen.lindholm@gmail.com> wrote:

> Thanks Micha for the explanation. I will try looking for the words
> preceding the text I want to extract. I appreciate your assistance and will
> let you know if I am successful.
>
>
> On Tue, Feb 4, 2014 at 8:20 AM, Michael Kuß <michael.kuss@mtc-berlin.com> wrote:
>
>> Hi Karen,
>>
>> first, the PDF format is not designed for getting text back out. It is
>> not an editable format like plain text or Word; it is focused on
>> displaying the content.
>> Text in a PDF file is like a cloud of points scattered over a white
>> page. You have to put the characters (if available) in the correct order
>> and insert spaces where needed. PDFBox does this to some extent.
>> But what looks like text in, e.g., Acrobat Reader is not necessarily
>> "text"; it can also be a graphic.
>>
>> So, to your problem. Different PDF converters handle the positioning of
>> text during a PDF conversion in different ways.
>> Some produce just a graphic that represents the printed result of,
>> e.g., a Word document as a PDF file.
>> Some produce a PDF with text included. This text may come with or
>> without spaces, and it may or may not be correctly positioned.
>> Converters mostly try to produce an accurate representation from a
>> layout point of view. The focus is not on getting content back out of
>> the PDF file; PDF is not designed for that.
>> If you use two different PDF converters, the text extracted with PDFBox
>> may differ.
>> Thus, if you must extract text with specific positioning from a PDF
>> file, you have to take more intelligent steps:
>> parse for known words, or extend the framework to parse just a specific
>> position.
>>
>> For background on how the PDF format came about, have a look here:
>> http://en.wikipedia.org/wiki/Pdf
>>
>> Hope this helps somehow.
>>
>> Kind regards,
>>   Micha
>>
>
>
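
The "cloud of points" remark in the quoted message can be made concrete with a small Java sketch. It is only an illustration under stated assumptions: the Glyph class, the toText method, and the two thresholds are hypothetical and are not PDFBox types or the algorithm PDFBox actually uses. The sketch assumes each character arrives with a position (y growing downward) and a width, groups characters into lines by their y coordinate, sorts each line by x, and inserts a space wherever the horizontal gap exceeds a threshold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphOrdering {

    /** Hypothetical positioned character; a stand-in, not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x, y, width;

        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds reading order from unordered glyphs.
     * lineTolerance: maximum y distance from the first glyph of a line.
     * spaceGap: horizontal gap above which a space is inserted.
     */
    static String toText(List<Glyph> glyphs, double lineTolerance, double spaceGap) {
        // Sort a copy top to bottom so the caller's list is left untouched.
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Group glyphs whose y is close to the first glyph of the current line.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current == null || Math.abs(g.y - current.get(0).y) > lineTolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(g);
        }

        // Within each line, order left to right and add spaces at large gaps.
        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                    out.append(' ');
                }
                out.append(g.text);
                prev = g;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("o", 14, 10.2, 5));
        glyphs.add(new Glyph("H", 0, 10.0, 5));
        glyphs.add(new Glyph("y", 9, 10.0, 5));
        glyphs.add(new Glyph("i", 5, 10.1, 2));
        System.out.println(toText(glyphs, 2.0, 1.5));
    }
}
```

The thresholds are the fragile part: two converters can place the same characters with different gaps, which is one reason the extracted text can differ from one PDF to the next.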

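The "parse for known words" suggestion, which is what worked in this thread, might look roughly like the following Java sketch. It assumes the text has already been extracted into a String (for example with PDFBox's PDFTextStripper) and only shows the search step. The class name AnchorTextExtractor, the method textAfter, and the sample labels are hypothetical and not part of PDFBox. Because extracted text may come with or without spaces, whitespace inside the known words is matched loosely.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorTextExtractor {

    /**
     * Returns the rest of the line that follows the anchor phrase, if any.
     * Whitespace between the anchor's words is optional, so "Invoice Number"
     * also matches "InvoiceNumber". An optional colon after the anchor is skipped.
     */
    static Optional<String> textAfter(String extracted, String anchor) {
        // Quote each word literally and allow any amount of whitespace between words.
        String[] words = anchor.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(words[i]));
        }
        // Skip spaces/tabs and an optional colon, then capture up to the end of the line.
        regex.append("[ \\t]*:?[ \\t]*([^\\r\\n]*)");

        Matcher m = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE)
                .matcher(extracted);
        if (m.find()) {
            String value = m.group(1).trim();
            if (!value.isEmpty()) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample text is made up for illustration.
        String extracted = "Customer: ACME\nInvoiceNumber: 12345\nTotal: 9.50";
        System.out.println(textAfter(extracted, "Invoice Number").orElse("(not found)"));
    }
}
```

Searching for the words that precede the wanted text avoids depending on exact coordinates, at the cost of relying on the label wording staying the same across documents.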