pdfbox-users mailing list archives

From Craig Ringer <ring...@ringerc.id.au>
Subject Re: How does PDFBox extract text from a PDF?
Date Tue, 10 Jul 2012 23:34:16 GMT
On 07/10/2012 10:10 PM, Jeremias Maerki wrote:
> On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
>> My first question is: how is text stored in a PDF? I think there are 2 ways
>> to store text in a PDF:
>> a) vector PDF: the PDF contains a line telling it to print a word in a
>> specific font on a specific location
There are actually two cases here:

(1) PDF text operators (BT, ET, Tj, etc.), which render (string) operands 
as text using a font; or
(2) Vector line drawing, using Bézier curves etc., to represent glyphs.

The former can be extracted by PDFBox. The latter, which is common in 
desktop publishing, needs OCR or special vector-to-font matching 
analysis and AFAIK cannot be processed by PDFBox.
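
To make the difference between the two cases concrete, here is a minimal sketch in plain Java (no PDFBox involved) that scans an already-decoded content stream for literal strings shown with Tj between BT and ET. The class and method names are hypothetical, and the parser is deliberately simplified: it ignores TJ arrays, hex strings and octal escapes, and it returns the raw character codes, which a real extractor would still have to map to Unicode through the font's encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; this is not how PDFBox is implemented.
class TjSketch {

    /** Returns the literal string operands of Tj operators found inside BT ... ET. */
    static List<String> extractShownStrings(String contentStream) {
        List<String> shown = new ArrayList<>();
        boolean inTextObject = false;
        String lastString = null; // most recent string operand, if it directly precedes the operator
        int i = 0;
        int n = contentStream.length();

        while (i < n) {
            char c = contentStream.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (c == '(') {
                // Literal string: balanced parentheses, backslash escapes the next character.
                StringBuilder sb = new StringBuilder();
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char d = contentStream.charAt(i);
                    if (d == '\\' && i + 1 < n) {
                        // Simplification: keep the escaped character as-is (no \n, \ddd handling).
                        sb.append(contentStream.charAt(i + 1));
                        i += 2;
                        continue;
                    }
                    if (d == '(') {
                        depth++;
                    } else if (d == ')') {
                        depth--;
                        if (depth == 0) {
                            i++;
                            break;
                        }
                    }
                    sb.append(d);
                    i++;
                }
                lastString = sb.toString();
                continue;
            }
            // Any other token (number, name, operator) runs up to whitespace or '('.
            int start = i;
            while (i < n && !Character.isWhitespace(contentStream.charAt(i))
                    && contentStream.charAt(i) != '(') {
                i++;
            }
            String token = contentStream.substring(start, i);
            if (token.equals("BT")) {
                inTextObject = true;
            } else if (token.equals("ET")) {
                inTextObject = false;
            } else if (token.equals("Tj") && inTextObject && lastString != null) {
                shown.add(lastString);
            }
            lastString = null; // a string only counts if it is the operand right before Tj
        }
        return shown;
    }
}
```

A case (1) stream such as `BT /F1 12 Tf 72 700 Td (Hello) Tj ET` yields one string, whereas a case (2) stream made only of path operators such as `0 0 m 10 10 20 20 30 30 c f` yields nothing, which is why glyphs drawn as outlines need OCR or shape matching instead.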

> There is another location where a PDF can carry text but that's not 
> supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs 
> can contain text of artifacts on a page (ex. an image). That's used 
> for enabling visually impaired people to read certain documents.
It's also generally an unmangled, linebreak-free, column-free version of 
the text, which can be a real bonus. When it's there, that is, and when 
it's correct, because of course there are tools out there that generate 
ActualText entries full of invalid garbage or empty ActualText entries.
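
As a rough illustration of treating ActualText with suspicion, the sketch below pulls ActualText values out of marked-content property lists in a decoded content stream and drops the empty ones. It is plain Java with hypothetical names, not a PDFBox API. It assumes the entry is written inline as a simple literal string without nested parentheses; ActualText can also live in the structure tree or be encoded as UTF-16 or hex strings, none of which is handled here, and detecting "garbage" values in general is beyond a simple filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ActualTextSketch {

    // Matches "/ActualText (....)" where the string has no unescaped parentheses.
    private static final Pattern ACTUAL_TEXT =
            Pattern.compile("/ActualText\\s*\\(((?:\\\\.|[^\\\\()])*)\\)");

    /** Returns the non-blank ActualText values found in the given content stream text. */
    static List<String> usableActualText(String contentStream) {
        List<String> values = new ArrayList<>();
        Matcher m = ACTUAL_TEXT.matcher(contentStream);
        while (m.find()) {
            // Simplification: unescape by dropping the backslash before any character.
            String text = m.group(1).replaceAll("\\\\(.)", "$1");
            if (!text.trim().isEmpty()) {
                values.add(text); // skip empty entries, which some generators emit
            }
        }
        return values;
    }
}
```

For example, a stream containing `/Span <</ActualText (Figure 1: sales chart)>> BDC /Im1 Do EMC` followed by a second span with `/ActualText ()` would return only the first value.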

Craig Ringer
