pdfbox-users mailing list archives

From Jeremias Maerki <...@jeremias-maerki.ch>
Subject Re: How does PDFBox extract text from a PDF?
Date Tue, 10 Jul 2012 18:32:58 GMT
Hi Jochen,

there is no "extra text layer", not even a "text layer" in PDF. Text
painting operators are just operators like those for painting lines,
curves and bitmaps.

When I wrote that OCR programs write white-on-white text behind the
scanned bitmap, I meant that the text operators are usually placed
before the operator that paints the bitmap. The bitmap lies over the
text simply because it was painted after the text. There is probably
no actual "layer". The so-called "optional content groups" (OCG, since
PDF 1.5) are sometimes used to create something like a "layer" which
can be switched on and off. Good OCR programs
probably create an OCG if the text

If you want to know if a PDF is OCRed, just run PDFBox's text extraction.
If you get no text, you can probably try to run the OCR process. The
result of running OCR on an OCRed PDF is application-specific. There's
no single answer for that.
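
A minimal Java sketch of that check, assuming the text has already been
extracted with PDFBox's PDFTextStripper.getText(document). The class name
OcrDecision and the MIN_CHARS threshold are made up for illustration, and
the right threshold depends on your documents:

```java
// Sketch only: decides whether a PDF still needs OCR, based on the text
// that a text extractor (e.g. PDFBox's PDFTextStripper) returned for it.
final class OcrDecision {

    // Hypothetical threshold: fewer non-whitespace characters than this
    // is treated as "no text found".
    static final int MIN_CHARS = 1;

    /** Returns true when the extracted text is null, empty or whitespace only. */
    static boolean needsOcr(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        int visible = 0;
        for (int i = 0; i < extractedText.length(); i++) {
            if (!Character.isWhitespace(extractedText.charAt(i))) {
                visible++;
            }
        }
        return visible < MIN_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(needsOcr("\n \n"));            // scanned page without OCR text
        System.out.println(needsOcr("FS Hotel-Stuttgart")); // page that already carries text
    }
}
```

Note that this only tells you whether text operators produced any output. It
cannot tell whether that text came from an OCR tool or from the original
authoring application.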

Here's an extract (with my comments) of a scanned page that I've run
through Readiris Pro 12 (not the best OCR tool BTW):

BT                         % begin text object
3 Tr                       % text rendering mode 3: invisible (neither fill nor stroke)
1 0 0 1 0 846 Tm           % text matrix (position, scale...)
138.48 -58.56 Td           % move text position
/F00 23 Tf                 % select font /F00, size 23 (internally mapped to TimesNewRoman)
(FS) Tj                    % write "FS"
34.8 0 Td                  % move text position
(Hotel-) Tj                % write "Hotel-" etc. etc.
75.12 0 Td
(Stuttgart) Tj
93.6 0 Td
(-) Tj
13.44 0 Td
(Böblingen) Tj
56.4 -65.28 Td
/F10 9 Tf
(Wolf) Tj
23.04 0 Td
(-) Tj
8.88 0 Td
(Hirth) Tj
23.28 0 Td
(-) Tj
8.64 0 Td
(Straße) Tj


6 0 Td                    % move text position
(Stuttgart) Tj            % write "Stuttgart"
ET                        % end text object
q                         % save graphics state
 601.92 0 0 846 0 0 cm    % concatenate transformation matrix (position, scale etc.)
 /img0 Do                 % Paint bitmap /img0 (the scanned page)
Q                         % restore graphics state

So, just a bitmap painted over the recognized text. No layers, they
didn't even bother to paint the text in white.
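
To make the painting order concrete, here is a rough Java sketch that walks
a shortened copy of the stream above and lists what gets painted, in order.
This is not PDFBox's implementation. The class and method names are
hypothetical, only the operators from the excerpt are handled, and string
escapes, nested parentheses and glyph widths are ignored:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: a toy walker for the operators used in the excerpt.
final class ContentStreamWalk {

    // Shortened copy of the excerpt (first two text runs, then the bitmap).
    static final String SAMPLE = String.join("\n",
            "BT", "3 Tr", "1 0 0 1 0 846 Tm", "138.48 -58.56 Td", "/F00 23 Tf",
            "(FS) Tj", "34.8 0 Td", "(Hotel-) Tj", "ET",
            "q", "601.92 0 0 846 0 0 cm", "/img0 Do", "Q");

    /** Splits a stream into tokens; string literals keep their leading '('. */
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '%') {                      // comment runs to end of line
                while (i < s.length() && s.charAt(i) != '\n') i++;
                continue;
            }
            int start = i;
            if (c == '(') {                      // literal string, no escapes in this sketch
                while (i < s.length() && s.charAt(i) != ')') i++;
                out.add(s.substring(start, i));
                i++;                             // skip ')'
                continue;
            }
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))
                    && s.charAt(i) != '(' && s.charAt(i) != '%') i++;
            out.add(s.substring(start, i));
        }
        return out;
    }

    /** Numbers, names (/F00) and strings are operands; everything else is an operator. */
    static boolean isOperand(String t) {
        char c = t.charAt(0);
        return c == '/' || c == '(' || c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }

    /** Returns one entry per painted item, in painting order. */
    static List<String> walk(String stream) {
        List<String> painted = new ArrayList<>();
        List<String> args = new ArrayList<>();
        double[] lm = {1, 0, 0, 1, 0, 0};        // text line matrix: a b c d e f
        for (String t : tokenize(stream)) {
            if (isOperand(t)) { args.add(t); continue; }
            switch (t) {
                case "BT":                       // a text object starts with identity matrices
                    lm = new double[] {1, 0, 0, 1, 0, 0};
                    break;
                case "Tm":                       // set the matrix directly
                    for (int k = 0; k < 6; k++) lm[k] = Double.parseDouble(args.get(k));
                    break;
                case "Td": {                     // move relative to the start of the current line
                    double tx = Double.parseDouble(args.get(0));
                    double ty = Double.parseDouble(args.get(1));
                    lm[4] += tx * lm[0] + ty * lm[2];
                    lm[5] += tx * lm[1] + ty * lm[3];
                    break;
                }
                case "Tj":                       // show a string at the current line start
                    painted.add(String.format(Locale.ROOT, "text %.2f %.2f %s",
                            lm[4], lm[5], args.get(0).substring(1)));
                    break;
                case "Do":                       // paint an XObject, here the scanned bitmap
                    painted.add("image " + args.get(0));
                    break;
                default:                         // Tr, Tf, ET, q, Q, cm: ignored in this sketch
                    break;
            }
            args.clear();
        }
        return painted;
    }

    public static void main(String[] unused) {
        for (String item : walk(SAMPLE)) {
            System.out.println(item);
        }
    }
}
```

The two text runs come out first and the image last, which is all that
"text behind the bitmap" means here: same page, same content stream, just an
earlier position in the painting order.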

Jochen, fire up PDFBox's PDFDebugger [1], load a few PDFs and browse
through the object tree. Look around. That'll give you a feel for
what's in a PDF. Then download the PDF specification. It's not written
in hieroglyphs or Klingon. ;-)

[1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html

Jeremias Maerki

On 10.07.2012 19:41:24 Jochen Hebbrecht wrote:
> Hi Jeremias,
> No, I'm not having any trouble at all :-). Just curious about the working
> mechanism of PDFBox. And how Adobe created its PDF format.
> At this page
> (http://en.wikipedia.org/wiki/Portable_Document_Format#Adobe.27s_versions),
> you can see all previous (and current) versions of the PDF format. Can any
> of these formats support a text layer? What does Adobe call this "extra text
> layer"? There's no information on Wikipedia giving the technical details
> of this "text layer".
> Can we detect with PDFBox whether an image has been OCR'd? Or do we just try
> to get the contents and, if the contents are null, try to OCR with some kind
> of OCR engine?
> And what happens if we try to OCR a PDF which was already OCR'd? Do we have
> an extra "text layer"? So 1 image, 1 layer with first OCR and 1 layer with
> secondary OCR?
> Jochen
> -----Original message-----
> From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch]
> Sent: Tuesday, 10 July 2012 16:11
> To: users@pdfbox.apache.org
> Subject: Re: How does PDFBox extract text from a PDF?
> On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
> > My first question is: how is text stored in a PDF? I think there are 2 
> > ways to store text in a PDF:
> > a) vector PDF: the PDF contains a line telling it to print a word in a 
> > specific font on a specific location
> That's the usual case, yes.
> > b) OCR text has been added to the image as an extra layer (I think 
> > this is called the XMP metadata)
> No, actually OCR software usually just adds white-on-white text behind
> the bitmap. This would technically be like your a).
> XMP Metadata is really just for metadata, not actual text content.
> > Is this information correct?
> > 
> > So, if PDFBox wants to extract text from a PDF, how does it extract 
> > the data? Is it looking at the XMP metadata? Or the vector details?
> > Any developer wanting to help me on this issue?
> PDFBox interprets the text painting operators (as if it were painting the
> PDF), looks up the actual character for a code point (character "a"
> might be at code point 7 (or whatever) when a subset CID font is used, for
> example) and emits that as Unicode text. Well, that's simplified.
> There are some additional heuristics for things like placement and order of
> text but that doesn't really affect the actual process of extracting text.
> There is another location where a PDF can carry text but that's not
> supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can
> contain text of artifacts on a page (e.g. an image). That's used for enabling
> visually impaired people to read certain documents.
> I guess the question is: what are you trying to do? Do you have a problem
> you're trying to solve?
> If you want to learn about how text is put into a PDF, run PDFBox's
> PDFDebugger and open a random PDF. That allows you to explore all the
> details of a PDF. Quite enlightening if you don't know the PDF specification
> by heart.
> Jeremias Maerki
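
The code-point lookup described in the quoted message can be sketched in
Java as a per-font table from character code to Unicode. This is only an
illustration: the class name, the one-code-per-int simplification and the
table contents are hypothetical, and in a real PDF the mapping would come
from the font's encoding or ToUnicode data.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: maps character codes of a (hypothetical) subset font to Unicode.
final class CodePointLookup {

    /** Decodes character codes through a per-font code-to-Unicode table. */
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Unmapped codes become U+FFFD so that gaps stay visible.
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up table: in this subset font, "a" happens to sit at code 7.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(7, "a");
        toUnicode.put(8, "b");
        System.out.println(decode(new int[] {7, 8, 7}, toUnicode));
    }
}
```

Without such a table the codes in a content stream are just glyph indexes,
which is why some PDFs yield garbage on text extraction even though they
display correctly.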