pdfbox-users mailing list archives

From Jochen Hebbrecht <jochenhebbre...@gmail.com>
Subject Fwd: How does PDFBox extract text from a PDF?
Date Tue, 10 Jul 2012 13:36:02 GMT
My first question is: how is text stored in a PDF? I think there are two
ways to store text in a PDF:
a) vector PDF: the PDF contains an instruction to draw a word in a
specific font at a specific location
b) OCR text has been added to the image as an extra layer (I think this is
called the XMP metadata)

Is this information correct?
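
To make option (a) concrete, here is a rough sketch of what I understand
such an instruction to look like, and a naive way to read it back. The
SAMPLE string is a hand-written, hypothetical page content stream using the
standard PDF text operators (BT/ET, Tf, Td, Tj); the class and method names
are my own invention. This is not how PDFBox does it: real streams are
usually compressed, and the bytes inside the parentheses still have to be
mapped through the font's encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamSketch {

    // Hypothetical sample: select font F1 at 12pt, move to (72, 700),
    // then show the string "Hello".
    public static final String SAMPLE =
        "BT\n/F1 12 Tf\n72 700 Td\n(Hello) Tj\nET\n";

    // Matches a literal string followed by the Tj (show text) operator.
    // Simplification: ignores escaped parentheses, TJ arrays and hex strings.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(([^()]*)\\)\\s*Tj");

    // Returns the raw strings passed to Tj, in stream order.
    public static List<String> shownStrings(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(shownStrings(SAMPLE)); // prints [Hello]
    }
}
```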

So, if PDFBox wants to extract text from a PDF, how does it extract the
data? Does it look at the XMP metadata, or at the vector details?
Could any developer help me with this?
