pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: extracting text from image using pdfbox
Date Fri, 12 Oct 2012 19:35:10 GMT
On Fri, Oct 12, 2012 at 2:47 PM, Kishore Babu <kbabu@envistacorp.com> wrote:

> Hi All,
>
> Is it possible to extract text from an image (JPEG) using pdfbox or is
> there any open source java code for this?

This is a very difficult problem, and solving it completely requires a
large amount of applied artificial intelligence. There are no
out-of-the-box answers.

However, in limited domains there may be heuristic solutions. I am doing
exactly this for scientific diagrams (and using PDFBox for parts of it)
as an Open Source project. The approach works best when:
* there are lots of diagrams relating to the same subject
* the graphics strokes and characters are preserved as PDF primitives
(paths and characters)
* the characters are in common simple fonts (e.g. Helvetica)

Thus we now have tools which extract and interpret chemical structures
and scientific diagrams (graphs) with a promising degree of precision.

If the characters are present as bitmaps then it is much harder. OCR works
best when:
* the fonts are simple and well-known
* there is clear whitespace between the characters
* the characters are aligned with the page axes and are not distorted
* the image has not been through lossy compression.
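
To illustrate why whitespace and axis alignment matter, here is a minimal
sketch of the simplest segmentation heuristic: splitting one line of text
into glyphs wherever a pixel column contains no ink. The class name
GlyphSegmenter and the boolean-grid input are assumptions made for this
example. They are not part of PDFBox or of my project, and a real OCR engine
does far more than this.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a binarised text line into glyphs at blank columns. */
final class GlyphSegmenter {

    /**
     * @param ink ink[y][x] is true where the pixel is dark (part of a character);
     *            all rows are assumed to have the same length
     * @return {startColumn, endColumnExclusive} pairs, one per run of inked columns
     */
    static List<int[]> segmentColumns(boolean[][] ink) {
        List<int[]> runs = new ArrayList<>();
        if (ink.length == 0) {
            return runs;
        }
        int width = ink[0].length;
        int start = -1; // -1 means we are currently inside whitespace
        for (int x = 0; x < width; x++) {
            boolean columnHasInk = false;
            for (boolean[] row : ink) {
                if (row[x]) {
                    columnHasInk = true;
                    break;
                }
            }
            if (columnHasInk && start < 0) {
                start = x;                      // a glyph begins here
            } else if (!columnHasInk && start >= 0) {
                runs.add(new int[] {start, x}); // a glyph ended at the previous column
                start = -1;
            }
        }
        if (start >= 0) {
            runs.add(new int[] {start, width}); // glyph touching the right edge
        }
        return runs;
    }
}
```

If the characters touch, or the line is skewed so that no column is ever
completely blank, this returns one long run instead of separate glyphs,
which is exactly the failure the conditions above describe.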

I am going to attempt to decipher images in PDFs, using PDFBox to extract
the images and then line detection algorithms such as
http://en.wikipedia.org/wiki/Canny_edge_detector to find lines and
characters. I am optimistic of significant progress, but it will be slow and
will require heuristics.
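
As a rough idea of the starting point, the sketch below computes the Sobel
gradient magnitude of an image. This is only the gradient stage that Canny
builds on, not the full detector: Canny also needs Gaussian smoothing,
non-maximum suppression and hysteresis thresholding. The class name
EdgeSketch is hypothetical, and the BufferedImage is assumed to have been
extracted from the PDF already (that step is not shown).

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: Sobel gradient magnitude, the gradient stage used by Canny. */
final class EdgeSketch {

    /** Returns mag[y][x], the gradient magnitude per pixel; border pixels stay 0. */
    static double[][] sobelMagnitude(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();

        // Convert to grey levels by averaging the red, green and blue channels.
        double[][] gray = new double[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                gray[y][x] = (r + g + b) / 3.0;
            }
        }

        double[][] mag = new double[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                // Horizontal derivative: right column minus left column.
                double gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1])
                          - (gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1]);
                // Vertical derivative: row below minus row above.
                double gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1])
                          - (gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1]);
                mag[y][x] = Math.hypot(gx, gy);
            }
        }
        return mag;
    }
}
```

Large values mark candidate edge pixels. Deciding which of them belong to
lines and which to characters is where the heuristics come in.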

The things that make the process harder or impossible are:
* scanned images - the images are often skewed and have variable contrast
* lossy compression such as JPEG. (Look closely at a JPEG and you will see
small satellite pixels from the block-based DCT compression. These make OCR
much harder.)
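
One crude heuristic against such satellite pixels, shown purely as an
illustration, is to drop every dark pixel that has no dark neighbour. The
class name Despeckle and the boolean-grid input (an already thresholded
image) are assumptions for this sketch. It will also delete genuine
single-pixel marks such as small dots, so it is no substitute for a lossless
source.

```java
/** Hypothetical helper: removes isolated dark pixels from a thresholded image. */
final class Despeckle {

    /** Returns a copy of ink in which dark pixels with no dark 8-neighbour are cleared. */
    static boolean[][] removeIsolated(boolean[][] ink) {
        boolean[][] out = new boolean[ink.length][];
        for (int y = 0; y < ink.length; y++) {
            out[y] = new boolean[ink[y].length];
            for (int x = 0; x < ink[y].length; x++) {
                out[y][x] = ink[y][x] && hasDarkNeighbour(ink, x, y);
            }
        }
        return out;
    }

    /** True if any of the up to eight surrounding pixels is dark. */
    private static boolean hasDarkNeighbour(boolean[][] ink, int x, int y) {
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                if (dx == 0 && dy == 0) {
                    continue; // skip the pixel itself
                }
                int ny = y + dy;
                int nx = x + dx;
                boolean inside = ny >= 0 && ny < ink.length && nx >= 0 && nx < ink[ny].length;
                if (inside && ink[ny][nx]) {
                    return true;
                }
            }
        }
        return false;
    }
}
```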

BTW, if any other reader is interested in hacking scientific, technical and
medical (STM) PDFs using Java code layered on PDFBox, and is prepared to put
in effort at alpha level, I'd be delighted to hear from you. But it is *alpha*
at best - there are a lot of heuristics that change frequently.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
