pdfbox-users mailing list archives

From Kishore Babu <kb...@envistacorp.com>
Subject RE: extracting text from image using pdfbox
Date Mon, 15 Oct 2012 05:17:02 GMT
Hi Peter, 
Thank you very much for the reply. Unfortunately, the images I am dealing with are scanned ones.


I will post my results if I succeed in using the line detection algorithms you mentioned.


Thanks & Regards,

Kishore Babu | Developer
email: kbabu@envistacorp.com
office: 040.66417681
www.envistacorp.com

-----Original Message-----
From: peter.murray.rust@googlemail.com [mailto:peter.murray.rust@googlemail.com] On Behalf Of Peter Murray-Rust
Sent: Saturday, 13 October, 2012 1:05 AM
To: users@pdfbox.apache.org
Subject: Re: extracting text from image using pdfbox

On Fri, Oct 12, 2012 at 2:47 PM, Kishore Babu <kbabu@envistacorp.com> wrote:

> Hi All,
>
> Is it possible to extract text from an image (JPEG) using pdfbox or is
> there any open source java code for this?

This is a very difficult problem, and solving it completely requires a
large amount of applied artificial intelligence. There are no out-of-the-box answers.

However, in limited domains there may be heuristic solutions. I am doing exactly this for scientific
diagrams (and using PDFBox for parts of this) as an Open Source project. The project will
go best when:
* there are lots of diagrams relating to the same subject
* the graphics strokes and characters are preserved as PDF primitives (paths and characters)
* the characters are in common simple fonts (e.g. Helvetica)

Thus we now have tools which will extract and interpret chemical structures and scientific
diagrams (graphs) with a promising degree of precision.

If the characters are present as bitmaps then it is much harder. OCR works best when:
* the fonts are simple and well-known
* there is clear whitespace between the characters
* the characters are aligned with the page axes and are not distorted
* the image has not been through lossy compression.

I am going to attempt to decipher images in PDFs, using PDFBox to extract the images and then
line detection algorithms such as http://en.wikipedia.org/wiki/Canny_edge_detector to find
lines and characters. I am optimistic about making significant progress, but it will be slow and will
require heuristics.
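
To make the edge-detection step concrete, here is a minimal sketch (not the project's code) of the Sobel gradient stage that a Canny detector builds on. It omits Canny's smoothing, non-maximum suppression and hysteresis steps and uses a plain fixed threshold instead. It assumes the image has already been extracted from the PDF into a java.awt.image.BufferedImage; the class and method names are made up for illustration.

```java
import java.awt.image.BufferedImage;

// Illustrative sketch only: Sobel gradient magnitude plus a fixed threshold.
// SobelSketch and its methods are hypothetical names, not part of PDFBox.
class SobelSketch {

    /** Converts a packed RGB pixel to a 0..255 grey level (simple channel average). */
    static int grey(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3;
    }

    /** Returns the gradient magnitude per pixel, indexed [y][x]; border pixels stay 0. */
    static double[][] gradientMagnitude(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[][] g = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                g[y][x] = grey(img.getRGB(x, y));
            }
        }
        double[][] mag = new double[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                // Horizontal Sobel kernel: responds to vertical edges.
                int gx = -g[y - 1][x - 1] + g[y - 1][x + 1]
                       - 2 * g[y][x - 1] + 2 * g[y][x + 1]
                       - g[y + 1][x - 1] + g[y + 1][x + 1];
                // Vertical Sobel kernel: responds to horizontal edges.
                int gy = -g[y - 1][x - 1] - 2 * g[y - 1][x] - g[y - 1][x + 1]
                       + g[y + 1][x - 1] + 2 * g[y + 1][x] + g[y + 1][x + 1];
                mag[y][x] = Math.hypot(gx, gy);
            }
        }
        return mag;
    }

    /** Marks pixels whose gradient magnitude exceeds the given threshold. */
    static boolean[][] edges(BufferedImage img, double threshold) {
        double[][] mag = gradientMagnitude(img);
        boolean[][] out = new boolean[mag.length][mag[0].length];
        for (int y = 0; y < mag.length; y++) {
            for (int x = 0; x < mag[y].length; x++) {
                out[y][x] = mag[y][x] > threshold;
            }
        }
        return out;
    }
}
```

The threshold is exactly the kind of heuristic that tends to need tuning per document set; a value that works for clean vector-rendered diagrams will probably not suit noisy scans.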

The things that make the process harder or impossible are:
* scanned images - the images are often skewed and have variable contrast
* lossy compression such as JPEG. (Look at a JPEG and you will see small satellite pixels
around sharp edges, produced by the block-based DCT compression. These make OCR much harder.)
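
The JPEG effect is easy to check for yourself. The following is a small sketch using only the JDK's javax.imageio; the class and method names are invented for illustration. It draws a one-pixel black diagonal line on a white background, round-trips it through the default JPEG writer, and counts the pixels that are no longer pure black or pure white.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Illustrative sketch only: JpegArtifactSketch is a hypothetical helper, not a PDFBox class.
class JpegArtifactSketch {

    /** Builds a white square image with a one-pixel black diagonal line. */
    static BufferedImage diagonalLine(int size) {
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                img.setRGB(x, y, x == y ? 0x000000 : 0xFFFFFF);
            }
        }
        return img;
    }

    /** Encodes the image as JPEG in memory (default quality) and decodes it again. */
    static BufferedImage roundTripJpeg(BufferedImage src) {
        try {
            ImageIO.setUseCache(false); // keep everything in memory, no temp files
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(src, "jpg", out)) {
                throw new IOException("no JPEG writer available");
            }
            return ImageIO.read(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts pixels that are neither pure black nor pure white. */
    static int countStrayPixels(BufferedImage img) {
        int count = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y) & 0xFFFFFF;
                if (rgb != 0x000000 && rgb != 0xFFFFFF) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Calling countStrayPixels on the original image gives 0, while the round-tripped copy should show a halo of off-white and grey pixels around the line. Those are the pixels an OCR or line-detection step then has to cope with.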

BTW if any other reader is interested in hacking scientific, technical and medical (STM) PDFs using
Java code layered on PDFBox, and is prepared to put in effort at alpha level, I'd be delighted
to hear from you. But it is *alpha* at best - there are a lot of heuristics that change frequently.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
