pdfbox-dev mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: OCR for PDFBox : Progress
Date Wed, 11 Jun 2014 01:46:45 GMT
Hi Dimuthu,

I cloned your code and did some experiments with it - it’s working nicely. I’m glad that
subclassing PDFTextStripper has been a success; it’s a nice, clean implementation.

> Tesseract API [1]
> 
> 1. Currently all necessary functions were implemented and test cases were written in order to check proper functionality
> 
> 2. Support for Mac and linux operating systems. In future I'll try to add support for Windows also

That’s fine for now.

> 3. All static libs for Tesseract and Leptonica were pre built and added to resources folder.

Perfect.

> 4. At build phase it dynamically identify correct libs that support to particular Operating system
> 
> 5. If some one needs to build above static libs manually, instructions were given in read me.
> 
> 6. In future, I'll work on adding those static libs creation when project is built. Currently they must be manually built.

That would be handy.
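
On point 4, here is a rough sketch of that kind of platform check in plain Java, in case the same selection is ever useful at runtime as well as at build time. The class name and the directory names ("macosx", "linux") are made up for illustration and are not taken from your repository; only the os.name system property is standard.

```java
import java.util.Locale;

// Hypothetical helper: maps the JVM's os.name value to the resource
// sub-directory holding the prebuilt Tesseract/Leptonica static libs.
final class NativeLibSelector {

    // Directory names are assumptions, not the plugin's actual layout.
    static String platformDir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) {
            return "macosx";
        }
        if (os.contains("linux")) {
            return "linux";
        }
        // Windows is not supported yet, so fail loudly rather than guess.
        throw new UnsupportedOperationException(
                "No prebuilt Tesseract/Leptonica libs for: " + osName);
    }

    public static void main(String[] args) {
        System.out.println(platformDir(System.getProperty("os.name")));
    }
}
```

Failing with an exception on an unknown platform seems friendlier than a link error later on, but that is only a suggestion.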

> OCR plugin [2]
> 
> 1. Almost finished implementing. 
> 
> 2. Working fine with sample PDF files I have given. Is there any set of PDF files that can be used to test accuracy and performance?

Currently, no, but I’ll take a look in my collection of test files…

> In addition to that, there are some code formatting and commenting stuff to be done.

It might be nice to add a command line utility to your OCR-Plugin. You could copy ExtractText.java
from org.apache.pdfbox.tools, rename it to OCRText, and have it use your PDFOCRTextStripper
class instead of PDFTextStripper. That way your plugin is immediately usable by end-users.
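
To give an idea of the shape, here is a minimal sketch of the argument handling such an OCRText tool might start with. It deliberately leaves out the PDFBox calls so that it compiles on its own; the class name OCRTextArgs, the usage string, and the default of replacing ".pdf" with ".txt" are my own assumptions, and I have not checked what constructor PDFOCRTextStripper actually offers.

```java
import java.util.Locale;

// Hypothetical argument holder for an OCRText command line tool.
final class OCRTextArgs {
    final String input;
    final String output;

    private OCRTextArgs(String input, String output) {
        this.input = input;
        this.output = output;
    }

    // Accepts: <input.pdf> [output.txt]
    static OCRTextArgs parse(String[] args) {
        if (args.length < 1 || args.length > 2) {
            throw new IllegalArgumentException("Usage: OCRText <input.pdf> [output.txt]");
        }
        String output = args.length == 2 ? args[1] : defaultOutput(args[0]);
        return new OCRTextArgs(args[0], output);
    }

    // Assumed default: swap a trailing ".pdf" for ".txt", else append ".txt".
    static String defaultOutput(String input) {
        boolean isPdf = input.toLowerCase(Locale.ROOT).endsWith(".pdf");
        String base = isPdf ? input.substring(0, input.length() - 4) : input;
        return base + ".txt";
    }

    public static void main(String[] args) {
        try {
            OCRTextArgs parsed = parse(args);
            // The real tool would load parsed.input with PDDocument here and
            // pass the document plus a Writer for parsed.output to the
            // stripper's writeText method, using PDFOCRTextStripper where
            // ExtractText uses PDFTextStripper.
            System.out.println(parsed.input + " -> " + parsed.output);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

The only line that should really differ from ExtractText is the one that constructs the stripper.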

-- John

