pdfbox-dev mailing list archives

From DImuthu Upeksha <dimuthu.upeks...@gmail.com>
Subject OCR for PDFBox : Progress
Date Mon, 02 Jun 2014 17:03:25 GMT
Hi John,
Here is the progress of the OCR plugin for PDFBox.
The project consists of two subprojects:

1. Tesseract API for Java
2. OCR plugin for PDFBox using the Tesseract API

*Tesseract API [1]*

1. All necessary functions are now implemented, and test cases have been
written to check that they work properly.

2. Mac and Linux operating systems are supported. In the future I'll try to
add support for Windows as well.

3. All static libs for Tesseract and Leptonica are prebuilt and added to
the resources folder.

4. During the build phase, the project dynamically identifies the correct
libs for the particular operating system.

5. If someone needs to build the above static libs manually, instructions
are given in the README.

6. In the future, I'll work on creating those static libs as part of the
project build. Currently they must be built manually.
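
To illustrate point 4, here is a minimal sketch of how the matching prebuilt
libs could be chosen for the current operating system. It is only an
illustration and not the code in the repository: the class name
NativeLibSelector, the method resourceDirFor and the folder names "mac" and
"linux" are assumptions, and it does the selection in Java rather than in
the build itself.

```java
import java.util.Locale;

public class NativeLibSelector {

    // Maps an os.name value to a resource sub-folder holding prebuilt libs.
    // The folder names "mac" and "linux" are hypothetical placeholders.
    static String resourceDirFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) {
            return "mac";
        }
        if (os.contains("linux")) {
            return "linux";
        }
        // Windows and other systems are not supported yet.
        throw new UnsupportedOperationException(
                "No prebuilt libs for operating system: " + osName);
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        try {
            System.out.println("Would use libs from resources/"
                    + resourceDirFor(osName));
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the mapping in one small method means adding Windows support later
would only need one more branch and one more folder of prebuilt libs.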

*OCR plugin [2]*

1. Implementation is almost finished.

2. It works fine with the sample PDF files I have given it. Is there a set
of PDF files that can be used to test accuracy and performance?

In addition, some code formatting and commenting remains to be done.

[1] https://github.com/DImuthuUpe/Tesseract-API
[2] https://github.com/DImuthuUpe/OCR-Plugin

W.Dimuthu Upeksha
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
