pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: OCRing extracted inline images vs. fully rendered pages?
Date Tue, 17 May 2016 16:26:58 GMT

> On 17 May 2016, at 05:25, Allison, Timothy B. <tallison@mitre.org> wrote:
> 
> All,
>  On Tika, users can choose to run OCR on inline images (and attached images, of course).
>  Would it be better for us to render each full page and then run OCR on that?

We have an experimental Tesseract integration, written a while ago by a GSoC student.
Because it requires building C++ code, we have not integrated it into trunk, but it is
on the to-do list for 2.1. The advantage of this approach is that we can keep any text
already embedded in the PDF and supplement it with the OCR output.

https://github.com/DImuthuUpe/OCR-Plugin
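
As a rough illustration of keeping a page's embedded text and adding OCR output to it,
here is a minimal sketch in plain Java. It is not taken from the plugin above and uses
no PDFBox or Tesseract API: the class PageTextMerger and its methods are hypothetical,
and it assumes the embedded text and the OCR text for a page have already been obtained
as strings. The line-level, case-insensitive duplicate check is also an assumption; a
real integration would probably compare text positions instead.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of PDFBox, Tika, or the OCR plugin.
final class PageTextMerger {

    // Lowercase and collapse whitespace runs so comparisons ignore layout differences.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Keep the embedded text unchanged and append only the OCR lines
    // that the embedded text does not already contain.
    static String mergePage(String embeddedText, String ocrText) {
        String embedded = embeddedText == null ? "" : embeddedText.trim();
        String ocr = ocrText == null ? "" : ocrText.trim();
        String seen = normalize(embedded);
        StringBuilder merged = new StringBuilder(embedded);
        for (String line : ocr.split("\\R")) {
            String key = normalize(line);
            if (key.isEmpty() || seen.contains(key)) {
                continue; // blank line, or text already covered
            }
            if (merged.length() > 0) {
                merged.append('\n');
            }
            merged.append(line.trim());
            seen = seen + " " + key; // do not append the same OCR line twice
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Assumed inputs: text extracted from the page, and OCR text for the same page.
        String embedded = "Quarterly report\nRevenue grew in Q2.";
        String ocr = "Quarterly  Report\nFigure 1: Revenue by region";
        System.out.println(mergePage(embedded, ocr));
    }
}
```

With the sample inputs in main, the OCR line that repeats the embedded heading is
dropped and only the figure caption, which exists solely in the image, is appended.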

— John

>         Best,
> 
>                  Tim
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 
