pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: OCRing extracted inline images vs. fully rendered pages?
Date Tue, 17 May 2016 17:57:38 GMT

> We have an experimental integration with Tesseract which was created a while ago by a
> GSoC student. Because it requires building C++ we’ve not integrated it into trunk, but
> do have it on the todo list for 2.1.

Ah, very cool. Yes, I'd trust you all to do a better job of integrating OCR for PDFs than we
would. :)

> The advantage of this approach is that we can keep any embedded text in the PDF and embellish
> it with the output.

It would be neat to have an OCR-only option for documents where the text extraction yields
complete garbage (a garbage detector is on our todo list as TIKA-1443).
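
To make the idea concrete, here is a rough sketch of one way such a check might look. It is
only an illustration, not what TIKA-1443 or PDFBox actually does: the class name, the method
names, and the 0.7 threshold are all hypothetical, and a real detector would likely need
language models or dictionary checks rather than a simple character ratio.

```java
// Hypothetical heuristic: decide whether extracted text looks like garbage
// and an OCR-only pass might be worth trying instead.
final class GarbageTextHeuristic {

    // Assumed cutoff, not taken from any library; it would need tuning on real documents.
    static final double MIN_PLAUSIBLE_RATIO = 0.7;

    // Returns the fraction of code points that are letters, digits, or whitespace.
    // Empty or null input yields 0.0.
    static double plausibleRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int plausible = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
                plausible++;
            }
        }
        return (double) plausible / total;
    }

    // True when the extracted text is empty or mostly symbols/control characters,
    // in which case a caller might choose to run OCR on the rendered page.
    static boolean shouldFallBackToOcr(String extractedText) {
        return plausibleRatio(extractedText) < MIN_PLAUSIBLE_RATIO;
    }
}
```

A caller would run normal text extraction first, pass the result to shouldFallBackToOcr, and
only render and OCR the page when it returns true.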

I'll hold off then on doing anything on our end.  Thank you!


