nifi-commits mailing list archives

From "Jeremy Dyer (JIRA)" <>
Subject [jira] [Commented] (NIFI-1718) Processor(s) to perform OCR
Date Tue, 19 Apr 2016 23:14:25 GMT


Jeremy Dyer commented on NIFI-1718:

[~dgoldenberg] I came to create a JIRA for a NiFi Tesseract processor today and stumbled across
this one. Seems I'm a few days late. I have already created a pure Tesseract processor that accounts
for all of the bullet points you listed (plus the ability to pass in raw configuration key/values),
but it doesn't use Tika as you have described here. I would be glad to contribute what I have,
but I wanted to run it by you first since you specifically called out Tika and I'm not using it.
Would it be a big deal if my implementation didn't use Tika explicitly, or do you need Tika
for something else?

Just for reference here is a quick screen recording of what I have so far

> Processor(s) to perform OCR
> ---------------------------
>                 Key: NIFI-1718
>                 URL:
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
> This ticket is a follow-up to NIFI-1717.
> Apache Tika by default performs OCR on image files such as PNG, BMP, JPEG, GIF, etc. using Tesseract, assuming that it is installed and properly configured.
> Design issue: should ExtractMediaAttributes processor allow Tika to perform OCR or should OCR be handled elsewhere, whether by a processor or by a service? Could both models be allowed, where ExtractMediaAttributes supports OCR but there's also a separate PerformOCR processor and/or service?
> If OCR is supported on the ExtractMediaAttributes processor, it'd be best if it supported the following OCR related options (which are exposed by Tika's TesseractOCRConfig class):
> * tesseractPath - Path to tesseract installation folder, if not on system path.
> * language - Language ID (e.g. "eng"); language dictionary to be used.
> * pageSegMode - Tesseract page segmentation mode, defaults to 1.
> * minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> * maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to Integer.MAX_VALUE.
> * timeout - Maximum time (in seconds) to wait for the OCR process termination; defaults to 120.
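
For illustration only, the six options listed above could be modeled roughly as follows. This is a minimal sketch, not Tika's TesseractOCRConfig and not the processor discussed in this thread: the class name OcrOptions, its fields, and the shouldOcr helper are hypothetical. The defaults for pageSegMode, minFileSizeToOcr, maxFileSizeToOcr and timeout come from the list above; the "eng" language default, the empty tesseractPath meaning "use the system path", and the inclusive size bounds are assumptions.

```java
/** Hypothetical holder for the OCR options listed in the ticket (not a Tika or NiFi class). */
final class OcrOptions {
    final String tesseractPath;   // assumption: empty string means "rely on the system path"
    final String language;        // language ID such as "eng"
    final int pageSegMode;        // Tesseract page segmentation mode
    final long minFileSizeToOcr;  // bytes
    final long maxFileSizeToOcr;  // bytes
    final int timeoutSeconds;     // maximum wait for the OCR process

    OcrOptions(String tesseractPath, String language, int pageSegMode,
               long minFileSizeToOcr, long maxFileSizeToOcr, int timeoutSeconds) {
        if (tesseractPath == null || language == null) {
            throw new IllegalArgumentException("tesseractPath and language must not be null");
        }
        if (minFileSizeToOcr < 0 || minFileSizeToOcr > maxFileSizeToOcr) {
            throw new IllegalArgumentException("require 0 <= minFileSizeToOcr <= maxFileSizeToOcr");
        }
        if (timeoutSeconds <= 0) {
            throw new IllegalArgumentException("timeoutSeconds must be positive");
        }
        this.tesseractPath = tesseractPath;
        this.language = language;
        this.pageSegMode = pageSegMode;
        this.minFileSizeToOcr = minFileSizeToOcr;
        this.maxFileSizeToOcr = maxFileSizeToOcr;
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Defaults as stated in the ticket; "eng" and the empty path are assumptions. */
    static OcrOptions defaults() {
        return new OcrOptions("", "eng", 1, 0L, Integer.MAX_VALUE, 120);
    }

    /** True when a file of this size falls inside the configured window (bounds assumed inclusive). */
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr && fileSizeBytes <= maxFileSizeToOcr;
    }
}
```

A processor built along these lines could call shouldOcr with the incoming file's size before handing it to Tesseract, and skip OCR for anything outside the window.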

This message was sent by Atlassian JIRA
