nifi-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (NIFI-1815) Tesseract OCR Processor
Date Tue, 14 Feb 2017 18:47:41 GMT


ASF GitHub Bot commented on NIFI-1815:

Github user joewitt commented on the issue:
    I think we should close out this PR until licensing challenges are resolved and momentum
is restored.  It is certainly cool so even if we end up with a process that puts an abnormal
burden on the user for administration of it there may still be good use for it.  But for now
I'd advocate closing this.

> Tesseract OCR Processor
> -----------------------
>                 Key: NIFI-1815
>                 URL:
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Jeremy Dyer
>            Assignee: Jeremy Dyer
>         Attachments: 0006-changes-to-the-OCR-processor.patch,
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract library. Use
> of this processor would require that Tesseract be installed on the NiFi host. Since the
> processor will have a system dependency, care must be taken to ensure that the overall NiFi
> cluster continues to function properly in the absence of the Tesseract system dependency,
> even though the OCR processor itself will be unable to perform its duties. If the system
> dependencies are not detected, the processor should display a validation warning rather
> than failing or preventing the NiFi instance from booting properly.
> Properties exposed to configure Tesseract:
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process to terminate; defaults to 120.
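
The properties and the "warn, don't fail" behaviour described in the ticket could be sketched roughly as below. This is a minimal illustration in plain Java, not the code from the attached patch and not the NiFi processor API: the class name TesseractSettings and its methods are hypothetical, "eng" is only the example value from the ticket (no default is stated for language), and treating the size limits as inclusive is an assumption.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the properties listed in the ticket; not the code from the patch.
class TesseractSettings {
    String tesseractPath = "";                  // empty: search the system path instead
    String language = "eng";                    // example value from the ticket, assumed here
    int pageSegMode = 1;
    long minFileSizeToOcr = 0;
    long maxFileSizeToOcr = Integer.MAX_VALUE;
    int timeoutSeconds = 120;

    // Collects warnings instead of throwing, so a missing install never blocks startup.
    List<String> validate(String systemPath) {
        List<String> warnings = new ArrayList<>();
        if (!tesseractFound(systemPath)) {
            warnings.add("Tesseract executable not found; OCR will be unavailable");
        }
        if (minFileSizeToOcr > maxFileSizeToOcr) {
            warnings.add("minFileSizeToOcr exceeds maxFileSizeToOcr");
        }
        if (timeoutSeconds <= 0) {
            warnings.add("timeout must be a positive number of seconds");
        }
        return warnings;
    }

    // Looks in tesseractPath when set, otherwise in each directory of systemPath.
    boolean tesseractFound(String systemPath) {
        String searchPath = tesseractPath.isEmpty() ? systemPath : tesseractPath;
        if (searchPath == null) {
            return false;
        }
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : new String[] {"tesseract", "tesseract.exe"}) {
                File candidate = new File(dir, name);
                if (candidate.isFile() && candidate.canExecute()) {
                    return true;
                }
            }
        }
        return false;
    }

    // Size gate; treating both bounds as inclusive is an assumption.
    boolean shouldOcr(long fileSize) {
        return fileSize >= minFileSizeToOcr && fileSize <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        TesseractSettings settings = new TesseractSettings();
        for (String warning : settings.validate(System.getenv("PATH"))) {
            System.out.println("WARNING: " + warning);
        }
    }
}
```

In a real processor the warnings would presumably be surfaced through NiFi's own validation mechanism rather than printed; the point of the sketch is only that detection returns a list of problems and never throws.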

This message was sent by Atlassian JIRA
