nifi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From olegz <...@git.apache.org>
Subject [GitHub] nifi pull request: NIFI-1815
Date Tue, 31 May 2016 20:44:57 GMT
Github user olegz commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/397#discussion_r65259237
  
    --- Diff: nifi-nar-bundles/nifi-ocr-bundle/nifi-ocr-processors/src/main/java/org/apache/nifi/processors/ocr/TesseractOCRProcessor.java
---
    @@ -0,0 +1,359 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nifi.processors.ocr;
    +
    +import java.awt.image.BufferedImage;
    +import java.io.File;
    +import java.io.FileFilter;
    +import java.io.IOException;
    +import java.io.InputStream;
    +import java.io.OutputStream;
    +import java.util.ArrayList;
    +import java.util.Collections;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Set;
    +import java.util.concurrent.atomic.AtomicBoolean;
    +
    +import javax.imageio.ImageIO;
    +
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.nifi.annotation.behavior.InputRequirement;
    +import org.apache.nifi.annotation.documentation.CapabilityDescription;
    +import org.apache.nifi.annotation.documentation.Tags;
    +import org.apache.nifi.annotation.lifecycle.OnScheduled;
    +import org.apache.nifi.components.AllowableValue;
    +import org.apache.nifi.components.PropertyDescriptor;
    +import org.apache.nifi.components.ValidationContext;
    +import org.apache.nifi.components.ValidationResult;
    +import org.apache.nifi.components.Validator;
    +import org.apache.nifi.flowfile.FlowFile;
    +import org.apache.nifi.processor.AbstractProcessor;
    +import org.apache.nifi.processor.ProcessContext;
    +import org.apache.nifi.processor.ProcessSession;
    +import org.apache.nifi.processor.Relationship;
    +import org.apache.nifi.processor.exception.ProcessException;
    +import org.apache.nifi.processor.io.StreamCallback;
    +import org.apache.nifi.processor.util.StandardValidators;
    +
    +import net.sourceforge.tess4j.ITesseract;
    +import net.sourceforge.tess4j.Tesseract;
    +import net.sourceforge.tess4j.TesseractException;
    +
    +@Tags({"ocr", "tesseract", "image", "text"})
    +@InputRequirement(InputRequirement.Requirement.INPUT_REQUIRED)
    +@CapabilityDescription("Extracts text from images using Optical Character Recognition
(OCR). The images are pulled from the incoming" +
    +        " Flowfile's content. Supported image types are TIFF, JPEG, GIF, PNG, BMP, and
PDF. Any Flowfile that doesn't contain" +
    +        " a supported image type in its content body will be routed to the 'unsupported
image format' relationship and no OCR will be performed." +
    +        " This processor uses Tesseract to perform its duties and part of that requires
that a valid Tesseract data (Tessdata) directory" +
    +        " be specified in the 'Tessdata Directory' Property. This processor considers
a valid Tessdata directory to be an existing directory on the" +
    +        " local NiFi instance that contains one or more files ending with the '.traineddata'
extension. The list of supported languages" +
    +        " is built from the Tessdata directory configured by listing all files ending
with '.traineddata' and considering those" +
    +        " Tesseract language models. You can create you own Tesseract language models
and place them in your Tessedata directory" +
    +        " and the processor will display it in the dropdown list of languages available.
All valid Tesseract configuration values" +
    +        " may be passed to this processor by use of the 'Tesseract configuration values'
which accepts a comma separated list" +
    +        " of key=value pairs representing Tesseract configurations. 'Tesseract configuration
values' is where all of your tuning" +
    +        " values can be passed in to help increase the accuracy of your OCR operations
based on your expected input images." +
    +        " TesseractOCRProcessor only supports installations of Tesseract version 3.0
and greater.")
    +public class TesseractOCRProcessor extends AbstractProcessor {
    +
    +    public static Set<String> SUPPORTED_LANGUAGES;
    +    private static final String TESS_LANG_EXTENSION = ".traineddata";
    +    private static List<AllowableValue> PAGE_SEGMENTATION_MODES;
    +    private static volatile ITesseract tesseract;
    --- End diff --
    
    That is not what I was asking about. 
    The line in question is this:
    ```
    private static volatile ITesseract tesseract
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message