manifoldcf-dev mailing list archives

From "Konrad Holl (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CONNECTORS-1287) Additional TikaOCR Configuration Options
Date Thu, 31 Mar 2016 07:27:25 GMT

     [ https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konrad Holl updated CONNECTORS-1287:
------------------------------------
    Description: 
For a client project I needed to enable OCR for images inside PDFs. Unfortunately ManifoldCF
does not provide configuration options to handle this. It would be nice to have these options
for the Tika content extraction:

1.	Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2.	Set default language for tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
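
As a rough illustration of what the two options above would control, here is a minimal sketch against the Tika 1.7 API linked in the list. It is not ManifoldCF code: the class name PdfOcrSketch, the language code "deu" and the command-line argument are assumptions made for the example, and it presumes the tika-parsers artifact is on the classpath and tesseract is installed.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical stand-alone example; ManifoldCF would set these from its own configuration.
public class PdfOcrSketch {
    public static void main(String[] args) throws Exception {
        // Option 1: hand images embedded in PDF pages to the embedded-document parsers.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        // Option 2: default OCR language ("deu" is only an example value).
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage("deu");

        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser itself lets embedded images be parsed recursively.
        context.set(Parser.class, parser);

        // -1 disables the default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```

Run with the path of a scanned PDF as the only argument; the extracted text, including any OCR output, is printed to standard output.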

Tika OCR is based on tesseract, an open source OCR library initially developed by Hewlett-Packard
and later continued by Google. It is available from https://github.com/tesseract-ocr/tesseract.
It needs to be installed with the tesseract binary available in the PATH environment variable;
alternatively, the path to the binary can be set using a Tika API method. Once it is installed
and Tika is configured correctly, it works like a charm.
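
To check the PATH requirement before a crawl, a lookup along the following lines could be used. This is only a sketch using the Java standard library: the class TesseractLocator and the method findOnPath are hypothetical names introduced for illustration, not part of Tika or ManifoldCF.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: looks for an executable in a PATH-style directory list.
public class TesseractLocator {

    /** Returns the first executable regular file matching one of the names, if any. */
    static Optional<Path> findOnPath(String pathValue, String... names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit = findOnPath(System.getenv("PATH"), "tesseract", "tesseract.exe");
        System.out.println(hit.map(p -> "tesseract found at " + p)
                              .orElse("tesseract not found on PATH"));
    }
}
```

If nothing is found, the location of the binary would have to be passed to Tika explicitly, as mentioned above.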

When indexing images or PDFs containing images instead of real text, OCR is necessary for
making those documents searchable.


  was:
For a client project I needed to enable OCR for images inside PDFs. Unfortunately ManifoldCF
does not provide configuration options to handle this. It would be nice to have these options
for the Tika content extraction:

1.	Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2.	Set default language for tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29

Tika OCR is based on tesseract, an open source OCR library initially developed by Hewlett-Packard
and later continued by Google. It is available from https://github.com/tesseract-ocr/tesseract.
It needs to be installed with the tesseract binary available in the PATH environment variable;
alternatively, the path to the binary can be set using a Tika API method.

When indexing images or PDFs containing images instead of real text, OCR is necessary for
making those documents searchable.



> Additional TikaOCR Configuration Options
> ----------------------------------------
>
>                 Key: CONNECTORS-1287
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Konrad Holl
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.4
>
>
> For a client project I needed to enable OCR for images inside PDFs. Unfortunately ManifoldCF
> does not provide configuration options to handle this. It would be nice to have these options
> for the Tika content extraction:
>
> 1.	Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2.	Set default language for tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
>
> Tika OCR is based on tesseract, an open source OCR library initially developed by Hewlett-Packard
> and later continued by Google. It is available from https://github.com/tesseract-ocr/tesseract.
> It needs to be installed with the tesseract binary available in the PATH environment variable;
> alternatively, the path to the binary can be set using a Tika API method. Once it is installed
> and Tika is configured correctly, it works like a charm.
>
> When indexing images or PDFs containing images instead of real text, OCR is necessary
> for making those documents searchable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
