jackrabbit-users mailing list archives

From Jukka Zitting <jzitt...@adobe.com>
Subject Re: Tiff extraction question
Date Wed, 09 Mar 2011 10:38:09 GMT

On 03/09/2011 11:19 AM, Eliott wrote:
> During the final phase of a project it came to my attention that TIFF
> files can also store the image and the OCR-ed text in the same file,
> just like PDFs do. Since we have many such files, we have a business
> need to extract text from these TIFFs.
> Has anybody written a text extractor, or does anyone know of a library
> that can get the text layer from these files? Is there any specific
> reason why JR does not support this out of the box?

Jackrabbit uses Apache Tika [1], which contains a parser for TIFF images.
Currently the parser only extracts XMP and EXIF metadata embedded in
TIFFs (and we've disabled it by default in Jackrabbit), but you might
want to check whether you can extend it to also handle such text layers.

[1] http://tika.apache.org/
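
As a possible first step for checking where such a text layer lives, here is
a minimal sketch in plain Java (no Tika dependency) that lists the ASCII
fields and the XMP packet (tag 700) found in the first IFD of a TIFF. It is
only an assumption that the OCR text sits in one of those fields; a scanning
tool may well store it elsewhere, for example in a private tag or a later
IFD. The class name TiffTextTags is hypothetical, and this is not how the
Tika parser or Jackrabbit does it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper (hypothetical name): dumps text-like fields of a TIFF's first IFD. */
public class TiffTextTags {

    private static final int TYPE_BYTE = 1;
    private static final int TYPE_ASCII = 2;
    private static final int TYPE_UNDEFINED = 7;
    private static final int TAG_XMP = 700;

    /** Returns tag number -> text for every ASCII field and the XMP packet in the first IFD. */
    public static Map<Integer, String> readTextTags(byte[] tiff) {
        if (tiff.length < 8) {
            throw new IllegalArgumentException("Too short for a TIFF header");
        }
        ByteBuffer buf = ByteBuffer.wrap(tiff);
        // Bytes 0-1 give the byte order: "II" is little-endian, "MM" is big-endian.
        if (tiff[0] == 'I' && tiff[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (tiff[0] == 'M' && tiff[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("Not a TIFF: bad byte-order mark");
        }
        // Bytes 2-3 hold the magic number 42; bytes 4-7 the offset of the first IFD.
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("Not a TIFF: bad magic number");
        }
        int ifdOffset = buf.getInt(4);
        int entries = buf.getShort(ifdOffset) & 0xFFFF;

        Map<Integer, String> result = new LinkedHashMap<>();
        for (int i = 0; i < entries; i++) {
            // Each IFD entry is 12 bytes: tag(2), type(2), count(4), value or offset(4).
            int entry = ifdOffset + 2 + i * 12;
            int tag = buf.getShort(entry) & 0xFFFF;
            int type = buf.getShort(entry + 2) & 0xFFFF;
            int count = buf.getInt(entry + 4);

            boolean ascii = type == TYPE_ASCII;
            boolean xmp = tag == TAG_XMP && (type == TYPE_BYTE || type == TYPE_UNDEFINED);
            if (!ascii && !xmp) {
                continue; // numeric fields (width, height, strips, ...) are skipped
            }
            // Values of up to 4 bytes are stored inline; longer ones are referenced by offset.
            int valueOffset = count <= 4 ? entry + 8 : buf.getInt(entry + 8);
            String text = new String(tiff, valueOffset, count, StandardCharsets.UTF_8);

            // ASCII fields are NUL-terminated; drop the trailing terminator(s).
            int end = text.length();
            while (end > 0 && text.charAt(end - 1) == '\0') {
                end--;
            }
            result.put(tag, text.substring(0, end));
        }
        return result;
    }
}
```

Calling TiffTextTags.readTextTags(Files.readAllBytes(path)) on a few sample
files and printing the map should show fairly quickly whether the text is
sitting in a standard field such as ImageDescription (tag 270) or in the XMP
packet. If it is not, the remaining IFDs and any private tags would be the
next places to look.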

Jukka Zitting
