lucene-solr-user mailing list archives

From Rahul Singh <rahul.xavier.si...@gmail.com>
Subject Re: Text in images are not extracted and indexed to content
Date Tue, 10 Apr 2018 10:41:53 GMT
You may need to extract the text outside Solr and index the plain text through an external
ingestion process. That gives you much more control over Tika's attributes and behavior.
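
As a rough illustration of that approach, here is a minimal Python sketch that assumes a
standalone Tika Server on its default port (9998) with Tesseract installed for OCR, and a
Solr collection that has "id" and "content" fields. The collection name "mycollection", the
URLs, and the helper function names are placeholders for illustration, not anything taken
from your setup.

```python
import json
import urllib.request

# Assumed endpoints; adjust host, port, and collection name to your deployment.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def build_tika_request(data, ocr_language="eng"):
    """Build a PUT request asking Tika Server for the plain text of a file.

    OCR only happens if Tesseract is installed where Tika Server runs.
    """
    return urllib.request.Request(
        TIKA_URL,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            "X-Tika-OCRLanguage": ocr_language,
        },
    )


def build_solr_request(doc_id, text):
    """Build a POST request that adds one document to Solr as JSON."""
    payload = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(path):
    """Extract text from one file with Tika, then index it into Solr."""
    with open(path, "rb") as f:
        data = f.read()
    with urllib.request.urlopen(build_tika_request(data)) as resp:
        text = resp.read().decode("utf-8")
    # The file path is used as the document id purely for illustration.
    with urllib.request.urlopen(build_solr_request(path, text)) as resp:
        return resp.status
```

Calling extract_and_index("scan.jpg") would send the image to Tika and post the returned
text to Solr. Because the extraction step is separate, you can check whether Tika returned
any text for an image before anything reaches the index.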

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>, wrote:
> Hi,
>
> I am currently facing an issue where the text in image files such as JPG and BMP
> is not being extracted and indexed. After indexing, Tika did
> extract all the metadata and index it under the attr_* fields.
> However, the content field is always empty for image files. For other types
> of document files, such as .doc, the content is extracted correctly.
>
> I have already updated tika-parsers-1.17.jar, setting extractInlineImages to true under
> \prg\apache\tika\parser\pdf\.
>
>
> What could be the reason?
>
> I have just upgraded to Solr 7.3.0.
>
> Regards,
> Edwin
