lucene-solr-user mailing list archives

From Rahul Singh
Subject Re: Text in images are not extracted and indexed to content
Date Tue, 10 Apr 2018 10:41:53 GMT
You may need to extract the text outside Solr and index the plain text with an external
ingestion process. That gives you much more control over Tika's settings and behavior.
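
As a rough sketch of what such an external ingestion process could look like, the Python below sends a file to a standalone Tika Server and posts the returned plain text to Solr's JSON update handler. It rests on several assumptions that are not from this thread: a Tika Server listening on localhost:9998, Tesseract installed on that machine so Tika can OCR images, and a hypothetical collection named "mycollection". The "content" field name is taken from the original question; the helper names are made up for illustration.

```python
import json
import sys
import urllib.request
from pathlib import Path

# Assumed endpoints; adjust to your own setup.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def extract_text(path, tika_url=TIKA_URL):
    """PUT the raw file to Tika Server and return the extracted plain text."""
    request = urllib.request.Request(
        tika_url,
        data=Path(path).read_bytes(),
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def build_doc(doc_id, text):
    """Build a Solr document holding the extracted text in the content field."""
    return {"id": doc_id, "content": text.strip()}


def index_docs(docs, solr_url=SOLR_UPDATE_URL):
    """POST a list of documents to Solr's JSON update handler."""
    request = urllib.request.Request(
        solr_url,
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Usage: python ingest.py image1.jpg image2.bmp
    documents = [build_doc(name, extract_text(name)) for name in sys.argv[1:]]
    if documents:
        index_docs(documents)
```

Because extraction happens before Solr sees the document, you can inspect or log the text Tika returns for each image and confirm whether OCR is producing anything at all.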

Rahul Singh

Anant Corporation

On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo wrote:
> Hi,
> I am currently facing an issue where the text in image files such as jpg and bmp
> is not being extracted and indexed. After indexing, Tika did
> extract all the metadata and index it under the attr_* fields.
> However, the content field is always empty for image files. For other types
> of document files like .doc, the content is extracted correctly.
> I have already updated tika-parsers-1.17.jar, setting extractInlineImages to true
> under \prg\apache\tika\parser\pdf\.
> What could be the reason?
> I have just upgraded to Solr 7.3.0.
> Regards,
> Edwin
