lucene-solr-user mailing list archives

From lala <>
Subject Solr dih extract text from inline images in pdf
Date Tue, 06 Mar 2018 08:36:27 GMT

I am working with Solr 7, indexing multilingual files from a folder using
DIH (FileListEntityProcessor for the parent entity and TikaEntityProcessor
for the child entity in the configuration file).

My problem is this: I want to extract text from images embedded in PDF
files. That works fine with the /update/extract request handler, where I set
the "parseContext.config" attribute to an XML file, let's say "context.xml",
in which I set the "extractInlineImages" property of the [PDFParserConfig]
entry to true. But I have no idea how to set parseContext.config in the DIH
configuration.
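
For reference, the working /update/extract setup looks roughly like the
sketch below. It assumes the entry format shown in the Solr Reference Guide
for Solr Cell; the file name "context.xml" is just my example name, and the
handler class is written out in full here as an assumption about a typical
solrconfig.xml.

```xml
<!-- context.xml: parse context read by the ExtractingRequestHandler -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- ask the PDF parser to hand embedded images to the image parsers -->
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

```xml
<!-- solrconfig.xml: point the extract handler at the file above -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <str name="parseContext.config">context.xml</str>
</requestHandler>
```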

I tried these approaches; none of them worked:

    - Setting the tikaConfig attribute in the DIH config file to my
"context.xml". Obviously this won't work, since a Tika config file has a
different format.
    - Setting the parseContext.config attribute on my "/dataimport"
requestHandler. This didn't work either.
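
To make the DIH side concrete, here is a minimal sketch of the kind of
data-config file I mean. The baseDir path, the fileName pattern, and the
target field names ("id", "content") are placeholders for illustration, not
my real values. It shows only the baseline configuration, which indexes the
PDF text but not the text inside inline images.

```xml
<dataConfig>
  <!-- binary data source that TikaEntityProcessor reads files through -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- parent entity: lists the files in the folder (placeholder path) -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/docs" fileName=".*\.pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <field column="fileAbsolutePath" name="id"/>
      <!-- child entity: parses each listed file with Tika.
           The tikaConfig attribute would go on this entity; pointing it
           at context.xml is the first attempt above that did not work. -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```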

I googled a lot with no result. I would really appreciate any help here!
