manifoldcf-dev mailing list archives

From "Mingchun Zhao (JIRA)" <>
Subject [jira] [Reopened] (CONNECTORS-1079) the parsing in TikaExtractor always return empty result
Date Sat, 25 Oct 2014 17:04:34 GMT


Mingchun Zhao reopened CONNECTORS-1079:

Hi Karl,

Thank you for your help. I've tried your fix.
Unfortunately, the symptom still occurs even when we have a copy of tika-core.jar in both the lib and
connector-lib directories.
It looks like the two identical jars cause a jar conflict.
I tried to use a ClassLoader to fix it, but eventually gave up because that makes things more
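
One way to check the suspected conflict is to print which class loader and which jar a class actually comes from. The following is only a minimal sketch and not part of ManifoldCF: the class name ClassOrigin and the method describe are hypothetical, and org.apache.tika.Tika is used as the default class to inspect on the assumption that it is the class in question.

```java
import java.net.URL;
import java.security.CodeSource;

public class ClassOrigin {
    // Hypothetical helper: reports the defining class loader and the
    // code source (usually a jar or directory URL) of the given class.
    static String describe(Class<?> cls) {
        ClassLoader loader = cls.getClassLoader();
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        return cls.getName()
            + " loader=" + (loader == null ? "bootstrap" : loader.getClass().getName())
            + " location=" + (location == null ? "unknown" : location.toString());
    }

    public static void main(String[] args) {
        // Assumption: the class of interest is the Tika facade.
        String name = args.length > 0 ? args[0] : "org.apache.tika.Tika";
        try {
            // Load without initializing, using the loader that loaded this class.
            Class<?> cls = Class.forName(name, false, ClassOrigin.class.getClassLoader());
            System.out.println(describe(cls));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not visible from this class loader");
        }
    }
}
```

If describe is called once from core code and once from connector code, and the two calls report different loaders or locations for the same class name, that would be consistent with the two copies of the jar being treated as different classes.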

Could you please confirm my suggestions below:

1. Remove tika-core.jar from the lib directory (need to modify build.xml?)

2. Call Tika().detect directly to get the MimeType instead of calling ExtensionMimeMap.mapToMimeType.
The related connectors are as below (4 files):

3. Delete the unused ExtensionMimeMap class, which contains just one method that calls Tika().detect
to get the MimeType.


> the parsing in TikaExtractor always return empty result
> -------------------------------------------------------
>                 Key: CONNECTORS-1079
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.0
>            Reporter: Mingchun Zhao
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8, ManifoldCF 2.0
> When I use the latest trunk source (2.0) to try the Tika content extractor, it does not return
> any of the expected results.
> I looked at it with debugging tools and found that the parser of the Tika content extractor
> does not return any data.
> I tried moving lib/tika-core-1.6.jar into connector-lib/,
> and then the Tika content extractor returned data as expected.
> My configurations are as below:
> ==
> Transformation:
>  Type: Tika content extractor
> Output:
>  Type: Solr (Use extract update handler=false)
> Repository:
>  type: Web
> Job:
>  1.type: repository
>  2.type: transformation
>  3.type: output
> ==
> Maybe it is related to CONNECTORS-1074(?).
> It looks like the location of tika-core-1.6.jar affects the result of TikaExtractor.
