manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1481) Some documents cannot be Tika extracted due to classloader problem
Date Tue, 09 Jan 2018 15:46:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318624#comment-16318624 ]

Karl Wright commented on CONNECTORS-1481:
-----------------------------------------

Apparently the jars are in the right place, so this is likely a Tika bug.
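
For anyone trying to reproduce this: the descriptor in the NoSuchMethodError below says the calling code expected HwmfFont.getCharSet() to return HwmfFont$WmfCharset. One way to see what the runtime actually has is to ask the loaded class where it came from and what its method returns. The following is only a minimal sketch, not something from the ManifoldCF or Tika code base. The class name ClassOriginCheck and both helper methods are hypothetical, and the result is only meaningful if the check runs under the same classloader that loads the Tika extractor.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of ManifoldCF, Tika, or POI.
public class ClassOriginCheck {

    // Reports where a class was loaded from (typically a jar URL).
    // Classes from the bootstrap loader have no code source.
    static String originOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "<bootstrap or unknown>";
        }
        return src.getLocation().toString();
    }

    // Reports the return type of a public no-argument method on the named
    // class, as seen by the classloader of this helper class.
    static String returnTypeOf(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(methodName) && m.getParameterCount() == 0) {
                    return m.getReturnType().getName();
                }
            }
            return "<method not found>";
        } catch (ClassNotFoundException | LinkageError e) {
            return "<class not found>";
        }
    }

    public static void main(String[] args) {
        // Class and method names are taken from the stack trace in this issue.
        String className = "org.apache.poi.hwmf.record.HwmfFont";
        System.out.println("getCharSet() returns: " + returnTypeOf(className, "getCharSet"));
        try {
            System.out.println("loaded from: " + originOf(Class.forName(className)));
        } catch (ClassNotFoundException | LinkageError e) {
            System.out.println("loaded from: <class not found>");
        }
    }
}
```

If the reported return type is anything other than org.apache.poi.hwmf.record.HwmfFont$WmfCharset, the POI jar visible at runtime does not match the one the Tika parsers were compiled against, regardless of where the jars sit on disk.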

> Some documents cannot be Tika extracted due to classloader problem
> ------------------------------------------------------------------
>
>                 Key: CONNECTORS-1481
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1481
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.9
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.10
>
>
> Here's the exception:
> {code}
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> {code}
> This may or may not be addressed by Tika 1.17, but nobody has tried it yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)