manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1481) Some documents cannot be Tika extracted due to classloader problem
Date Tue, 09 Jan 2018 16:57:01 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318743#comment-16318743 ]

Karl Wright commented on CONNECTORS-1481:
-----------------------------------------

So...  It appears that the issue might be a mismatch between the version of POI we included
in 2.9 (1.17), and the version of Tika that we shipped (1.16).  We could not ship the version
of POI that was compatible with Tika 1.16 because that POI version had a major security issue with XML XSS
injection.  We could technically have gone with Tika 1.17, though, since it was released in
September, but we overlooked that, unfortunately.

The probable solution: a point release that includes an update to Tika 1.17, with no other
code changes.  That would be this svn version:
r1820296

Also we probably want the fix for CONNECTORS-1478 as well:
r1818722
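
For anyone who wants to confirm the mismatch on their own installation before the point release, here is a minimal diagnostic sketch. It relies on the fact that the JVM links a method call by name and full descriptor, return type included, so a NoSuchMethodError like the one in the trace quoted below can occur even when a method of that name exists but returns a different type. The class and helper names (LinkageCheck, returnTypeOf, loadedFrom) are made up for illustration and are not part of ManifoldCF, Tika, or POI; only the POI class name, method name, and expected return type are taken from the trace.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.Optional;

// Hypothetical helper, not part of ManifoldCF, Tika, or POI.
final class LinkageCheck {

    // Return type name of a public no-arg method, or empty if the
    // class or the method cannot be found on the current classpath.
    static Optional<String> returnTypeOf(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className, false, LinkageCheck.class.getClassLoader());
            Method m = cls.getMethod(methodName);
            return Optional.of(m.getReturnType().getName());
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return Optional.empty();
        }
    }

    // Location (usually a jar URL) the class was loaded from, or empty if
    // the class is missing or has no code source (e.g. JDK classes).
    static Optional<String> loadedFrom(String className) {
        try {
            Class<?> cls = Class.forName(className, false, LinkageCheck.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return Optional.empty();
            }
            return Optional.of(src.getLocation().toString());
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Names taken from the NoSuchMethodError in the stack trace.
        String poiClass = "org.apache.poi.hwmf.record.HwmfFont";
        String expected = "org.apache.poi.hwmf.record.HwmfFont$WmfCharset";

        String actual = returnTypeOf(poiClass, "getCharSet").orElse("<class or method not found>");
        System.out.println("getCharSet() returns: " + actual);
        System.out.println("Tika expects:         " + expected);
        System.out.println("POI loaded from:      " + loadedFrom(poiClass).orElse("<unknown>"));
    }
}
```

This only tells you something if it is run with the same jars on the classpath that the Tika extractor sees. If the reported return type differs from the expected one, the POI jar on the classpath is not the one that Tika build was compiled against.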



> Some documents cannot be Tika extracted due to classloader problem
> ------------------------------------------------------------------
>
>                 Key: CONNECTORS-1481
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1481
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.9
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.10
>
>
> Here's the exception:
> {code}
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> {code}
> This may or may not be addressed by Tika 1.17 but nobody has tried it yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
