manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Document connector excluding mime type and size - Tika Parser error
Date Tue, 09 Jan 2018 14:43:57 GMT
Since the Tika extractor essentially filters out the content mime type
(other than presenting it as metadata), you need to put an "allowed
documents" transformation connection into your job pipeline BEFORE the Tika
connector:

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

In fact, mime type exclusion is actually disabled in the Solr output
connector *unless* you are using the extracting update handler.  That
should resolve the one problem for you.
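
To illustrate why the filter has to sit before the extractor, here is a minimal Java sketch of an allow-list check on mime type and length. It is not the ManifoldCF API: the class name `AllowedDocumentsSketch`, the allowed types, and the 10 MB limit are all hypothetical. The point is only that the decision uses the original mime type and size, which are no longer available for filtering once extraction has replaced the content with plain text.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not a ManifoldCF class.
class AllowedDocumentsSketch {
    // Assumed allow-list and size limit, chosen for the example.
    static final Set<String> ALLOWED_MIME_TYPES =
        Set.of("application/msword", "application/pdf", "text/plain");
    static final long MAX_LENGTH_BYTES = 10L * 1024 * 1024;

    // Returns true when the document may continue down the pipeline.
    static boolean isAllowed(String mimeType, long lengthBytes) {
        if (mimeType == null) {
            return false;
        }
        String normalized = mimeType.toLowerCase(Locale.ROOT).trim();
        // Drop parameters such as "; charset=utf-8" before comparing.
        int semicolon = normalized.indexOf(';');
        if (semicolon >= 0) {
            normalized = normalized.substring(0, semicolon).trim();
        }
        return ALLOWED_MIME_TYPES.contains(normalized)
            && lengthBytes <= MAX_LENGTH_BYTES;
    }

    public static void main(String[] args) {
        // The check runs on the original document, before any extraction step.
        System.out.println(isAllowed("application/msword", 2048L));
        System.out.println(isAllowed("application/zip", 2048L));
    }
}
```

In a real job the equivalent settings are made in the "allowed documents" transformation connection, not in code.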

Thanks,
Karl
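
For the NoSuchMethodError reported earlier in this thread, one way to test the "missing jar or classloader problem" hypothesis is to ask the JVM where a class was loaded from and what a given no-argument method returns. The sketch below is a generic diagnostic, not part of ManifoldCF or Tika; the class name `ClassOriginCheck` is hypothetical, and it defaults to `java.lang.String` so that it runs without POI on the classpath. To inspect the real case you would pass `org.apache.poi.hwmf.record.HwmfFont` and `getCharSet` as arguments, on a classpath that matches the ManifoldCF one as closely as possible; a standalone run can only approximate the connector's classloader.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic; not a ManifoldCF or Tika class.
class ClassOriginCheck {
    // Location (jar or directory) the class was loaded from, if known.
    static String originOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    // Return type name of the public no-argument method, or null if absent.
    // A NoSuchMethodError can also occur when the method exists but its
    // return type differs from the one the caller was compiled against.
    static String noArgReturnType(Class<?> cls, String methodName) {
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName) && m.getParameterCount() == 0) {
                return m.getReturnType().getName();
            }
        }
        return null;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "length";
        Class<?> cls = Class.forName(className);
        System.out.println("Loaded from: " + originOf(cls));
        System.out.println("Return type of " + methodName + "(): "
            + noArgReturnType(cls, methodName));
    }
}
```

If the reported jar is an older POI than the one Tika expects, or the return type does not match the descriptor in the error message, that points to a version conflict on the classpath.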


On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

> The documents that fail in Tika are:
>
> ·        Microsoft Word 97-2003
>
> ·        Application/msword
>
>
>
> I can't get more information: they are on SCO servers, and SCO does not
> have the ls -lisan or stat commands.
>
>
>
> For the Solr connection, I believe I emptied the index (both ManifoldCF
> and Solr) before the last indexing run. I will do it again to be sure.
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:26
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1482 is for the Solr connector filtering issue.  A question:
> When you changed these fields in the output connection, had you already
> indexed any documents?  Those would only get cleaned up if you did a
> subsequent full crawl, after you made the connection change.
>
>
>
> Karl
>
>
>
>
>
>
>
> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> If you let me know what kind of file they are (extension and what
> application created them) that is probably good enough.
>
> Karl
>
>
>
> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:
>
> Okay, good. I will see whether I can test Tika version 1.17.
>
> I can't share a document that triggers this error; they are private. Sorry.
>
>
>
> If I encounter the error again on a non-private document, I'll come back
> to you.
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 15:12
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> CONNECTORS-1481 is the ticket for the Tika problem.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> Ok, if you are in a position to build trunk, that's a newer version of
> Tika (1.17) which might (or might not) address this problem.
>
>
>
> If you could create a ticket and attach one document that causes the
> failure, I'd greatly appreciate it.
>
>
>
> Thanks!
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:
>
> It's version 2.9.
>
> I have 2.8.1 on another server with the same job and the same documents.
> I will test on that server and report back.
>
>
>
> Thanks for your help.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, January 9, 2018 13:15
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Document connector excluding mime type and size - Tika
> Parser error
>
>
>
> I looked at the history of this.  We had to release a patch (2.8.1) that
> put various poi jars at root level in order to work around a Tika problem.
> That patch may not have been entirely correct in that it looks like it may
> have blocked access by one of the deeper jars to a higher level.
>
>
>
> Release 2.9 should fix this if I am correct.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> What version of MCF is this?  That's important to know since Tika has had
> problems with this kind of thing in the past and this looks like something
> similar.
>
>
>
> The problem you are reporting is due to either a missing jar, or a bug in
> an internal tika classloader.  But I need to know whether this is a current
> bug or not, since we just went to a new Tika version.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:
>
> Hello Karl,
>
> I hope you are well today.
>
>
>
> I have 2 problems with ManifoldCF.
>
>
>
> -----------
>
> In **Output connections**, with the Solr connector, I have set a "Maximum
> document length" and excluded 5 mime types, but neither setting works. I
> have attached a screenshot.
>
>
>
> ----------
>
> Second, I have a **Tika exception** in ManifoldCF. 3 documents are
> blocked:
>
>
>
> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>
>
>
> Do I need to create an incident ticket?
>
>
>
> ----------
>
>
>
> Thanks for your help.
>
>
>
> Best regards,
>
>
>