manifoldcf-user mailing list archives

From msaunier <msaun...@citya.com>
Subject RE: Document connector excluding mime type and size - Tika Parser error
Date Thu, 11 Jan 2018 17:10:45 GMT
 

Ok. I'll confirm that tomorrow.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 18:09
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

No Tika error is good, but have a look at Simple History to be sure documents were actually
processed.  If you can confirm that, I'll kick off the patch process.

 

Karl

 

 

On Thu, Jan 11, 2018 at 11:26 AM, msaunier <msaunier@citya.com> wrote:

Ok. So.

 

With the same configuration but Tika 1.17:

·        No Tika error

·        But no documents are sent to Solr. I don't understand why; I am investigating.

 

 

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Thursday, January 11, 2018 15:32
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

 

The crawl is running at the moment. I think it will be finished in 30 minutes.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, January 11, 2018 15:05
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Did this work for you?

Karl

 

On Thu, Jan 11, 2018 at 6:36 AM, Karl Wright <daddywri@gmail.com> wrote:

If you need the jcifs connector, run "ant make-deps" too.  Then run "ant build" again.
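
Before running the extra target, it can help to confirm that the jar is where the build looks for it. The following is only a sketch: the function name check_jcifs_jar is hypothetical, the path is the one quoted elsewhere in this thread, and the ant targets are those named in the messages.

```shell
#!/bin/sh
# check_jcifs_jar SRC_ROOT (hypothetical helper, not part of ManifoldCF)
# Succeeds only if jcifs.jar sits in connectors/jcifs/lib-proprietary
# under the given source root (the location quoted in this thread).
check_jcifs_jar() {
  src_root="${1:-.}"
  jar="$src_root/connectors/jcifs/lib-proprietary/jcifs.jar"
  if [ -f "$jar" ]; then
    echo "found: $jar"
  else
    echo "missing: $jar" >&2
    return 1
  fi
}

# Possible use from the top of the source tree (not run automatically here):
#   check_jcifs_jar . && ant make-core-deps && ant make-deps && ant build
```

The check is kept separate from the build so that a missing jar is reported before any ant target runs.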

 

Karl

 

On Thu, Jan 11, 2018 at 4:30 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

 

I have built and configured it, but the WindowsShare connector does not appear in the list of
repository connectors.

·        I added jcifs.jar to the connectors/jcifs/lib-proprietary directory

·        I ran ant make-core-deps

·        I ran ant build

·        I uncommented the Windows share entry in the connectors-proprietary.xml file in the dist folder

·        I added jcifs.jar to connector-lib-proprietary

But the connector is still not offered in the ManifoldCF interface.

Any ideas?

Thanks.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Wednesday, January 10, 2018 18:15
To: user@manifoldcf.apache.org
Subject: RE: Document connector excluding mime type and size - Tika Parser error

 

Good!

I will configure and test that.

I will report back as soon as the crawl has finished.

400k documents.

If it works, I will test on a few million documents.

Thanks.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:45
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

The build you should be using is the ant build.  Do not use the maven build for this purpose.

 

- Check out trunk:

 

svn co https://svn.apache.org/repos/asf/manifoldcf/trunk

 

- Download dependencies:

 

ant make-core-deps

 

- Build:

 

ant build

 

- Your deliverable is in the "dist" directory
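
The steps above can be strung together in one small script. This is a sketch under stated assumptions, not an official build script: the function names run and build_mcf_trunk, the DRY_RUN switch, and the default directory name are hypothetical; the svn URL and ant targets are the ones given in the message.

```shell
#!/bin/sh
# run CMD...: execute a command, or only print it when DRY_RUN=1 (hypothetical switch)
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# build_mcf_trunk [WORKDIR]: check out trunk into WORKDIR and build it with ant.
build_mcf_trunk() {
  workdir="${1:-trunk}"   # hypothetical default directory name
  run svn co https://svn.apache.org/repos/asf/manifoldcf/trunk "$workdir" || return 1
  (
    # In dry-run mode the checkout does not exist, so skip the cd.
    [ "${DRY_RUN:-0}" = "1" ] || cd "$workdir" || exit 1
    run ant make-core-deps && run ant build
  ) || return 1
  echo "deliverable: $workdir/dist"
}

# Possible use (not run automatically here):
#   DRY_RUN=1 build_mcf_trunk mcf-trunk   # print the commands only
#   build_mcf_trunk mcf-trunk             # really check out and build
```

The ant steps run in a subshell so that the caller's working directory is left unchanged.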

 

Karl

 

 

On Wed, Jan 10, 2018 at 11:37 AM, msaunier <msaunier@citya.com> wrote:

I get an error with the Maven build, so I tested with an external Tika 1.17 server, but POI is
not included. If you manage to run mvn package successfully with Tika 1.17, I am interested.

Today, I have not had much time to deal with it.

I found some bugs that I will report tomorrow if they have not been reported already. They
concern log4j2, local_fr, and a bug with the web interface and the keyboard input key.

I am continuing my investigation.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, January 10, 2018 17:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Any news?

Karl

 

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <daddywri@gmail.com> wrote:

Let me know what happens.
If it works for you, I'll see if we can put together a patch release of 2.9 with the fix.

 

Karl

 

 

On Tue, Jan 9, 2018 at 11:07 AM, msaunier <msaunier@citya.com> wrote:

You mean checking out and building with POI 3.17 and Tika 1.17?

That's possible.

I will finish a project first, then test that.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 16:57
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

So here's the problem: we used POI 3.17 with Tika 1.16 in 2.9, in order to deal with the classloader
issue present in POI 3.15, and because POI 3.16 has a severe security issue that made it impossible
to ship with.

Unfortunately that doesn't quite work; POI 3.17 is not completely backwards compatible with 3.16,
and therefore problems occur with this combination.

 

The probable solution is to check out and build trunk and see if that works for you.  It very
well might.  The question then is what to do next, because we are not scheduled to release
again until April.  We might have to do a point release to deal with this.

 

Please give it a try and let me know what happens.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, never mind that last email.  We patched it in part in 2.9 by including the latest POI.
 So clearly it's still an existing problem in POI.  I'll have to open a ticket there and await
a patch from them.

 

Karl

 

On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <daddywri@gmail.com> wrote:

This screenshot cannot be MCF 2.9 since the version of poi was not 3.17 for the 2.9 release.

 

Karl

 

 

On Tue, Jan 9, 2018 at 10:02 AM, msaunier <msaunier@citya.com> wrote:

The two versions of ManifoldCF (2.8.1 and 2.9) are on two different servers.

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:54
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

As for the Tika issue, we explicitly tested documents of that type when rolling out 2.8.1.
 When we updated 2.8.1 to a new Tika in 2.9 I believe we also tested this.

 

One of the potential issues is that if you are dropping down different versions of ManifoldCF
into the same directories you *might* have a poi* jar in the wrong place because of the way
we had to do the patch.  Please have a look at where the poi* jars are in your directory structure;
they should all be in one directory (connector-common-lib).  If you see any anywhere else,
that's the cause of the issue.
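
One way to carry out that check is a small find wrapper. This is a minimal sketch: the function name find_stray_poi_jars is hypothetical, and it assumes the installation root is passed as the first argument.

```shell
#!/bin/sh
# find_stray_poi_jars MCF_HOME (hypothetical helper)
# Lists every poi*.jar under MCF_HOME that is NOT inside a
# connector-common-lib directory. Empty output means nothing is misplaced.
find_stray_poi_jars() {
  mcf_home="${1:-.}"
  find "$mcf_home" -type f -name 'poi*.jar' ! -path '*/connector-common-lib/*' | sort
}

# Possible use (not run automatically here):
#   find_stray_poi_jars /path/to/your/manifoldcf/dist
```

Any path it prints is a candidate for the misplaced jar described above.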

 

Karl

 

 

On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <daddywri@gmail.com> wrote:

Since the Tika extractor essentially filters out the content mime type (other than presenting
it as metadata), you need to put an "allowed documents" transformation connection into your
job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector *unless* you
are using the extracting update handler.  That should resolve the one problem for you.

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <msaunier@citya.com> wrote:

The documents that fail in Tika are:

·        Microsoft Word 97-2003

·        application/msword

I can't get more information: they are on SCO servers, and SCO does not have the ls -lisan
or stat commands.

For the Solr connection, I believe I emptied the index (ManifoldCF and Solr) before the last
indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:26
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When you changed these
fields in the output connection, had you already indexed any documents?  Those would only
get cleaned up if you did a subsequent full crawl, after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

If you let me know what kind of file they are (extension and what application created them)
that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <msaunier@citya.com> wrote:

Okay, good. I will see whether I can test Tika 1.17.

I can't send you a document that triggers this error; they are private. Sorry.

 

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 15:12
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <daddywri@gmail.com> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika (1.17) which might
(or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate it if you attached one document that causes
the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <msaunier@citya.com> wrote:

It's version 2.9.

I have a 2.8.1 on another server with the same job and the same documents. I will test on that
server and report back.

 

Thanks for your help.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, January 9, 2018 13:15
To: user@manifoldcf.apache.org
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put various poi jars
at root level in order to work around a Tika problem.  That patch may not have been entirely
correct in that it looks like it may have blocked access by one of the deeper jars to a higher
level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <daddywri@gmail.com> wrote:

What version of MCF is this?  That's important to know since Tika has had problems with this
kind of thing in the past and this looks like something similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an internal tika
classloader.  But I need to know whether this is a current bug or not, since we just went
to a new Tika version.

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <msaunier@citya.com> wrote:

Hello Karl,

I hope you are well today.

 

I have 2 problems with ManifoldCF.

 

-----------

In *Output connections*, with the Solr connector: I have set a "Maximum document length" and
excluded 5 mime types, but it does not work. I have attached a screenshot.

----------

Second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create an incident ticket?

 

----------

 

Thanks for your help.

 

Best regards,

 



 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

