manifoldcf-dev mailing list archives

From "Mr.Keuz (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1317) Hang crawling on some ZIP documents
Date Sat, 21 May 2016 02:06:12 GMT


Mr.Keuz commented on CONNECTORS-1317:

It seems I have found the problem. It is caused by missing dependencies:


Without these libraries I get the following stack trace:

FATAL 2016-05-21 04:55:16,633 (Worker thread '41') - Error tossed: org/jdom/input/JDOMParseException
java.lang.NoClassDefFoundError: org/jdom/input/JDOMParseException
	at org.apache.tika.parser.feed.FeedParser.parse(
	at org.apache.tika.parser.CompositeParser.parse(
	at org.apache.tika.parser.CompositeParser.parse(
	at org.apache.tika.parser.AutoDetectParser.parse(
	at org.apache.tika.parser.DelegatingParser.parse(
	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(
	at org.apache.tika.parser.pkg.PackageParser.parseEntry(
	at org.apache.tika.parser.pkg.PackageParser.parse(
	at org.apache.tika.parser.CompositeParser.parse(
	at org.apache.tika.parser.CompositeParser.parse(
	at org.apache.tika.parser.AutoDetectParser.parse(
	at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(
	at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(
	at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(
Caused by: java.lang.ClassNotFoundException: org.jdom.input.JDOMParseException
	at java.lang.ClassLoader.loadClass(
	at sun.misc.Launcher$AppClassLoader.loadClass(
	at java.lang.ClassLoader.loadClass(
	... 23 more
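
The trace above shows a NoClassDefFoundError caused by a ClassNotFoundException, which means the class was not visible to the class loader at runtime. A minimal sketch for checking this before running a job is shown below. The class name `ClasspathCheck` and its `isLoadable` helper are hypothetical names made up for illustration; only the checked class name, org.jdom.input.JDOMParseException, comes from the trace. It has to be run with the same classpath as the crawler process to give a meaningful answer.

```java
public class ClasspathCheck {

    /**
     * Returns true if the named class can be located by the class loader
     * that loaded this helper. (Hypothetical helper, for illustration only.)
     */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class,
            // do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class itself is absent.
            // LinkageError (e.g. NoClassDefFoundError): something it depends on is absent.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace; other names can be passed as arguments.
        String[] names = args.length > 0
                ? args
                : new String[] {"org.jdom.input.JDOMParseException"};
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

If the class is reported as MISSING, the jar that provides it is not on the classpath used for the check.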

> Hang crawling on some ZIP documents
> -----------------------------------
>                 Key: CONNECTORS-1317
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.3
>         Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
>            Reporter: Mr.Keuz
> I use ManifoldCF as a file crawler, but I found that the crawling process hangs on some ZIP files, although other files are parsed normally.
> Steps: 
> 1. Run ManifoldCF via "example/" with Postgres as the DB
> 2. Create a ManifoldCF pipeline: File -> Tika -> Solr
> 3. Put the ZIP file (see the attachment below) in the folder
> 4. Run the job
> Here is a ZIP file that should reproduce the bug:
> ""
> Note:
> As far as I could tell from strace, the crawler process tries to open and parse the same ZIP file again and again (apparently from different worker threads), and the document does not seem to be removed from the queue.
> I am a newbie in ManifoldCF, so it is hard for me to find the problem in the source code.
> I can send additional info if needed.

This message was sent by Atlassian JIRA
