manifoldcf-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1317) Hang crawling job on some ZIP documents
Date Sat, 21 May 2016 05:34:12 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294742#comment-15294742
] 

Karl Wright commented on CONNECTORS-1317:
-----------------------------------------

Generally the right way to debug this is to use the manifoldcf log.  It looks like you have
done that.

One version of Tika had license issues with some of its dependencies, so we were unable to
include them, which would have prevented extraction from specific file formats.  We have since
upgraded trunk to the latest Tika which does not have this issue.  I will verify that the
problem is fixed there.


> Hang crawling job on some ZIP documents
> ---------------------------------------
>
>                 Key: CONNECTORS-1317
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1317
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.3
>         Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
>            Reporter: Mr.Keuz
>            Assignee: Karl Wright
>
> I use ManifolCF as file crawler. But I found, that crawling process hangs on some zip
files. Although some files parsing normally. 
> Steps: 
> 1. Run ManfoldCF by  "example/start.sh" and Posgres as DB
> 2. Create manifold pipeline: File -> Tika -> Solr
> 3. Put zip file in folder (in attach below)
> 4. Run job
> Here zip file that should reproduce bug: 
> "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
> https://yadi.sk/d/0uSdrR5GrsgmG 
> Note:
> As I investigated (by strace) - crawler process tries to open and parse same zip file
again and again (it seems from different workers threads). And It seems that document not
removes from queue.
> I am newbie in ManifoldCF, so it is hard task to me to find problem in source code.
> I can send some additional info if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message