manifoldcf-dev mailing list archives

From "Mr.Keuz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1317) Hang crawling job on some ZIP documents
Date Sat, 21 May 2016 06:03:12 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294754#comment-15294754 ]

Mr.Keuz commented on CONNECTORS-1317:
-------------------------------------

One more important point about this case.

The job consumes CPU while it hangs. It tries to re-parse the failed document again and again,
so the document is marked neither as failed nor as skipped.

It would be great if the job stopped after parsing all valid documents,
marked the invalid documents, and showed a helpful message (if that is possible, of course).

Also, if I understood correctly, the job will hang in the same way on any other "unknown" exception.
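
To illustrate the behavior I am asking for, here is a minimal sketch in plain Java. It is not ManifoldCF code: the class BoundedRetryCrawl, the Status enum, the crawl method, and the maxAttempts parameter are all hypothetical names made up for this example. It only shows the idea of giving up on a document after a fixed number of failed parse attempts and recording it as failed, so that the remaining documents are still processed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names are ManifoldCF APIs. */
class BoundedRetryCrawl {
    enum Status { INDEXED, FAILED }

    /**
     * Parses each document at most maxAttempts times.
     * A document that keeps throwing is recorded as FAILED instead of
     * being retried forever, and the loop moves on to the next document.
     */
    static Map<String, Status> crawl(List<String> docs,
                                     Consumer<String> parser,
                                     int maxAttempts) {
        Map<String, Status> result = new LinkedHashMap<>();
        for (String doc : docs) {
            Status status = Status.FAILED;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    parser.accept(doc);
                    status = Status.INDEXED;
                    break; // parsed successfully, stop retrying
                } catch (RuntimeException e) {
                    // Any unexpected ("unknown") exception counts as one failed attempt.
                    System.err.println("Attempt " + attempt + " failed for "
                            + doc + ": " + e.getMessage());
                }
            }
            result.put(doc, status);
        }
        return result;
    }
}
```

With a parser that always throws on one file, that file ends up with status FAILED after maxAttempts tries, and the other files are still reported as INDEXED.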



> Hang crawling job on some ZIP documents
> ---------------------------------------
>
>                 Key: CONNECTORS-1317
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1317
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.3
>         Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
>            Reporter: Mr.Keuz
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
>
> I use ManifoldCF as a file crawler, but I found that the crawling process hangs on some zip
> files, although other files are parsed normally.
> Steps: 
> 1. Run ManifoldCF with "example/start.sh", using Postgres as the DB
> 2. Create a ManifoldCF pipeline: File -> Tika -> Solr
> 3. Put the zip file (linked below) in the crawled folder
> 4. Run the job
> Here is the zip file that should reproduce the bug:
> "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
> https://yadi.sk/d/0uSdrR5GrsgmG 
> Note:
> As far as I could tell (using strace), the crawler process tries to open and parse the same
> zip file again and again (apparently from different worker threads), and the document does
> not seem to be removed from the queue.
> I am a newbie with ManifoldCF, so it is hard for me to find the problem in the source code.
> I can send additional information if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
