manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1317) Hang crawling job on some ZIP documents
Date Sat, 21 May 2016 06:21:12 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294767#comment-15294767 ]

Karl Wright commented on CONNECTORS-1317:
-----------------------------------------

Hi Mr. Keuz,

ManifoldCF's many threads do not recognize this particular exception.  It is part of MCF's
design that when something goes wrong and a thread cannot tell what it is, that thread is
restarted rather than leaving the crawler in a bad state.  That is why you see this kind
of behavior.

For this reason, it is critical in the ManifoldCF world for individual connectors to
characterize the kinds of exceptions that they throw.  But an unexpected exception (as
this one is) is, by definition, one that the connector cannot characterize properly.  If
an exception *were* expected, one would have to ask why the actual problem is not fixed
instead.
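
As a rough illustration of the distinction above, here is a minimal Java sketch. It is not ManifoldCF code: `Outcome`, `DocumentTask`, and `ExceptionTriageSketch` are hypothetical names, and treating `ZipException` as "skip" and other `IOException`s as "retry" is only an assumed policy chosen for the example. The point is that anticipated exceptions get an explicit outcome, while anything else falls through to a generic "restart the worker" reaction.

```java
import java.io.IOException;
import java.util.zip.ZipException;

// Hypothetical outcome categories; illustrative only, not ManifoldCF API.
enum Outcome { INGESTED, SKIP_DOCUMENT, RETRY_LATER, RESTART_WORKER }

// Hypothetical stand-in for a connector's per-document processing step.
interface DocumentTask {
    void run() throws Exception;
}

class ExceptionTriageSketch {

    // Runs one document and maps whatever happened to an explicit outcome.
    static Outcome process(DocumentTask task) {
        try {
            task.run();
            return Outcome.INGESTED;
        } catch (ZipException e) {
            // Anticipated (assumed policy): a corrupt archive will not parse
            // on a later attempt either, so give up on this document.
            return Outcome.SKIP_DOCUMENT;
        } catch (IOException e) {
            // Anticipated (assumed policy): I/O trouble may be transient,
            // so schedule the document for another attempt.
            return Outcome.RETRY_LATER;
        } catch (Exception e) {
            // Not anticipated: nothing is known about the cause, so the
            // caller can only reset the worker and carry on.
            return Outcome.RESTART_WORKER;
        }
    }

    public static void main(String[] args) {
        System.out.println(process(() -> {}));
        System.out.println(process(() -> { throw new ZipException("bad header"); }));
        System.out.println(process(() -> { throw new IOException("connection reset"); }));
        System.out.println(process(() -> { throw new IllegalStateException("unexpected"); }));
    }
}
```

In this sketch only the last branch leaves the document's fate undecided, which is consistent with the same file being attempted repeatedly when the exception is one nobody planned for.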

> Hang crawling job on some ZIP documents
> ---------------------------------------
>
>                 Key: CONNECTORS-1317
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1317
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: File system connector
>    Affects Versions: ManifoldCF 2.3
>         Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> DB: Postgres 9.5.1
>            Reporter: Mr.Keuz
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
>
> I use ManifoldCF as a file crawler, but I found that the crawling process hangs on some
> zip files, although other files are parsed normally.
> Steps: 
> 1. Run ManifoldCF with "example/start.sh" and Postgres as the DB
> 2. Create a ManifoldCF pipeline: File -> Tika -> Solr
> 3. Put the zip file (linked below) in the folder
> 4. Run the job
> Here is the zip file that should reproduce the bug:
> "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
> https://yadi.sk/d/0uSdrR5GrsgmG 
> Note:
> As far as I could tell (using strace), the crawler process tries to open and parse the
> same zip file again and again (apparently from different worker threads), and the
> document does not seem to be removed from the queue.
> I am a newbie in ManifoldCF, so it is hard for me to find the problem in the source code.
> I can send some additional info if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
