manifoldcf-dev mailing list archives

From "Tim Steenbeke (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
Date Mon, 31 Dec 2018 10:33:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731261#comment-16731261 ]

Tim Steenbeke commented on CONNECTORS-1562:
-------------------------------------------

[~kwright@metacarta.com] I think there was some miscommunication:


    The "stopped working" issue was found by my colleague Donald Van den Driessche,
so I didn't have any more information than what he gave me.
    I reproduced the issue, and this is the error:
{code:java}
Error: Repeated service interruptions - failure processing document: Stream Closed{code}
!Screenshot from 2018-12-31 11-17-29.png!
   

The question I wanted answered is: how are we supposed to set up the job with the data we
have? What you see as the best solution might not be the right solution for us.
I asked this and you only responded to the other issue with ManifoldCF; it looked like you
avoided the question.
You suggested using the URL with the sitemap but with excludes. This is simply not possible,
because the exclude list is too big and no regular expression can cover it, given the
randomness of the links.
So on this part I also thought that you were looking into this and had found a fix or edited
the code.

I'm sorry if my wording came across as blunt, but I'm just trying to get information and I
didn't know any other way to draw your attention to the full picture of the comment.
English is not my first language, so I'm sorry for my limited vocabulary; Google Translate
also doesn't help here.
So I hope we can continue this conversation and get to a solution, hopefully one that
works for both of us.

 

> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: Screenshot from 2018-12-31 11-17-29.png, manifoldcf.log.cleanup,
manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the job with changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to keep it running
even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
