manifoldcf-dev mailing list archives

From "Tim Steenbeke (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
Date Tue, 18 Dec 2018 12:46:00 GMT


Tim Steenbeke commented on CONNECTORS-1562:

There is no regex, and there is no way to write one for this. That is the problem with
creating the exclude/blacklist.

'Started acting strange' means it stopped working and crashed.
That is not the question, though. Please answer my question:
is this the way we have to run the job?
Tim Steenbeke added a comment:

We create a job with 1 seed, which is the full seed-map (+29000 URLs), ES output, web input,
and hop count 1 for links and 0 for redirects:
 # run job
 # ±29000 documents get pushed to ES
 # sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
 # wait until the scheduled time
 # run job
 # documents get added/deleted (e.g. 10 documents deleted)
 # wait until the scheduled time
 # ...
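
For illustration only, the outcome expected from the steps above can be sketched as a set difference between the URLs reachable in the previous run and those reachable in the current one. This is a minimal sketch under stated assumptions, not how ManifoldCF computes reachability or hop counts; `plan_cleanup` and the `example.com` URLs are hypothetical names made up for the example.

```python
def plan_cleanup(previous_urls, current_urls):
    """Compare two runs of the seed list (hypothetical helper).

    Returns (to_delete, to_add): URLs that were indexed in the previous run
    but are no longer reachable, and URLs that are newly reachable.
    """
    previous = set(previous_urls)
    current = set(current_urls)
    to_delete = sorted(previous - current)  # should be removed from the index
    to_add = sorted(current - previous)     # should be pushed to the index
    return to_delete, to_add


# Made-up data mirroring the numbers in the comment: 29000 URLs become 28990.
previous = [f"https://example.com/page/{i}" for i in range(29000)]
current = previous[:28990]

to_delete, to_add = plan_cleanup(previous, current)
print(len(to_delete), len(to_add))  # prints "10 0"
```

In these terms, the behaviour reported in the issue is that the `to_delete` set is never removed from the Elasticsearch index on the second run.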

> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>                 Key: CONNECTORS-1562
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>         Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>   Original Estimate: 4h
>  Remaining Estimate: 4h
> My documents aren't removed from the ElasticSearch index after rerunning the changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running
> even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.
