manifoldcf-dev mailing list archives

From "Tim Steenbeke (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
Date Tue, 18 Dec 2018 12:48:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724016#comment-16724016
] 

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 12:47 PM:
----------------------------------------------------------------------

There is no regex, and there is no possibility of making a regex for this. That's the issue with
creating the exclude/blacklist. There is already a regex being used for images, documents
and other files that don't have to be crawled.

'Started acting strange' means it stopped working and crashed: because of the number of URLs it stopped
indexing. No messages or errors were given; the job just stopped working.
----
This is not the question. Please answer my question.
Is this the way we have to run the job:
{panel}
_*Tim Steenbeke added a comment*_

We create a job with 1 seed, which is the full seed-map (+29000 URLs), ES output, web input
and hop count 1 for links and 0 for redirects:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
 # wait till scheduled time
 # run job
 # documents get added/deleted (e.g. 10 documents deleted)
 # wait till scheduled time
 # ...{panel}
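
As a rough sketch of the outcome these steps are expected to produce (this is not ManifoldCF's actual cleanup logic), the documents to add and delete between two runs are simply the set differences of the sitemap URLs. The function name `expected_changes` and the example.com URLs are assumptions made purely for illustration:

```python
def expected_changes(previous_urls, current_urls):
    """Return (to_add, to_delete) between two crawl passes.

    Illustrative only: models the result the reporter expects,
    not how ManifoldCF computes reachability or hop counts.
    """
    previous = set(previous_urls)
    current = set(current_urls)
    to_add = sorted(current - previous)      # new in the updated sitemap
    to_delete = sorted(previous - current)   # dropped from the updated sitemap
    return to_add, to_delete


# Hypothetical data standing in for the sitemap contents.
first_pass = [f"https://example.com/page/{i}" for i in range(29000)]
second_pass = first_pass[:28990]  # ten URLs removed from the sitemap

to_add, to_delete = expected_changes(first_pass, second_pass)
print(len(to_add), len(to_delete))
```

With 29000 URLs in the first pass and 28990 in the second, the second run would be expected to delete exactly the ten URLs that dropped out of the sitemap and add none.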
 


was (Author: steenti):
There is no regex, and there is no possibility of making a regex for this. That's the issue with
creating the exclude/blacklist.


'Started acting strange' means it stopped working and crashed.
----
This is not the question. Please answer my question.
Is this the way we have to run the job:
{panel}
_*Tim Steenbeke added a comment*_

We create a job with 1 seed, which is the full seed-map (+29000 URLs), ES output, web input
and hop count 1 for links and 0 for redirects:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
 # wait till scheduled time
 # run job
 # documents get added/deleted (e.g. 10 documents deleted)
 # wait till scheduled time
 # ...{panel}
 

> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running
> even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
