manifoldcf-dev mailing list archives

From "Tim Steenbeke (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1562) Document removal Elastic
Date Wed, 05 Dec 2018 12:09:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709986#comment-16709986
] 

Tim Steenbeke commented on CONNECTORS-1562:
-------------------------------------------

The documentation states:
{code:java}
A typical non-continuous run of a job has the following stages of execution:

Adding the job's new, changed, or deleted starting points to the queue ("seeding")
Fetching documents, discovering new documents, and detecting deletions
Removing no-longer-included documents from the queue

Jobs can also be run "continuously", which means that the job never completes, unless it is
aborted. A continuous run has different stages of execution:

Adding the job's new, changed, or deleted starting points to the queue ("seeding")
Fetching documents, discovering new documents, and detecting deletions, while reseeding periodically

Note that continuous jobs cannot remove no-longer-included documents from the queue. They
can only remove documents that have been deleted from the repository.{code}
Both kinds of job should detect deletions, but only a non-continuous job should remove unreachable documents.
Knowing this, I changed the job to a non-continuous job that starts every 5 minutes for testing.
Even as a non-continuous job, it doesn't delete the unreachable documents.
It keeps all documents indexed in Elasticsearch.
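
To make the expected behaviour concrete, here is a toy model of the two job modes as the quoted documentation describes them. It is only a sketch and not ManifoldCF code: the function names `reachable` and `expected_index_after_run`, and the representation of the website as a dict mapping each URL to its outgoing links, are assumptions made for illustration.

```python
from collections import deque


def reachable(links, seeds):
    """Breadth-first walk of the link graph, starting from the seed URLs.

    `links` maps every URL that still exists on the site to the URLs it
    links to (a hypothetical stand-in for the crawled website).
    """
    seen = set()
    queue = deque(s for s in seeds if s in links)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        # Only follow links to pages that still exist.
        queue.extend(t for t in links[url] if t in links and t not in seen)
    return seen


def expected_index_after_run(indexed, links, seeds, continuous):
    """Return the set of URLs expected in the output index after one run."""
    found = reachable(links, seeds)
    # Both modes detect deletions: drop documents that no longer exist.
    kept = {url for url in indexed if url in links}
    if not continuous:
        # Only a non-continuous run removes no-longer-included documents,
        # i.e. documents that still exist but were not reached from the seeds.
        kept &= found
    return kept | found


site = {"a": ["b"], "b": [], "c": []}   # "c" exists but is no longer linked
indexed = {"a", "b", "c", "gone"}       # "gone" was deleted from the site

print(sorted(expected_index_after_run(indexed, site, ["a"], continuous=False)))
print(sorted(expected_index_after_run(indexed, site, ["a"], continuous=True)))
```

In this model the non-continuous run leaves only `a` and `b`, while the continuous run also keeps `c`. What I observe corresponds to `c` staying in the index even after a non-continuous run.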

> Document removal Elastic
> ------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>         Attachments: Screenshot from 2018-12-05 09-01-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running
> even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
