manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1562) Document removal Elastic
Date Wed, 05 Dec 2018 16:16:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710279#comment-16710279
] 

Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

Hi [~SteenTi], unreachable documents will still not be deleted if you run your job using
the "minimal" cycle.  Please be sure you are using the "full" cycle.

If you need very short cycles, you will need to make a tradeoff between getting
new content in and removing old content.  Typically we recommend that you schedule your job
to use "minimal" crawls most of the time, but use "full" runs periodically to clean out unreachable
documents.
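
As a rough illustration of that tradeoff, here is a minimal sketch of a run policy in which
every Nth run is "full" and all others are "minimal".  The function name crawl_modes and the
parameter full_every are made up for this example; they are not ManifoldCF settings, and the
real schedule is configured on the job itself.

```python
from itertools import islice


def crawl_modes(full_every):
    """Yield the crawl mode for successive runs.

    Hypothetical policy: every `full_every`-th run is "full" (which cleans
    out unreachable documents); all other runs are "minimal".
    """
    if full_every < 1:
        raise ValueError("full_every must be at least 1")
    run = 0
    while True:
        run += 1
        yield "full" if run % full_every == 0 else "minimal"


# First eight runs when every fourth run is a full crawl.
schedule = list(islice(crawl_modes(4), 8))
print(schedule)
```

A smaller full_every removes stale content sooner at the cost of longer average cycles.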

If you believe you are running "full" crawls and there is still no cleanup, I can assure
you that the Web Connector has automated tests that verify it properly cleans
up unreachable documents.  So there would be two possibilities: (1) this is specific to changes
in seeds, or (2) the Elastic Search Connector is transmitting deletes that are failing silently
for some reason.  To figure out which it is, please run a cycle manually and look
at the Simple History report to see if deletions are logged.
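
The check described above could be sketched roughly as follows.  This assumes the history
rows have been copied by hand into dictionaries; the keys "activity" and "identifier", the
"document deletion" prefix, and the example URLs are all assumptions made for illustration,
not a documented export format.

```python
def diagnose(history, stale_ids, deletion_prefix="document deletion"):
    """Classify each stale document by where the cleanup appears to break.

    history   -- list of dicts with assumed keys "activity" and "identifier"
    stale_ids -- documents that are unreachable but still present in the index
    Returns a dict mapping each stale id to "output" or "crawl".
    """
    logged = {
        rec["identifier"]
        for rec in history
        if rec["activity"].startswith(deletion_prefix)
    }
    result = {}
    for doc_id in stale_ids:
        if doc_id in logged:
            # A delete was logged, yet the document is still indexed:
            # possibility (2), the delete failed on the output side.
            result[doc_id] = "output"
        else:
            # No delete was logged at all: possibility (1), the crawl
            # never decided to remove the document.
            result[doc_id] = "crawl"
    return result


# Hypothetical history rows and stale documents.
history = [
    {"activity": "document deletion (elastic)", "identifier": "http://example.com/old-a"},
    {"activity": "fetch", "identifier": "http://example.com/new"},
]
stale = ["http://example.com/old-a", "http://example.com/old-b"]
report = diagnose(history, stale)
print(report)
```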


> Document removal Elastic
> ------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>         Attachments: Screenshot from 2018-12-05 09-01-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running
> even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
