manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1562) Document removal Elastic
Date Mon, 10 Dec 2018 14:57:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714835#comment-16714835 ]

Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

Hi [~SteenTi], you are in essence making a seed list that is intended to be the entire list
of URLs that are crawled, and using hopcount filtering to try to make sure no links are
followed.  You are then removing individual seeds and expecting the corresponding URLs to be
removed from the index.  This usage model is not well tested (because of the hopcount involvement),
so I can well believe it doesn't do exactly what you'd expect.

We do not generally recommend this model because the seed list may well wind up being huge.
If there's no way you can create an index page of some kind, then you might be stuck with
it, but bear in mind that the Web Connector is not designed to support this model.

If this is the model you nevertheless intend to operate under, I will reopen the ticket and
try to reproduce the problem, but it will not be looked at until next weekend at the earliest,
as this is not my day job and this is not a supported model.
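
For readers following the thread, the expectation behind this usage model can be sketched as a set difference between the seed list of the previous run and the seed list of the current run. This is only an illustration of the behavior being asked for, not ManifoldCF code and not a description of how the Web Connector decides what to delete; the function name and the example.com URLs are hypothetical.

```python
def expected_deletions(previous_seeds, current_seeds):
    """Return the URLs one would expect to disappear from the index.

    Assumption for illustration only: every crawled URL is itself a seed
    (no links are followed), so a URL dropped from the seed list should
    no longer be present in the output index after the next run.
    """
    return sorted(set(previous_seeds) - set(current_seeds))


# Hypothetical seed lists for two consecutive job runs.
previous = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
current = [
    "https://example.com/a",
    "https://example.com/c",
]

to_delete = expected_deletions(previous, current)
print(to_delete)
```

Comparing a list like this against what is actually left in the index after a rerun is one way to narrow down which removed seeds were not deleted.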




> Document removal Elastic
> ------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>         Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 14-07-46.png
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running
> even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
