manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
Date Tue, 11 Dec 2018 14:43:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717278#comment-16717278 ]

Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

[~SteenTi] The issue was reopened many hours ago.  As I stated, however, it is a very complex
issue and may require significant framework changes to fix, so it cannot happen quickly.
I estimate *at best* two weeks, and possibly a month or more.  It is certainly not something
you should count on tomorrow.  Furthermore, I continue to advise against your general approach.

If you have a site map page, why can't you simply have *one* seed, pointing at that site map,
no hopcount filtering, and an exclusion list to remove pages you don't want indexed?  That's
how the connector is designed to work.  In that model URLs that are removed from the site,
or put into the exclusion list, *will* be deleted from the index.
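
To illustrate the model described above, here is a minimal sketch in Python. It is not
ManifoldCF code: the function names and the example URLs are hypothetical, and it assumes
that each exclusion is a regular expression matched against the full URL. It only shows the
bookkeeping: anything in the index that the latest crawl did not reach, or that now matches
an exclusion, becomes a deletion candidate.

```python
import re

def is_excluded(url, exclusion_patterns):
    """Return True if any exclusion regex matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in exclusion_patterns)

def urls_to_delete(indexed, reachable, exclusion_patterns):
    """Return indexed URLs that the latest crawl no longer keeps.

    indexed: URLs currently in the search index (hypothetical input).
    reachable: URLs found by following links from the single seed.
    """
    kept = {u for u in reachable if not is_excluded(u, exclusion_patterns)}
    return set(indexed) - kept

# Hypothetical data: /b was removed from the site, /private/c is now excluded.
indexed = {
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/private/c",
}
reachable = {"https://example.com/a", "https://example.com/private/c"}
exclusions = [r"/private/"]

print(sorted(urls_to_delete(indexed, reachable, exclusions)))
```

Both the page removed from the site and the newly excluded page end up in the deletion set,
which is the behavior described for the single-seed setup.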

If the customer's demands are rigid and they want a crawler where they simply load up the
queue with URLs, you always have the option of constructing an RSS feed or developing a custom
connector.  RSS feeds don't follow links in listed documents at all, and they would seem to
have everything else you need.
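
As a rough sketch of the RSS option, the following Python builds a minimal RSS 2.0 document
from a fixed list of URLs using only the standard library. The function name, feed title,
and URLs are hypothetical, and which RSS elements the ManifoldCF RSS connector actually
reads is an assumption here; check the connector documentation before relying on this.

```python
import xml.etree.ElementTree as ET

def build_rss(title, link, urls):
    """Build a minimal RSS 2.0 feed with one <item> per URL."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "URLs to be crawled"
    for url in urls:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = url
        ET.SubElement(item, "link").text = url
    return ET.tostring(rss, encoding="unicode")

# Hypothetical URL list; dropping a URL here drops it from the feed.
feed_xml = build_rss(
    "Customer crawl list",
    "https://example.com/",
    ["https://example.com/a", "https://example.com/b"],
)
print(feed_xml)
```

Regenerating the feed whenever the customer's URL list changes keeps the crawl queue under
direct control without relying on hopcount filtering.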


> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after I change the seeds and rerun the job.
> I update my job to change the seed map and then rerun it, or use the scheduler to keep it running
> even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
