manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: "start minimal" option even deletes contents whose links are deleted
Date Wed, 24 Dec 2014 06:31:30 GMT
Hi Karl.

Here is what I understand.

A minimal crawl does not run the cleanup phase.
The cleanup phase removes no-longer-reachable documents.
Even when the link to a page is removed from the root page, a minimal crawl
is not supposed to remove the no-longer-reachable page from the index.
If the hop count is set to 1, then the no-longer-reachable page should not be
affected because its hop count does not exceed 1.

If I am correct above, then I do not understand why the index entry of the
no-longer-reachable page is deleted.


2014-12-24 13:59 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>
> Hi Shigeki,
>
> Minimal crawls do not guarantee that there is no document deletion.  Such
> crawls only do the least amount of work possible based on what model the
> underlying connector implements.  This often just means not doing the
> "cleanup" phase at the end of the job run, which typically removes
> no-longer-reachable documents.  But if, for instance, you are using the web
> connector and you have hop count filtering enabled, then the framework will
> keep track of hop count and will remove all documents that exceed it, which
> does not require the end-of-job cleanup phase.
>
> If your goal is to avoid removing any previously crawled documents, then I
> am afraid that MCF does not have any real support for your model.  "Start
> minimal" is certainly not going to help you.
>
> Thanks,
> karl
>
>
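
As a rough illustration of the distinction Karl draws, here is a toy Python model. It is not ManifoldCF code: the function names, the `hop_filtering` flag, and the assumption that a page with no remaining path from a seed counts as exceeding the hop limit are all hypothetical, chosen only to show how hop count filtering could delete a document without any end-of-job cleanup phase.

```python
from collections import deque
import math


def hop_counts(links, seeds):
    """Breadth-first search: fewest link hops from any seed to each page."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def minimal_crawl(index, links, seeds, max_hops, hop_filtering=True):
    """Toy crawl that never runs an end-of-job cleanup phase (hypothetical model)."""
    dist = hop_counts(links, seeds)
    result = set(index)
    # Index every page currently reachable within the hop limit.
    result.update(page for page, d in dist.items() if d <= max_hops)
    if hop_filtering:
        # Assumption: a page with no path from a seed has an infinite hop
        # count, so it exceeds any limit and is removed during the crawl.
        result = {p for p in result if dist.get(p, math.inf) <= max_hops}
    return result


# "b" was indexed earlier; its link has since been removed from the root page.
previously_indexed = {"root", "a", "b"}
links_now = {"root": ["a"]}

with_filter = minimal_crawl(previously_indexed, links_now, ["root"], max_hops=1)
without_filter = minimal_crawl(previously_indexed, links_now, ["root"],
                               max_hops=1, hop_filtering=False)
```

In this model, page "b" stays in the index when hop filtering is off, because nothing ever cleans it up, but it is dropped when hop filtering is on, since it no longer has a hop count within the limit.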
