manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Efficient delta (incremental) indexing
Date Wed, 03 Aug 2016 10:29:55 GMT
The crawler is supposed to have an accurate idea of what's been indexed. If it
doesn't, then any incremental decisions it makes will probably be wrong.  It
sounds like you're trying to make it work with inaccurate information, so
yes, I don't see any good way to make that work.

Effectively you need to have the crawler be the one that fills up the index
in the first place; after that it should all be possible to do what you
want.
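
To illustrate the point, here is a toy model in Python. It is not ManifoldCF
code, and the function and variable names are hypothetical. It only sketches,
under the assumption that the crawler keeps a record of document id to indexed
version, why a delete can only be issued for a document the crawler itself
recorded as indexed.

```python
# Toy model only: not ManifoldCF code; all names here are hypothetical.

def plan_incremental_run(recorded, current):
    """Compare the crawler's record of indexed documents with the repository.

    recorded: dict of doc id -> version the crawler believes is indexed
    current:  dict of doc id -> version now present in the repository
    Returns (to_index, to_delete) as sorted lists of doc ids.
    """
    # New or changed documents get (re)indexed.
    to_index = sorted(d for d, v in current.items() if recorded.get(d) != v)
    # Deletions are only issued for documents the crawler recorded as indexed.
    to_delete = sorted(d for d in recorded if d not in current)
    return to_index, to_delete


current = {"a.txt": "v2", "c.txt": "v1"}

# The crawler filled the index itself, so its record is accurate.
print(plan_incremental_run({"a.txt": "v1", "b.txt": "v1"}, current))

# The index was filled by something else: the record is empty, so a stale
# "b.txt" entry in the index never receives a delete.
print(plan_incremental_run({}, current))
```

In the second call nothing is scheduled for deletion, which is the situation
described in point 1 of the quoted message.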

Karl


On Wed, Aug 3, 2016 at 6:15 AM, jetnet <jetnet@gmail.com> wrote:

> Hi All,
>
>
>
> I’m trying to find a way to reduce the time spent on incremental runs of
> the crawler (HTTP, file system, file share) by creating a list of modified
> files (created/modified and deleted).
>
> The challenge is how to supply the crawler with such a list.
>
> There are great interfaces (JSON API and scripting language), which could
> be used for that, but:
>
>
>
> 1) no deletion command gets sent to the index for NOT-Found (deleted
> files) entries from the modification list, if the crawler hasn’t indexed
> these files before
>
> 2a) re-using one “incremental” job: the crawler would delete the previously
> indexed documents if they don’t appear on the modification list anymore
>
> 2b) re-creating the “incremental” job every time: the crawler would delete ALL
> previously indexed docs from the index if the job gets deleted
>
>
>
> So, currently I see no way to do incremental indexing based on
> a modification list without extending the functionality of the framework,
> or maybe I missed something and there are features I’m not aware of?
>
> Thanks!
>
>
> --
>
> rgds,
>
> Konstantin
>
