manifoldcf-user mailing list archives

From jetnet <jet...@gmail.com>
Subject Re: Efficient delta (incremental) indexing
Date Wed, 03 Aug 2016 11:52:01 GMT
Hi Karl,

I think the information coming from the CMS publishing logs and from the NTFS
master file table is accurate :) We just need to handle it properly.
What I'm currently missing:

for 1) - an option "Enable Delete for initial seeding" true/false (default
"false")
for 2b) - a query parameter for the JSON DELETE request:
jobs/<job_id>?purgeindex=<true|false> (default "true")
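
To make 2b) more concrete, the call could then look roughly like the sketch
below. The purgeindex parameter is only my proposal (it does not exist yet),
and the API base URL, port and job id are placeholders, so please adjust them
for your deployment:

import urllib.request

# Proposed call for 2b): delete the "incremental" job but keep its documents
# in the index. "purgeindex" is the parameter proposed above, not an existing
# feature; the base URL/port and the job id are placeholders.
MCF_API = "http://localhost:8345/mcf-api-service/json"
JOB_ID = "1234567890"

req = urllib.request.Request(
    MCF_API + "/jobs/" + JOB_ID + "?purgeindex=false", method="DELETE")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode() or "(empty response)")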

I guess it's worth doing, because it would improve incremental indexing
enormously, e.g. from several days (for file shares) down to several dozen
seconds.
Thanks!
--
rgds,
Konstantin

2016-08-03 12:29 GMT+02:00 Karl Wright <daddywri@gmail.com>:

> The crawler is supposed to have an accurate idea of what's been indexed. If
> it doesn't, then any incremental decisions it makes will probably be wrong.
> It sounds like you're trying to make it work with inaccurate information,
> so yes, I don't see any good way to make that work.
>
> Effectively you need to have the crawler be the one that fills up the index
> in the first place; after that it should be possible to do what you
> want.
>
> Karl
>
>
> On Wed, Aug 3, 2016 at 6:15 AM, jetnet <jetnet@gmail.com> wrote:
>
>> Hi All,
>>
>>
>>
>> I’m trying to find a way to reduce the time spent on incremental runs of
>> the crawler (HTTP, file system, file share) by creating a list of modified
>> files (created/modified and deleted).
>>
>> The challenge is how to supply the crawler with such a list?
>>
>> There are great interfaces (the JSON API and the scripting language) that
>> could be used for that, but:
>>
>>
>>
>> 1) no deletion command gets sent to the index for not-found (i.e. deleted)
>> entries from the modification list if the crawler hasn’t indexed these
>> files before
>>
>> 2a) re-using one “incremental” job: the crawler would delete the previously
>> indexed documents if they don’t appear on the modification list anymore
>>
>> 2b) re-creating the “incremental” job every time: the crawler would delete
>> ALL previously indexed docs from the index when the job gets deleted (see
>> the sketch below)
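>>
>> To illustrate 2b), the per-run cycle over the JSON API would be roughly the
>> sketch below. The API paths (/mcf-api-service/json, jobs, start/<job_id>)
>> and the job_id response field are written down from memory and the job
>> definition is just a placeholder, so please verify everything against the
>> ManifoldCF documentation:
>>
>> import json
>> import urllib.request
>>
>> MCF_API = "http://localhost:8345/mcf-api-service/json"  # assumed API service URL
>>
>> def call(method, path, body=None):
>>     # tiny helper around the ManifoldCF JSON API (layout assumed, see above)
>>     data = json.dumps(body).encode() if body is not None else None
>>     req = urllib.request.Request(MCF_API + "/" + path, data=data, method=method)
>>     req.add_header("Content-Type", "application/json")
>>     with urllib.request.urlopen(req) as resp:
>>         raw = resp.read()
>>         return json.loads(raw) if raw else {}
>>
>> # 1. re-create the "incremental" job from a saved definition (placeholder here)
>> job_definition = {"job": {"_placeholder_": "full job document goes here"}}
>> job_id = call("POST", "jobs", job_definition).get("job_id")
>>
>> # 2. start the job (polling its status until completion is omitted here)
>> call("PUT", "start/" + str(job_id))
>>
>> # 3. delete the job afterwards - today this also removes ALL of the job's
>> #    documents from the index, which is exactly the problem described in 2b)
>> call("DELETE", "jobs/" + str(job_id))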
>>
>>
>>
>> So currently I see no way to do incremental indexing based on a
>> modification list without extending the framework’s functionality, or maybe
>> I’ve missed something and there are features I’m not aware of?
>>
>> Thanks!
>>
>>
>> --
>>
>> rgds,
>>
>> Konstantin
>>
>
>
