manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1497) Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE
Date Tue, 27 Feb 2018 08:16:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378201#comment-16378201 ]

Karl Wright commented on CONNECTORS-1497:
-----------------------------------------

Seeding is a well-defined connector contract that, at this point, has nothing to do with document
scheduling, even in a continuous job.  The contract identifies specific documents based on
the repository's capabilities, and those documents are not chosen based on what you want processed
first, but rather on the requirements of the connector model.  Conflating the two, I think,
may obligate connectors to manage their own document scheduling and pick the documents they
want processed first.  That's a significant contract change, and I'm quite concerned about
that.

The reason you want to do this at all is that you don't actually want document recrawls
to take place on any schedule at all -- you've set the recrawl time to infinity.  That basically
defeats the continuous crawl model entirely and presumes that documents, once crawled, are never
changed or deleted unless you reseed them.  So the real reason you want to do this is to give
the connector complete scheduling control over which documents are processed when.  Presumably,
your connector knows about deletions too, then?  Is there any reason it shouldn't be written
as MODEL_ADD_CHANGE_DELETE?  Continuous MODEL_ADD_CHANGE_DELETE jobs are a new thing, so if
this is your use case we should think it through carefully.
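
To illustrate the distinction between the two connector models named above, here is a
minimal toy sketch. It is not ManifoldCF code: the types ChangeKind, SeedingModel and
SeedingModelSketch, and the method needsRecrawlToDetect, are hypothetical names invented
for illustration. It assumes that a MODEL_ADD_CHANGE connector's seeding reports at least
added and changed documents, and that a MODEL_ADD_CHANGE_DELETE connector's seeding also
reports deletions. Under those assumptions, it shows which kinds of change would go
unnoticed if recrawls never happen.

```java
import java.util.EnumSet;
import java.util.Set;

// Toy model only: these types are hypothetical and are NOT ManifoldCF classes.
enum ChangeKind { ADDED, CHANGED, DELETED }

enum SeedingModel {
    // Assumption: seeding reports at least additions and changes.
    MODEL_ADD_CHANGE(EnumSet.of(ChangeKind.ADDED, ChangeKind.CHANGED)),
    // Assumption: seeding reports additions, changes, and deletions.
    MODEL_ADD_CHANGE_DELETE(EnumSet.allOf(ChangeKind.class));

    private final Set<ChangeKind> reportedBySeeding;

    SeedingModel(Set<ChangeKind> reportedBySeeding) {
        this.reportedBySeeding = reportedBySeeding;
    }

    // True when seeding alone does not surface this kind of change, so only a
    // scheduled recrawl of the document could notice it.
    boolean needsRecrawlToDetect(ChangeKind kind) {
        return !reportedBySeeding.contains(kind);
    }
}

class SeedingModelSketch {
    public static void main(String[] args) {
        // Print, for each model, which change kinds depend on recrawling.
        for (SeedingModel model : SeedingModel.values()) {
            for (ChangeKind kind : ChangeKind.values()) {
                System.out.println(model + " / " + kind
                    + ": needs recrawl = " + model.needsRecrawlToDetect(kind));
            }
        }
    }
}
```

In this toy model, an infinite recrawl interval combined with MODEL_ADD_CHANGE leaves
deletions undetected, which is the concern behind the question about
MODEL_ADD_CHANGE_DELETE.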


> Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1497
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1497
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Ahmed Mahfouz
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: CONNECTORS-1497.patch, CONNECTORS-1497.patch2, CONNECTORS-1497.patch3
>
>
> I am trying to avoid a full scan of all documents, for better efficiency with a large number
of documents. I tried many different settings for the jobs but couldn't accomplish that.
In particular, when the repository connector model is MODEL_ADD_CHANGE, I expected seeded modified
documents to be re-indexed immediately, like new seeds, but I found that the framework
uses the re-crawl time as the scheduled time and waits for the full scan before re-indexing them.
I avoided the full scan by setting the re-crawl interval to infinity, but my modified document
seeds were still not getting indexed. After digging into the code for quite some time, I made some
modifications to the JobManager and it worked for me. I would like to share the change with
you for review, so I opened this ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
