manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1497) Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE
Date Mon, 26 Feb 2018 17:51:00 GMT


Karl Wright commented on CONNECTORS-1497:

This is the wrong place to put this in any case.

Please examine the method signature:

  /** Add an initial set of documents to the queue.
  * This method is called during job startup, when the queue is being loaded.
  * A set of document references is passed to this method, which updates the status of the
  * documents in the specified job's queue, according to specific state rules.
  *@param processID is the current process ID.
  *@param jobID is the job identifier.
  *@param legalLinkTypes is the set of legal link types that this connector generates.
  *@param docIDs are the local document identifiers.
  *@param overrideSchedule is true if any existing document schedule should be overridden.
  *@param hopcountMethod is either accurate, nodelete, or neverdelete.
  *@param documentPriorities are the document priorities corresponding to the document identifiers.
  *@param prereqEventNames are the events that must be completed before each document can
  *  be processed.
  */
  public void addDocumentsInitial(String processID, Long jobID, String[] legalLinkTypes,
    String[] docIDHashes, String[] docIDs, boolean overrideSchedule,
    int hopcountMethod, IPriorityCalculator[] documentPriorities,
    String[][] prereqEventNames)
    throws ManifoldCFException

Note the parameter called "overrideSchedule".  You want to set that to "true" to override
the schedule in the manner you are trying to do.

This method is called during seeding.  When it is called during the run of a non-continuous
job, overrideSchedule=true already.  So the question is whether you want all *continuous*
jobs to override the schedule every time they reseed.  I'm still not sold that that is the
right thing, but assuming it is, you want to find where that happens (continuous job seeding
is done by a different thread than initial job seeding) and change that parameter
in the addDocumentsInitial() method call there.
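
To make the effect of the flag concrete, here is a toy sketch in plain Java. It is not ManifoldCF code: the class SeedingSketch, the NEVER sentinel, and the markProcessed() and isDue() helpers are hypothetical, and the two-argument-plus-clock addDocumentsInitial() below models only docIDs and overrideSchedule. It assumes that "overriding the schedule" means making an already-queued document due immediately, which is how the parameter description above reads.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a job queue; an illustration only, not ManifoldCF code. */
class SeedingSketch {
  /** Sentinel for "never re-check", standing in for an infinite re-crawl interval. */
  static final long NEVER = Long.MAX_VALUE;

  /** Maps a document identifier to the time at which it is next due for processing. */
  private final Map<String, Long> checkTimes = new HashMap<>();

  /** Simplified analogue of addDocumentsInitial(): only docIDs and overrideSchedule are modeled. */
  void addDocumentsInitial(String[] docIDs, boolean overrideSchedule, long now) {
    for (String docID : docIDs) {
      boolean alreadyQueued = checkTimes.containsKey(docID);
      if (!alreadyQueued || overrideSchedule) {
        // New documents, and existing ones when overriding, become due right away.
        checkTimes.put(docID, now);
      }
      // Otherwise the existing schedule is left untouched.
    }
  }

  /** Records that a document was processed and when it should next be checked. */
  void markProcessed(String docID, long nextCheckTime) {
    checkTimes.put(docID, nextCheckTime);
  }

  /** Returns true if the document is queued and its scheduled time has arrived. */
  boolean isDue(String docID, long now) {
    Long scheduled = checkTimes.get(docID);
    return scheduled != null && scheduled <= now;
  }

  public static void main(String[] args) {
    SeedingSketch queue = new SeedingSketch();
    String[] seeds = {"doc1"};

    queue.addDocumentsInitial(seeds, false, 100L);   // first seeding: doc1 is new
    queue.markProcessed("doc1", NEVER);              // re-crawl interval is "infinity"

    queue.addDocumentsInitial(seeds, false, 200L);   // reseed without overriding
    System.out.println("due without override: " + queue.isDue("doc1", 200L));

    queue.addDocumentsInitial(seeds, true, 300L);    // reseed with overrideSchedule=true
    System.out.println("due with override: " + queue.isDue("doc1", 300L));
  }
}
```

In this model a reseeded document with an infinite re-crawl interval only becomes due again when the call passes overrideSchedule=true, which is the behavior the ticket is asking for on continuous jobs.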

> Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE
> -------------------------------------------------------------------------------------------------------------------
>                 Key: CONNECTORS-1497
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Ahmed Mahfouz
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: CONNECTORS-1497.patch
> Trying to avoid a full scan of all documents, for better efficiency with a large number of
> documents. I tried many different settings for the jobs but couldn't accomplish that.
> In particular, when the repository connector model is MODEL_ADD_CHANGE, I expected seeded
> modified documents to be re-indexed immediately, like new seeds, but I found that the
> re-crawl time is used as the scheduled time, so they wait for the full scan to be re-indexed.
> I avoided the full scan by setting the re-crawl interval to infinity, but my modified seeded
> documents were still not getting indexed. After digging into the code for quite a while, I
> made some modifications to the JobManager and it worked for me. I would like to share the
> change with you for review, so I opened this ticket.
