manifoldcf-dev mailing list archives

From "Ahmed Mahfouz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1497) Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE
Date Mon, 26 Feb 2018 18:00:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377280#comment-16377280 ]

Ahmed Mahfouz commented on CONNECTORS-1497:
-------------------------------------------

[~kwright@metacarta.com] I thought of that, but I didn't want to change how ManifoldCF works
for continuous jobs. I wanted to limit the change to jobs whose re-crawl interval is infinity
(checkTimeValue is null), so that modified document seeds can be re-indexed right away. Even
with override schedule set to true, the PENDINGPURGATORY status is a hurdle to modifying the
executionTime.
{code:java}
/** Update an existing record (as the result of an initial add).
* The record is presumed to exist and have been locked, via "FOR UPDATE".
*/
public void updateExistingRecordInitial(Long recordID, int currentStatus, Long checkTimeValue,
  long desiredExecuteTime, IPriorityCalculator desiredPriority, String[] prereqEvents,
  String processID)
  throws ManifoldCFException
{
  // The general rule here is:
  // If doesn't exist, make a PENDING entry.
  // If PENDING, keep it as PENDING.
  // If COMPLETE, make a PENDING entry.
  // If PURGATORY, make a PENDINGPURGATORY entry.
  // Leave everything else alone and do nothing.

  HashMap map = new HashMap();
  switch (currentStatus)
  {
  case STATUS_ACTIVE:
  case STATUS_ACTIVEPURGATORY:
  case STATUS_ACTIVENEEDRESCAN:
  case STATUS_ACTIVENEEDRESCANPURGATORY:
  case STATUS_BEINGCLEANED:
    // These are all the active states.  Being in this state implies that a thread may be working on the document.  We
    // must not interrupt it.
    // Initial adds never bring along any carrydown info, so we should be satisfied as long as the record exists.
    break;

  case STATUS_COMPLETE:
  case STATUS_UNCHANGED:
  case STATUS_PURGATORY:
    // Set the status and time both
    map.put(statusField,statusToString(STATUS_PENDINGPURGATORY));
    TrackerClass.noteRecordChange(recordID, STATUS_PENDINGPURGATORY, "Update existing record initial");
    if (desiredExecuteTime == -1L)
      map.put(checkTimeField,new Long(0L));
    else
      map.put(checkTimeField,new Long(desiredExecuteTime));
    map.put(checkActionField,actionToString(ACTION_RESCAN));
    map.put(failTimeField,null);
    map.put(failCountField,null);
    // Update the doc priority.
    map.put(docPriorityField,new Double(desiredPriority.getDocumentPriority()));
    map.put(needPriorityField,needPriorityToString(NEEDPRIORITY_FALSE));
    break;

  case STATUS_PENDING:
    // Bump up the schedule if called for
    Long cv = checkTimeValue;
    if (cv != null)
    {
      long currentExecuteTime = cv.longValue();
      if ((desiredExecuteTime == -1L || currentExecuteTime <= desiredExecuteTime))
      {
        break;
      }
    }
    else
    {
      if (desiredExecuteTime == -1L)
      {
        break;
      }
    }
    map.put(checkTimeField,new Long(desiredExecuteTime));
    map.put(checkActionField,actionToString(ACTION_RESCAN));
    map.put(failTimeField,null);
    map.put(failCountField,null);
    // The existing doc priority field should be preserved.
    break;

  case STATUS_PENDINGPURGATORY:
    // In this case we presume that the reason we are in this state is due to adaptive crawling or retry, so DON'T bump up the schedule!
    // The existing doc priority field should also be preserved.
    break;

  default:
    break;

  }
  map.put(isSeedField,seedstatusToString(SEEDSTATUS_NEWSEED));
  map.put(seedingProcessIDField,processID);
  // Delete any existing prereqevent entries first
  prereqEventManager.deleteRows(recordID);
  ArrayList list = new ArrayList();
  String query = buildConjunctionClause(list,new ClauseDescription[]{
    new UnitaryClause(idField,recordID)});
  performUpdate(map,"WHERE "+query,list,null);
  // Insert prereqevent entries, if any
  prereqEventManager.addRows(recordID,prereqEvents);
  noteModifications(0,1,0);
}
{code}
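A minimal, self-contained sketch of the condition being discussed: treat a null checkTimeValue (re-crawl interval "infinity") as permission to bump the schedule of a PENDINGPURGATORY record, so a modified seed is rescanned right away while ordinary adaptive-crawl/retry scheduling is left untouched. The class name and constant value below are hypothetical illustrations, not the actual CONNECTORS-1497 patch.

{code:java}
// Hypothetical sketch: isolates the proposed scheduling decision from the
// JobQueue code above.  STATUS_PENDINGPURGATORY's value is a placeholder.
public class PendingPurgatorySketch {
  static final int STATUS_PENDINGPURGATORY = 8; // placeholder value

  /** Decide whether to override the scheduled execution time.
  * Only PENDINGPURGATORY records of jobs with an infinite re-crawl
  * interval (checkTimeValue == null) qualify; all other cases keep
  * their existing schedule.
  */
  static boolean shouldBumpSchedule(int currentStatus, Long checkTimeValue) {
    return currentStatus == STATUS_PENDINGPURGATORY && checkTimeValue == null;
  }

  public static void main(String[] args) {
    // Infinite re-crawl interval: rescan the modified seed now.
    System.out.println(shouldBumpSchedule(STATUS_PENDINGPURGATORY, null));
    // Finite interval: preserve the existing (adaptive/retry) schedule.
    System.out.println(shouldBumpSchedule(STATUS_PENDINGPURGATORY, Long.valueOf(60000L)));
  }
}
{code}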

> Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1497
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1497
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Ahmed Mahfouz
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: CONNECTORS-1497.patch
>
>
> Trying to avoid a full scan of all documents for better efficiency with a large number
> of documents, I tried many different job settings but couldn't accomplish that. In
> particular, when the repository connector model is MODEL_ADD_CHANGE, I expected seeded
> modified documents to be re-indexed immediately, just like new seeds, but I found that the
> re-crawl time is used as the scheduled time, so they wait for the full scan to get
> re-indexed. I avoided the full scan by setting the re-crawl interval to infinity, but my
> modified document seeds were still not getting indexed. After digging into the code for
> quite some time, I made a modification to the JobManager that worked for me. I would like
> to share the change with you for review, so I opened this ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
