manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1093) ManifoldCF document reprioritization bottleneck
Date Tue, 04 Nov 2014 08:39:33 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195881#comment-14195881 ]

Karl Wright commented on CONNECTORS-1093:
-----------------------------------------

The reprioritization code, which processes documents in batches of 10000, needs to use the bin
preload feature.  It should build all the PriorityCalculator objects as it currently does, and
then add a preload request for each one using PriorityCalculator.makePreloadRequest().  I am
still looking for where the preload request would be fired off.
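
To illustrate why preloading should help, here is a minimal, self-contained sketch of the
general idea: collect every bin name needed by a batch, read the bin values in one round trip,
compute priorities from memory, and write the updated values back in one more round trip.
None of this is ManifoldCF code.  BinStoreSketch, BatchPrioritizerSketch, the round-trip
counter, and the placeholder priority formula are all hypothetical names and assumptions made
for the example; the real PriorityCalculator and bin preload API may look quite different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the bin table; it counts simulated database round trips. */
class BinStoreSketch {
  private final Map<String, Double> counters = new HashMap<>();
  int roundTrips = 0;

  /** One simulated round trip that reads the current value of every requested bin. */
  Map<String, Double> loadAll(Set<String> binNames) {
    roundTrips++;
    Map<String, Double> result = new HashMap<>();
    for (String name : binNames) {
      result.put(name, counters.getOrDefault(name, 0.0));
    }
    return result;
  }

  /** One simulated round trip that writes back all updated bin values. */
  void storeAll(Map<String, Double> values) {
    roundTrips++;
    counters.putAll(values);
  }
}

/** Hypothetical batch prioritizer: preload once, then compute everything in memory. */
class BatchPrioritizerSketch {
  /**
   * Each element of docBins is the list of bin names for one document.
   * The "priority" here is a placeholder (the largest bin counter seen for the
   * document), not ManifoldCF's actual priority formula.
   */
  static List<Double> prioritize(BinStoreSketch store, List<String[]> docBins) {
    // Step 1: gather the distinct bin names for the whole batch.
    Set<String> needed = new LinkedHashSet<>();
    for (String[] bins : docBins) {
      for (String bin : bins) {
        needed.add(bin);
      }
    }

    // Step 2: a single preload instead of one update per document per bin.
    Map<String, Double> cache = store.loadAll(needed);

    // Step 3: read and increment bin values purely in memory.
    List<Double> priorities = new ArrayList<>();
    for (String[] bins : docBins) {
      double worst = 0.0;
      for (String bin : bins) {
        double current = cache.get(bin);
        cache.put(bin, current + 1.0);
        worst = Math.max(worst, current);
      }
      priorities.add(worst);
    }

    // Step 4: a single write-back for the whole batch.
    store.storeAll(cache);
    return priorities;
  }
}
```

In this sketch a batch costs two round trips no matter how many documents it contains,
whereas the path in the thread dump for this issue performs a database update from inside
getDocumentPriority() for each document.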


> ManifoldCF document reprioritization bottleneck
> -----------------------------------------------
>
>                 Key: CONNECTORS-1093
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1093
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 1.7.2, ManifoldCF 1.8, ManifoldCF 2.0
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.7.2, ManifoldCF 1.8, ManifoldCF 2.0
>
>
> Starting a job with 200K+ documents now takes many minutes.  The reason seems to be document
> reprioritization, which has a significant bottleneck.  A thread dump shows:
> {code}
> 	at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694)
> 	at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
> 	at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:762)
> 	at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1435)
> 	at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
> 	at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
> 	at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
> 	at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
> 	at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
> 	at org.apache.manifoldcf.crawler.bins.BinManager.getIncrementBinValues(BinManager.java:158)
> 	at org.apache.manifoldcf.crawler.reprioritizationtracker.ReprioritizationTracker.getIncrementBinValue(ReprioritizationTracker.java:328)
> 	at org.apache.manifoldcf.crawler.system.PriorityCalculator.getDocumentPriority(PriorityCalculator.java:145)
> 	at org.apache.manifoldcf.crawler.jobs.JobQueue.writeDocPriority(JobQueue.java:874)
> 	at org.apache.manifoldcf.crawler.jobs.JobManager.writeDocumentPriorities(JobManager.java:2142)
> 	at org.apache.manifoldcf.crawler.system.ManifoldCF.writeDocumentPriorities(ManifoldCF.java:1121)
> 	at org.apache.manifoldcf.crawler.system.ManifoldCF.resetAllDocumentPriorities(ManifoldCF.java:1054)
> 	at org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:141)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
