manifoldcf-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aeham Abushwashi <aeham.abushwa...@exonar.com>
Subject (Continuous) crawl performance
Date Thu, 30 Oct 2014 23:30:41 GMT
Hi,

I have a multi-node zookeeper-based test environment with ~29M documents
(~36M jobqueue records and ~29M ingeststatus records) and jobs in varying
states of being (running vs complete vs stopped vs paused). Some jobs are
continuous and a few are not.

During a quiet time -when no documents are actively ingested- I see very
high CPU utilisation on the Postgresql database server. This is primarily a
result of two queries executed by the Stuffer Thread; namely the main query
in JobManager#fetchAndProcessDocuments and the doc priority query in
JobManager#getNextDocuments

==========================

Query 1:
SELECT t0.id,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0  WHERE  (t0.status=? OR t0.status=?) AND t0.checkaction=?
AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status=?
OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?) AND NOT EXISTS(SELECT
'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN
(?,?,?,?,?,?) AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM
prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
ORDER BY t0.docpriority ASC LIMIT 280

Query 1 example params:
[P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d]

++++++++

Query 2:
SELECT docpriority,jobid,dochash,docid FROM jobqueue t0  WHERE  (status=?
OR status=?) AND checkaction=? AND checktime<=? AND EXISTS(SELECT 'x' FROM
jobs t1 WHERE  (t1.status=? OR t1.status=?) AND t1.id=t0.jobid)  ORDER BY
docpriority ASC LIMIT 1

Query 2 example params:
[P, G, R, 1414685283327, A, a]

==========================

There is one continuous job in a running state and which has ~50K documents
in a pending-purgatory ('G') state but not due for processing until another
couple of days. There are a few other jobs which are stopped and have a few
million documents in a pending-purgatory state. I don't expect Manifold to
find any documents eligible for processing in this case and it doesn't, but
it takes time and (more seriously) a big chunk of Postgresql CPU to find
the answer, 1 full CPU core for ~15 secs per MCF instance. This wouldn't be
a big deal if the queries were executed infrequently. Unfortunately, they
are run in a loop with a 2 second pause between iterations.
I'd argue that it would be relatively easy to run into this issue with a
bunch of large continuous jobs scheduled to run at varying times of the day.

With my limited knowledge of the inner workings of the Stuffer Thread, it's
not obvious to me why query 2 needs to run IF query 1 returns no hits at
all. If it's not needed, then short-circuitting it in this case would help.

Increasing the sleep time between iterations of the Stuffer Thread, in
response to there being little or no work for a number of consecutive
iterations, could be another easy win.

More profoundly would be to optimise the joins between the jobs table and
the jobqueue table. One way to do that would be to move some of the logic
to the Java code to first identify active jobs and then query the jobqueue
table just for those. I understand this would require merging and sorting
priorities in the Java code which is less than ideal. I continue to analyse
the execution plans of the two queries and experiment with slight variants
to test performance. I'll report back any important findings.

Any thoughts or advice are welcome. Also if have experience running large
continuous crawls I'd be interested to hear from you particularly on your
experience with performance and especially the performance of SQL queries.

Best regards,
Aeham

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message