manifoldcf-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: (Continuous) crawl performance
Date Thu, 30 Oct 2014 23:45:04 GMT
The ticket, in case you are interested, is CONNECTORS-1027.

Karl


On Thu, Oct 30, 2014 at 7:42 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Aeham,
>
> The problem is that you are not using MCF 1.7.1, which has a fix for the
> stuffer thread query that makes it run quickly on PostgreSQL 9.1 and above.
>
> Thanks,
> Karl
>
>
> On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi <
> aeham.abushwashi@exonar.com> wrote:
>
>> Hi,
>>
>> I have a multi-node zookeeper-based test environment with ~29M documents
>> (~36M jobqueue records and ~29M ingeststatus records) and jobs in varying
>> states of being (running vs complete vs stopped vs paused). Some jobs are
>> continuous and a few are not.
>>
>> During a quiet time -when no documents are actively ingested- I see very
>> high CPU utilisation on the Postgresql database server. This is primarily
>> a
>> result of two queries executed by the Stuffer Thread; namely the main
>> query
>> in JobManager#fetchAndProcessDocuments and the doc priority query in
>> JobManager#getNextDocuments
>>
>> ==========================
>>
>> Query 1:
>> SELECT t0.id
>> ,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0  WHERE  (t0.status=? OR t0.status=?) AND t0.checkaction=?
>> AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status=?
>> OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?) AND NOT
>> EXISTS(SELECT
>> 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN
>> (?,?,?,?,?,?) AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM
>> prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
>> ORDER BY t0.docpriority ASC LIMIT 280
>>
>> Query 1 example params:
>> [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d]
>>
>> ++++++++
>>
>> Query 2:
>> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0  WHERE  (status=?
>> OR status=?) AND checkaction=? AND checktime<=? AND EXISTS(SELECT 'x' FROM
>> jobs t1 WHERE  (t1.status=? OR t1.status=?) AND t1.id=t0.jobid)  ORDER BY
>> docpriority ASC LIMIT 1
>>
>> Query 2 example params:
>> [P, G, R, 1414685283327, A, a]
>>
>> ==========================
>>
>> There is one continuous job in a running state and which has ~50K
>> documents
>> in a pending-purgatory ('G') state but not due for processing until
>> another
>> couple of days. There are a few other jobs which are stopped and have a
>> few
>> million documents in a pending-purgatory state. I don't expect Manifold to
>> find any documents eligible for processing in this case and it doesn't,
>> but
>> it takes time and (more seriously) a big chunk of Postgresql CPU to find
>> the answer, 1 full CPU core for ~15 secs per MCF instance. This wouldn't
>> be
>> a big deal if the queries were executed infrequently. Unfortunately, they
>> are run in a loop with a 2 second pause between iterations.
>> I'd argue that it would be relatively easy to run into this issue with a
>> bunch of large continuous jobs scheduled to run at varying times of the
>> day.
>>
>> With my limited knowledge of the inner workings of the Stuffer Thread,
>> it's
>> not obvious to me why query 2 needs to run IF query 1 returns no hits at
>> all. If it's not needed, then short-circuitting it in this case would
>> help.
>>
>> Increasing the sleep time between iterations of the Stuffer Thread, in
>> response to there being little or no work for a number of consecutive
>> iterations, could be another easy win.
>>
>> More profoundly would be to optimise the joins between the jobs table and
>> the jobqueue table. One way to do that would be to move some of the logic
>> to the Java code to first identify active jobs and then query the jobqueue
>> table just for those. I understand this would require merging and sorting
>> priorities in the Java code which is less than ideal. I continue to
>> analyse
>> the execution plans of the two queries and experiment with slight variants
>> to test performance. I'll report back any important findings.
>>
>> Any thoughts or advice are welcome. Also if have experience running large
>> continuous crawls I'd be interested to hear from you particularly on your
>> experience with performance and especially the performance of SQL queries.
>>
>> Best regards,
>> Aeham
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message