manifoldcf-dev mailing list archives

From Aeham Abushwashi <aeham.abushwa...@exonar.com>
Subject Re: (Continuous) crawl performance
Date Fri, 07 Nov 2014 16:50:41 GMT
With all crawler agents stopped, the current state of the system is as
follows (using a variant of the reprioritization query):

SELECT jobid, status, COUNT(*) FROM jobqueue t0 WHERE  (status='H' OR
status='P' OR status='G') AND priorityset<1415374856000 AND checkaction='R'
AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status='B' OR t1.status='b'
OR t1.status='A' OR t1.status='a' OR t1.status='R' OR t1.status='r' OR
t1.status='O' OR t1.status='o' OR t1.status='U' OR t1.status='u') AND
t1.id=t0.jobid)
GROUP BY jobid, status ORDER BY jobid;
     jobid     | status |  count
---------------+--------+---------
 1413897188357 | P      |   47281
 1413897188357 | G      | 2420206
 1414075755977 | G      |   43399
 1415018782273 | G      |       9
 1415020466369 | G      |      24

Note that all 4 crawls are in continuous mode and status Active.

When the first agent node is started, it iterates through the above docs
and updates their priority, starting with the following query:

SELECT id,dochash,docid,jobid FROM jobqueue t0 WHERE  (status='H' OR
status='P' OR status='G') AND priorityset< [some timestamp] AND
checkaction='R' AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status='B' OR
t1.status='b' OR t1.status='A' OR t1.status='a' OR t1.status='R' OR
t1.status='r' OR t1.status='O' OR t1.status='o' OR t1.status='U' OR
t1.status='u') AND t1.id=t0.jobid) LIMIT 10000

which takes ~4 seconds for each batch of 10,000. The AgentsThread then
updates the priority of each document in the batch one by one. According to
the VisualVM profiler, this adds another ~9 seconds to each iteration and
hits PostgreSQL (I/O) really hard. With ~2.5M docs to go through, the
numbers add up.
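
As a rough back-of-the-envelope check of how those numbers add up, the
sketch below multiplies the batch count by the per-batch cost. It assumes
the ~4 s SELECT and ~9 s update figures stay constant for every batch, which
is a simplification; the function name is made up for illustration.

```python
import math

# Row counts copied from the GROUP BY output earlier in this message.
counts = [47281, 2420206, 43399, 9, 24]

BATCH_SIZE = 10000      # LIMIT used by the SELECT
SELECT_SECONDS = 4.0    # approximate time for each batch SELECT
UPDATE_SECONDS = 9.0    # approximate time for the one-by-one updates


def estimate_reprioritization(total_docs, batch_size=BATCH_SIZE,
                              per_batch=SELECT_SECONDS + UPDATE_SECONDS):
    """Return (number of batches, total seconds), assuming a constant per-batch cost."""
    batches = math.ceil(total_docs / batch_size)
    return batches, batches * per_batch


total_docs = sum(counts)
batches, seconds = estimate_reprioritization(total_docs)
print(f"{total_docs} docs, {batches} batches, ~{seconds / 60:.0f} minutes")
```

Under these assumptions a single pass over the queue takes on the order of
an hour per starting node.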

I was also wondering, and this is just speculation at this point, whether
the exact same set of docs would be re-processed every time another crawler
agent node is started. I say that because the AgentsThread sets priorityset
to the current time, which presumably would lead any subsequently started
node to hit (more or less) the same set of documents as the previous one
did.
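
To make that speculation concrete, here is a toy in-memory model. It is not
ManifoldCF code: the Doc class and both functions are hypothetical, and it
assumes each starting node uses its own start time both as the priorityset
cutoff and as the value it writes back.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: int
    status: str
    checkaction: str
    priorityset: int  # epoch milliseconds


def select_batch(queue, cutoff_ms, limit):
    """Mimic the WHERE clause: eligible status, checkaction 'R', priorityset < cutoff."""
    hits = [d for d in queue
            if d.status in ("H", "P", "G")
            and d.checkaction == "R"
            and d.priorityset < cutoff_ms]
    return hits[:limit]


def reprioritize(queue, cutoff_ms, now_ms, limit=2):
    """Process batches until nothing matches, stamping each doc with now_ms."""
    if now_ms < cutoff_ms:
        raise ValueError("now_ms must not be earlier than cutoff_ms")
    touched = []
    while True:
        batch = select_batch(queue, cutoff_ms, limit)
        if not batch:
            return touched
        for d in batch:
            d.priorityset = now_ms  # doc drops out of this node's result set
            touched.append(d.doc_id)


queue = [Doc(i, "G", "R", 0) for i in range(5)]
first = reprioritize(queue, cutoff_ms=1000, now_ms=1000)   # first node starts
second = reprioritize(queue, cutoff_ms=2000, now_ms=2000)  # second node starts later
print(first, second)
```

In this model the second node touches every document the first one did,
because a priorityset written at the first node's start time is still
smaller than the second node's later cutoff. Whether the real code behaves
this way depends on how the cutoff timestamp is actually chosen.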


For reference, the execution plan of the SELECT query is as follows:

 EXPLAIN ANALYZE SELECT id,dochash,docid,jobid FROM jobqueue t0 WHERE
(status='H' OR status='P' OR status='G') AND priorityset<1415374856000 AND
checkaction='R' AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status='B' OR
t1.status='b' OR t1.status='A' OR t1.status='a' OR t1.status='R' OR
t1.status='r' OR t1.status='O' OR t1.status='o' OR t1.status='U' OR
t1.status='u') AND t1.id=t0.jobid) LIMIT 10000;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Limit  (cost=19486.41..57634.31 rows=10000 width=143) (actual
time=1646.862..4869.705 rows=10000 loops=1)
   ->  Hash Join  (cost=19486.41..1456197.07 rows=376616 width=143) (actual
time=1646.859..4864.841 rows=10000 loops=1)
         Hash Cond: (t0.jobid = t1.id)
         ->  Bitmap Heap Scan on jobqueue t0  (cost=19401.16..1451681.36
rows=690462 width=143) (actual time=1646.308..4096.874 rows=2768867 loops=1)
               Recheck Cond: (((status = 'H'::bpchar) AND (priorityset <
1415374856000::bigint)) OR ((status = 'P'::bpchar) AND (checkaction =
'R'::bpchar)) OR ((status = 'G'::bpchar) AND (checkaction = 'R'::bpchar)))
               Filter: ((priorityset < 1415374856000::bigint) AND
(checkaction = 'R'::bpchar))
               ->  BitmapOr  (cost=19401.16..19401.16 rows=710369 width=0)
(actual time=1456.979..1456.979 rows=0 loops=1)
                     ->  Bitmap Index Scan on i1392985450173
(cost=0.00..10.31 rows=1 width=0) (actual time=0.029..0.029 rows=0 loops=1)
                           Index Cond: ((status = 'H'::bpchar) AND
(priorityset < 1415374856000::bigint))
                     ->  Bitmap Index Scan on i1392985450174
(cost=0.00..5524.25 rows=207759 width=0) (actual time=349.283..349.283
rows=1552934 loops=1)
                           Index Cond: ((status = 'P'::bpchar) AND
(checkaction = 'R'::bpchar))
                     ->  Bitmap Index Scan on i1392985450174
(cost=0.00..13348.76 rows=502610 width=0) (actual time=1107.663..1107.663
rows=4266856 loops=1)
                           Index Cond: ((status = 'G'::bpchar) AND
(checkaction = 'R'::bpchar))
         ->  Hash  (cost=85.10..85.10 rows=12 width=8) (actual
time=0.074..0.074 rows=4 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 2kB
               ->  Bitmap Heap Scan on jobs t1  (cost=46.64..85.10 rows=12
width=8) (actual time=0.058..0.068 rows=4 loops=1)
                     Recheck Cond: ((status = 'B'::bpchar) OR (status =
'b'::bpchar) OR (status = 'A'::bpchar) OR (status = 'a'::bpchar) OR (status
= 'R'::bpchar) OR (status = 'r'::bpchar) OR (status = 'O'::bpchar) OR
(status = 'o'::bpchar) OR (status = 'U'::bpchar) OR (status = 'u'::bpchar))
                     ->  BitmapOr  (cost=46.64..46.64 rows=13 width=0)
(actual time=0.049..0.049 rows=0 loops=1)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.019..0.019 rows=0 loops=1)
                                 Index Cond: (status = 'B'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
                                 Index Cond: (status = 'b'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..8.28 rows=4 width=0) (actual time=0.008..0.008 rows=4 loops=1)
                                 Index Cond: (status = 'A'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                                 Index Cond: (status = 'a'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
                                 Index Cond: (status = 'R'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                                 Index Cond: (status = 'r'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                                 Index Cond: (status = 'O'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                                 Index Cond: (status = 'o'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                                 Index Cond: (status = 'U'::bpchar)
                           ->  Bitmap Index Scan on i1392985450183
(cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                                 Index Cond: (status = 'u'::bpchar)
 Total runtime: 4872.335 ms
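
One thing that can be read directly off the plan is how far the planner's
row estimates are from the actual row counts on the jobqueue side. The
sketch below only does that arithmetic on numbers copied from the plan; the
node labels are shorthand and it draws no conclusion about the cause.

```python
# (node label, planner row estimate, actual rows), copied from the plan above.
nodes = [
    ("Bitmap Heap Scan on jobqueue", 690462, 2768867),
    ("Bitmap Index Scan, status='P'", 207759, 1552934),
    ("Bitmap Index Scan, status='G'", 502610, 4266856),
]


def underestimate_factor(estimated, actual):
    """How many times larger the actual row count is than the estimate."""
    return actual / estimated


factors = {label: underestimate_factor(est, act) for label, est, act in nodes}
for label, factor in factors.items():
    print(f"{label}: actual rows are {factor:.1f}x the estimate")
```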
