manifoldcf-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: (Continuous) crawl performance
Date Fri, 31 Oct 2014 18:42:49 GMT
Hi Aeham,

I've looked at this at some level of detail now, and have some conclusions.

(1) The first query you provided is in fact using the proper index for
ordering; it's just not obvious from the plan.  The way you can tell is if
there is an explicit sort/limit.  The primary jobqueue index scan is
happening within an anti-join and a semi-join; you can tell this by cutting
out all the EXISTS clauses and getting a reduced plan that way, and
comparing.  For instance, here's an explain of the simple query:

>>>>>>
dbname=# explain SELECT
t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0  ORDER BY t0.docpriority LIMIT 240;
                                           QUERY PLAN

-------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..38.20 rows=240 width=161)
   ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2312.39
rows=14634 width=161)
(2 rows)
<<<<<<

Applying filtering criteria:

>>>>>>
dbname=# explain SELECT
t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0  WHERE  (t0.status='P' OR t0.status='G') AND
t0.checkaction='R' AND t0.checktime<=2414771559577 ORDER BY t0.docpriority
LIMIT 240;
                                           QUERY PLAN

------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..68.98 rows=240 width=161)
   ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2389.31
rows=8347 width=161)
         Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <=
2414771559577::bigint))
         Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
(4 rows)
<<<<<<

Now, this is a bit strange, because there is a filter on the status, so the
index isn't helping there, and yet the index includes the status:
"i1413214448001" btree (docpriority, status, checkaction, checktime)
But maybe this is a statistical optimization?  I don't know.

Anyhow, as soon as a NOT EXISTS clause is added, you see an anti-join:

>>>>>>
dbname=# explain SELECT
t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0  WHERE  (t0.status='P' OR t0.status='G') AND
t0.checkaction='R' AND t0.checktime<=2414771559577 AND NOT EXISTS(SELECT
'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=
t4.name) ORDER BY t0.docpriority LIMIT 240;
                                                QUERY PLAN

----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..231.40 rows=240 width=161)
   ->  Nested Loop Anti Join  (cost=0.57..7957.24 rows=8273 width=161)
         ->  Index Scan using i1413214448001 on jobqueue t0
(cost=0.29..2389.31 rows=8347 width=161)
               Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <=
2414771559577::bigint))
               Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
         ->  Nested Loop  (cost=0.29..0.66 rows=1 width=8)
               ->  Index Scan using i1413214448007 on prereqevents t3
(cost=0.14..0.17 rows=1 width=524)
                     Index Cond: (t0.id = owner)
               ->  Index Only Scan using events_pkey on events t4
(cost=0.14..0.48 rows=1 width=516)
                     Index Cond: (name = (t3.eventname)::text)
(10 rows)
<<<<<<

The index scan is still there and is apparently providing the sorting,
which is good still.  It's just inside an anti-join.

So you can take this further and add the two remaining EXISTS/NOT EXISTS
clauses, and then your first plan makes sense.  But your proposed *second*
plan is clearly *not* using the docpriority index for sorting, since it
sorts by itself later on:

>>>>>>
 Limit  (cost=122440.72..122441.42 rows=280 width=177) (actual
time=1010.722..1010.722 rows=0 loops=1)
   ->  Sort  (cost=122440.72..122451.13 rows=4164 width=177) (actual
time=1010.719..1010.719 rows=0 loops=1)
         Sort Key: t0.docpriority
         Sort Method: quicksort  Memory: 25kB
<<<<<<

So what we need to understand now is why the first query is slow.  It's
reading from the correct index, but is nevertheless taking a long time,
possibly because there are a lot of records with status 'P' or 'G' that
belong to jobs that are not in state 'a' or 'A' that it has to go through
before it assembles 240 results.  Can you confirm this picture?

Thanks,
Karl


On Fri, Oct 31, 2014 at 1:09 PM, Aeham Abushwashi <
aeham.abushwashi@exonar.com> wrote:

> That's great news, thanks Karl! On my part, I've been digging deeper into
> the execution plans of the two queries and I've made a similar observation
> of the ORDER BY skewing the original query's plan significantly. However, I
> have yet to figure out why the same ORDER BY clause doesn't have a similar
> impact on the modified query that uses JOIN. I'll look into this a bit more
> over the weekend.
>
> Best regards,
> Aeham
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message