manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: (Continuous) crawl performance
Date Sun, 02 Nov 2014 14:34:02 GMT
Hi Aeham,

The more I look at this, the more I'm convinced that there must be a bug
elsewhere that permits jobqueue records with non-null document priorities
to hang around.

If you look at this JobManager method:

public void finishJobStops(long timestamp, ArrayList modifiedJobs)
    throws ManifoldCFException

... which is used whenever a job is stopped for any reason, you will note
that the logic in question is *supposed* to clear the document priority
field for all documents belonging to that job.  If that works as intended,
then there should be *no* documents belonging to inactive or paused jobs
that still carry a real document priority in the docpriority index.
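
If you want to test that invariant directly, something like this should do it
(a sketch only; I'm assuming the jobs table exposes id and status columns,
matching the job states 'a'/'A' discussed below):

>>>>>>
-- Count jobqueue rows that still carry a real document priority (anything
-- below the nullDocPriority sentinel, 1e9 + 1.0) but belong to jobs that
-- are not in state 'a' or 'A'.  Any rows returned violate the invariant.
SELECT t1.status AS job_status, COUNT(*)
FROM jobqueue t0, jobs t1
WHERE t0.jobid = t1.id
AND t0.docpriority < 1000000001.0
AND t1.status NOT IN ('a','A')
GROUP BY t1.status;
<<<<<<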

So the only way that this index can be slow is if, perchance, PostgreSQL
puts "null" values for a field FIRST in the index.  But:

>>>>>>
  /** Clear all document priorities for a job */
  public void clearDocPriorities(Long jobID)
    throws ManifoldCFException
  {
    HashMap map = new HashMap();
    // prioritySetField is cleared to SQL NULL, but docPriorityField is set
    // to the nullDocPriority sentinel (not NULL), so these rows sort last.
    map.put(prioritySetField,null);
    map.put(docPriorityField,nullDocPriority);
    ArrayList list = new ArrayList();
    String query = buildConjunctionClause(list,new ClauseDescription[]{
      new UnitaryClause(jobIDField,jobID)});
    performUpdate(map,"WHERE "+query,list,null);
    noteModifications(0,1,0);
  }
<<<<<<

... where "nullDocPriority" is declared as:

>>>>>>
  public static final double noDocPriorityValue = 1e9;
  public static final Double nullDocPriority = new Double(noDocPriorityValue + 1.0);
<<<<<<

... which is a value picked so that these rows always sort last.  (A specific
sentinel value was necessary because MySQL puts nulls at the beginning of an
index, not at the end.)
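
For what it's worth, PostgreSQL's default is the opposite: in ascending
order, NULLs sort *last* unless NULLS FIRST is requested explicitly.  A
quick scratch-table check (a sketch, using a throwaway temp table):

>>>>>>
-- PostgreSQL puts NULLs after all non-null values in ascending order.
CREATE TEMP TABLE priority_check (docpriority double precision);
INSERT INTO priority_check VALUES (1.0), (NULL), (1000000001.0);
SELECT docpriority FROM priority_check ORDER BY docpriority;
-- returns 1.0, 1000000001.0, NULL
SELECT docpriority FROM priority_check ORDER BY docpriority NULLS FIRST;
-- returns NULL, 1.0, 1000000001.0
<<<<<<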

So now I am very curious as to what is *really* at the front of this index
-- so please execute the query I sent and let me know.

It occurs to me that there is another way for the query to be slow, however
-- which is if there are insufficient matching rows to meet the LIMIT
size.  In this case, the entire index would be scanned.  But this can be
readily addressed by simply putting in an appropriate WHERE condition:

>>>>>>
new UnitaryClause("t0."+jobQueue.docPriorityField,"<",nullDocPriority),
<<<<<<
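
In SQL terms, that bounds the index scan to rows with a real document
priority.  Applied to the diagnostic query from my earlier mail, it would
look roughly like this (a sketch, with nullDocPriority expanded to its
literal value):

>>>>>>
EXPLAIN SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
FROM jobqueue t0 WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
AND t0.docpriority<1000000001.0
ORDER BY t0.docpriority LIMIT 20;
<<<<<<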

If the results of the query I asked you to run look fine, then we may want
to try this next.  I've attached a second patch to CONNECTORS-1090.  Please
apply both patches to see if that fixes your query execution time.


Thanks,
Karl



On Sat, Nov 1, 2014 at 8:09 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Aeham,
>
> To figure out what is going on, it would probably be sufficient to execute
> the following queries and send me the results:
>
> EXPLAIN SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0 WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
> ORDER BY t0.docpriority LIMIT 20;
> SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0 WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
> ORDER BY t0.docpriority LIMIT 20;
>
> These should both read directly from the index in question; we can see
> immediately what kinds of documents it is having to skip to get to the
> active ones.
>
> Thanks!
> Karl
>
>
> On Fri, Oct 31, 2014 at 2:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Aeham,
>>
>> I've looked at this at some level of detail now, and have some
>> conclusions.
>>
>> (1) The first query you provided is in fact using the proper index for
>> ordering; it's just not obvious from the plan.  The way you can tell is by
>> looking for an explicit Sort node: if there is none, only a Limit, then
>> the index itself is supplying the order.  The primary jobqueue index scan
>> is happening within an anti-join and a semi-join; you can verify this by
>> cutting out all the EXISTS clauses, getting a reduced plan that way, and
>> comparing.  For instance, here's an explain of the simple query:
>>
>> >>>>>>
>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 ORDER BY t0.docpriority LIMIT 240;
>>                                            QUERY PLAN
>> -------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.29..38.20 rows=240 width=161)
>>    ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2312.39 rows=14634 width=161)
>> (2 rows)
>> <<<<<<
>>
>> Applying filtering criteria:
>>
>> >>>>>>
>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 WHERE (t0.status='P' OR t0.status='G') AND
>> t0.checkaction='R' AND t0.checktime<=2414771559577 ORDER BY t0.docpriority
>> LIMIT 240;
>>                                            QUERY PLAN
>> ------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.29..68.98 rows=240 width=161)
>>    ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2389.31 rows=8347 width=161)
>>          Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 2414771559577::bigint))
>>          Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>> (4 rows)
>> <<<<<<
>>
>> Now, this is a bit strange, because there is a filter on the status, so
>> the index isn't helping there, even though the index includes the status
>> column: "i1413214448001" btree (docpriority, status, checkaction, checktime).
>> Perhaps the OR over two status values can't be expressed as a scan key on
>> a non-leading index column, so it is applied as a filter, while the
>> single-value checkaction and the checktime range can be.  But maybe this
>> is a statistical optimization?  I don't know.
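>>
>> One quick experiment (a sketch; I haven't verified that the planner
>> actually does better with it) would be to rewrite the OR as an IN and
>> re-run EXPLAIN, to see whether status can then appear as an index
>> condition:
>>
>> >>>>>>
>> EXPLAIN SELECT t0.id
>> FROM jobqueue t0
>> WHERE t0.status IN ('P','G') AND t0.checkaction='R'
>> AND t0.checktime<=2414771559577
>> ORDER BY t0.docpriority LIMIT 240;
>> <<<<<<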
>>
>> Anyhow, as soon as a NOT EXISTS clause is added, you see an anti-join:
>>
>> >>>>>>
>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 WHERE (t0.status='P' OR t0.status='G') AND
>> t0.checkaction='R' AND t0.checktime<=2414771559577 AND NOT EXISTS(SELECT
>> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=
>> t4.name) ORDER BY t0.docpriority LIMIT 240;
>>                                                 QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.57..231.40 rows=240 width=161)
>>    ->  Nested Loop Anti Join  (cost=0.57..7957.24 rows=8273 width=161)
>>          ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2389.31 rows=8347 width=161)
>>                Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 2414771559577::bigint))
>>                Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>          ->  Nested Loop  (cost=0.29..0.66 rows=1 width=8)
>>                ->  Index Scan using i1413214448007 on prereqevents t3  (cost=0.14..0.17 rows=1 width=524)
>>                      Index Cond: (t0.id = owner)
>>                ->  Index Only Scan using events_pkey on events t4  (cost=0.14..0.48 rows=1 width=516)
>>                      Index Cond: (name = (t3.eventname)::text)
>> (10 rows)
>> <<<<<<
>>
>> The index scan is still there and is apparently still providing the
>> sorting, which is good; it's just inside an anti-join.
>>
>> So you can take this further and add the two remaining EXISTS/NOT EXISTS
>> clauses, and then your first plan makes sense.  But your proposed *second*
>> plan is clearly *not* using the docpriority index for sorting, since it
>> performs an explicit sort later on:
>>
>> >>>>>>
>>  Limit  (cost=122440.72..122441.42 rows=280 width=177) (actual time=1010.722..1010.722 rows=0 loops=1)
>>    ->  Sort  (cost=122440.72..122451.13 rows=4164 width=177) (actual time=1010.719..1010.719 rows=0 loops=1)
>>          Sort Key: t0.docpriority
>>          Sort Method: quicksort  Memory: 25kB
>> <<<<<<
>>
>> So what we need to understand now is why the first query is slow.  It's
>> reading from the correct index, but is nevertheless taking a long time,
>> possibly because there are a lot of records with status 'P' or 'G',
>> belonging to jobs that are not in state 'a' or 'A', that it has to skip
>> over before it assembles 240 results.  Can you confirm this picture?
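>>
>> Something along these lines would confirm it (a sketch; assuming the jobs
>> table carries id and status columns):
>>
>> >>>>>>
>> SELECT t1.status AS job_status, COUNT(*)
>> FROM jobqueue t0, jobs t1
>> WHERE t0.jobid = t1.id
>> AND t0.status IN ('P','G')
>> GROUP BY t1.status;
>> <<<<<<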
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Oct 31, 2014 at 1:09 PM, Aeham Abushwashi <
>> aeham.abushwashi@exonar.com> wrote:
>>
>>> That's great news, thanks Karl! On my part, I've been digging deeper into
>>> the execution plans of the two queries, and I've made a similar observation
>>> of the ORDER BY skewing the original query's plan significantly.  However,
>>> I have yet to figure out why the same ORDER BY clause doesn't have a
>>> similar impact on the modified query that uses JOIN.  I'll look into this
>>> a bit more over the weekend.
>>>
>>> Best regards,
>>> Aeham
>>>
>>
>>
>
