Subject: Re: (Continuous) crawl performance
From: Karl Wright <daddywri@gmail.com>
To: dev <dev@manifoldcf.apache.org>
Date: Sun, 2 Nov 2014 21:58:24 -0500

Ok, in order for everything to work as planned, you will also need the patch
for CONNECTORS-1091 applied. In fact, this latter problem is serious enough
that it may warrant a 1.7.2 release, along with the CONNECTORS-1090 fixes.

Karl

On Sun, Nov 2, 2014 at 8:51 PM, Karl Wright wrote:

> Unfortunately, I've found that this change triggers another bug. Please
> hold off until I have a fix for that one as well.
>
> Karl
>
>
> On Sun, Nov 2, 2014 at 9:34 AM, Karl Wright wrote:
>
>> Hi Aeham,
>>
>> The more I look at this, the more I'm convinced that there must be a bug
>> elsewhere that permits jobqueue records with non-null document priorities
>> to hang around.
>>
>> If you look at this JobManager method:
>>
>>   public void finishJobStops(long timestamp, ArrayList modifiedJobs)
>>     throws ManifoldCFException
>>
>> ... which is used whenever a job is stopped for any reason, you will note
>> that the logic in question is *supposed* to clear the document priority
>> field for all documents belonging to that job. If that works as it is
>> supposed to, then there should be *no* documents in the docpriority index
>> with a document priority which belong to inactive or paused jobs.
>>
>> So the only way that this index can be slow is if, perchance, PostgreSQL
>> puts "null" values for a field FIRST in the index.
>> But:
>>
>> >>>>>>
>>   /** Clear all document priorities for a job */
>>   public void clearDocPriorities(Long jobID)
>>     throws ManifoldCFException
>>   {
>>     HashMap map = new HashMap();
>>     map.put(prioritySetField,null);
>>     map.put(docPriorityField,nullDocPriority);
>>     ArrayList list = new ArrayList();
>>     String query = buildConjunctionClause(list,new ClauseDescription[]{
>>       new UnitaryClause(jobIDField,jobID)});
>>     performUpdate(map,"WHERE "+query,list,null);
>>     noteModifications(0,1,0);
>>   }
>> <<<<<<
>>
>> ... where "nullDocPriority" is declared as:
>>
>> >>>>>>
>>   public static final double noDocPriorityValue = 1e9;
>>   public static final Double nullDocPriority =
>>     new Double(noDocPriorityValue + 1.0);
>> <<<<<<
>>
>> ... which is a value picked to be last, always. (It was necessary to
>> pick a specific value because MySQL did not put nulls at the end, but
>> rather at the beginning.)
>>
>> So now I am very curious as to what is *really* at the front of this
>> index -- so please execute the query I sent and let me know.
>>
>> It occurs to me that there is another way for the query to be slow,
>> however -- which is if there are insufficient matching rows to meet the
>> LIMIT size. In this case, the entire index would be scanned. But this can
>> be readily addressed by simply putting in an appropriate WHERE condition:
>>
>> >>>>>>
>>   new UnitaryClause("t0."+jobQueue.docPriorityField,"<",nullDocPriority),
>> <<<<<<
>>
>> If the results of the query I asked you to run look fine, then we may
>> want to try this next. I've attached a second patch to CONNECTORS-1090.
>> Please apply both patches to see if that fixes your query execution time.
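The sentinel approach described above can be sketched in plain Java. This is an illustration only, not ManifoldCF code: the constant names mirror the declarations quoted above, but the list of priorities is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration (not ManifoldCF code): a large sentinel value keeps
// "cleared" documents at the end of an ascending docpriority sort,
// independent of how a particular database orders NULLs.
public class SentinelDemo {
    public static final double noDocPriorityValue = 1e9;
    public static final double nullDocPriority = noDocPriorityValue + 1.0;

    public static void main(String[] args) {
        // A mix of real priorities and sentinel ("cleared") priorities.
        List<Double> docPriorities = new ArrayList<>(
            Arrays.asList(nullDocPriority, 0.5, 12.0, nullDocPriority, 3.7));
        // Ascending sort, like ORDER BY docpriority on the index.
        Collections.sort(docPriorities);
        System.out.println(docPriorities);
        // Real priorities come first; sentinel rows sort last, so a clause
        // like "docpriority < nullDocPriority" lets a scan stop early.
    }
}
```

The same property is what makes the proposed `UnitaryClause("t0."+jobQueue.docPriorityField,"<",nullDocPriority)` condition safe: it can never exclude a document that still has a real priority.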
>>
>> Thanks,
>> Karl
>>
>>
>> On Sat, Nov 1, 2014 at 8:09 AM, Karl Wright wrote:
>>
>>> Hi Aeham,
>>>
>>> To figure out what is going on, it would probably be sufficient to
>>> execute the following queries and send me the results:
>>>
>>>   EXPLAIN SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>     t0.failtime,t0.failcount,t0.priorityset
>>>   FROM jobqueue t0
>>>   WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
>>>   ORDER BY t0.docpriority LIMIT 20;
>>>
>>>   SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>     t0.failtime,t0.failcount,t0.priorityset
>>>   FROM jobqueue t0
>>>   WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
>>>   ORDER BY t0.docpriority LIMIT 20;
>>>
>>> These should both read directly from the index in question; we can see
>>> immediately what kinds of documents it is having to skip to get to the
>>> active ones.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Fri, Oct 31, 2014 at 2:42 PM, Karl Wright wrote:
>>>
>>>> Hi Aeham,
>>>>
>>>> I've looked at this at some level of detail now, and have some
>>>> conclusions.
>>>>
>>>> (1) The first query you provided is in fact using the proper index for
>>>> ordering; it's just not obvious from the plan. The way you can tell is
>>>> if there is an explicit sort/limit. The primary jobqueue index scan is
>>>> happening within an anti-join and a semi-join; you can tell this by
>>>> cutting out all the EXISTS clauses, getting a reduced plan that way,
>>>> and comparing.
>>>> For instance, here's an explain of the simple query:
>>>>
>>>> >>>>>>
>>>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>>   t0.failtime,t0.failcount,t0.priorityset
>>>> FROM jobqueue t0 ORDER BY t0.docpriority LIMIT 240;
>>>>
>>>>                            QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Limit  (cost=0.29..38.20 rows=240 width=161)
>>>>    ->  Index Scan using i1413214448001 on jobqueue t0
>>>>          (cost=0.29..2312.39 rows=14634 width=161)
>>>> (2 rows)
>>>> <<<<<<
>>>>
>>>> Applying filtering criteria:
>>>>
>>>> >>>>>>
>>>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>>   t0.failtime,t0.failcount,t0.priorityset
>>>> FROM jobqueue t0
>>>> WHERE (t0.status='P' OR t0.status='G') AND t0.checkaction='R'
>>>>   AND t0.checktime<=2414771559577
>>>> ORDER BY t0.docpriority LIMIT 240;
>>>>
>>>>                            QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Limit  (cost=0.29..68.98 rows=240 width=161)
>>>>    ->  Index Scan using i1413214448001 on jobqueue t0
>>>>          (cost=0.29..2389.31 rows=8347 width=161)
>>>>        Index Cond: ((checkaction = 'R'::bpchar) AND
>>>>          (checktime <= 2414771559577::bigint))
>>>>        Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>>> (4 rows)
>>>> <<<<<<
>>>>
>>>> Now, this is a bit strange, because there is a filter on the status, so
>>>> the index isn't helping there, and yet the index includes the status:
>>>>
>>>>   "i1413214448001" btree (docpriority, status, checkaction, checktime)
>>>>
>>>> But maybe this is a statistical optimization? I don't know.
>>>>
>>>> Anyhow, as soon as a NOT EXISTS clause is added, you see an anti-join:
>>>>
>>>> >>>>>>
>>>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>>   t0.failtime,t0.failcount,t0.priorityset
>>>> FROM jobqueue t0
>>>> WHERE (t0.status='P' OR t0.status='G') AND t0.checkaction='R'
>>>>   AND t0.checktime<=2414771559577
>>>>   AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4
>>>>     WHERE t0.id=t3.owner AND t3.eventname=t4.name)
>>>> ORDER BY t0.docpriority LIMIT 240;
>>>>
>>>>                            QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Limit  (cost=0.57..231.40 rows=240 width=161)
>>>>    ->  Nested Loop Anti Join  (cost=0.57..7957.24 rows=8273 width=161)
>>>>          ->  Index Scan using i1413214448001 on jobqueue t0
>>>>                (cost=0.29..2389.31 rows=8347 width=161)
>>>>              Index Cond: ((checkaction = 'R'::bpchar) AND
>>>>                (checktime <= 2414771559577::bigint))
>>>>              Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>>>          ->  Nested Loop  (cost=0.29..0.66 rows=1 width=8)
>>>>                ->  Index Scan using i1413214448007 on prereqevents t3
>>>>                      (cost=0.14..0.17 rows=1 width=524)
>>>>                    Index Cond: (t0.id = owner)
>>>>                ->  Index Only Scan using events_pkey on events t4
>>>>                      (cost=0.14..0.48 rows=1 width=516)
>>>>                    Index Cond: (name = (t3.eventname)::text)
>>>> (10 rows)
>>>> <<<<<<
>>>>
>>>> The index scan is still there and is apparently providing the sorting,
>>>> which is still good. It's just inside an anti-join.
>>>>
>>>> So you can take this further and add the two remaining EXISTS/NOT
>>>> EXISTS clauses, and then your first plan makes sense.
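For readers unfamiliar with the plan node: a nested-loop anti-join is how the executor evaluates NOT EXISTS, emitting an outer row only when no inner row matches it. A minimal Java sketch of the semantics (illustrative only; the method name, the use of a hash set for the inner side, and the sample IDs are all assumptions, not ManifoldCF or PostgreSQL code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustration: the relational semantics of NOT EXISTS as an anti-join.
public class AntiJoinDemo {
    // Keep each jobqueue id only if NO prereqevents row owns it.
    public static List<Long> antiJoin(List<Long> jobqueueIds,
                                      Set<Long> prereqOwners) {
        List<Long> out = new ArrayList<>();
        for (Long id : jobqueueIds) {        // outer scan (jobqueue t0)
            if (!prereqOwners.contains(id))  // inner probe (prereqevents t3)
                out.add(id);                 // emit only when no match exists
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> ids = List.of(1L, 2L, 3L, 4L);
        Set<Long> blocked = Set.of(2L, 4L);
        System.out.println(antiJoin(ids, blocked)); // [1, 3]
    }
}
```

The key point for the plans above is that the anti-join consumes the outer rows in the order the index scan produces them, which is why the docpriority ordering survives without a separate Sort node.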
>>>> But your proposed *second* plan is clearly *not* using the docpriority
>>>> index for sorting, since it sorts by itself later on:
>>>>
>>>> >>>>>>
>>>>  Limit  (cost=122440.72..122441.42 rows=280 width=177)
>>>>    (actual time=1010.722..1010.722 rows=0 loops=1)
>>>>    ->  Sort  (cost=122440.72..122451.13 rows=4164 width=177)
>>>>          (actual time=1010.719..1010.719 rows=0 loops=1)
>>>>        Sort Key: t0.docpriority
>>>>        Sort Method: quicksort  Memory: 25kB
>>>> <<<<<<
>>>>
>>>> So what we need to understand now is why the first query is slow. It's
>>>> reading from the correct index, but is nevertheless taking a long time,
>>>> possibly because there are a lot of records with status 'P' or 'G' that
>>>> belong to jobs that are not in state 'a' or 'A', which it has to go
>>>> through before it assembles 240 results. Can you confirm this picture?
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Oct 31, 2014 at 1:09 PM, Aeham Abushwashi
>>>> <aeham.abushwashi@exonar.com> wrote:
>>>>
>>>>> That's great news, thanks Karl! On my part, I've been digging deeper
>>>>> into the execution plans of the two queries, and I've made a similar
>>>>> observation of the ORDER BY skewing the original query's plan
>>>>> significantly. However, I have yet to figure out why the same ORDER BY
>>>>> clause doesn't have a similar impact on the modified query that uses
>>>>> JOIN. I'll look into this a bit more over the weekend.
>>>>>
>>>>> Best regards,
>>>>> Aeham
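The failure mode discussed in this thread, an ordered index scan that is nevertheless slow, can be sketched as follows. This is illustrative Java with made-up row counts and a simplified two-state status model; it is not ManifoldCF or PostgreSQL code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration: an index scan in priority order must still examine every
// non-matching row it passes before a LIMIT's worth of matches is found.
public class LimitScanDemo {
    public record Row(double docPriority, char status) {}

    // Mimics "WHERE status IN ('P','G') ORDER BY docpriority LIMIT n"
    // over an index already sorted by docPriority.
    // Returns {rowsExamined, rowsReturned}.
    public static int[] scan(List<Row> index, int limit) {
        int examined = 0, returned = 0;
        for (Row r : index) {
            examined++;
            if (r.status() == 'P' || r.status() == 'G') {
                if (++returned == limit) break;
            }
        }
        return new int[]{examined, returned};
    }

    public static void main(String[] args) {
        List<Row> index = new ArrayList<>();
        // 10,000 stale rows ('C') that sort ahead of 240 pending rows ('P').
        for (int i = 0; i < 10_000; i++) index.add(new Row(i, 'C'));
        for (int i = 0; i < 240; i++) index.add(new Row(10_000 + i, 'P'));
        int[] result = scan(index, 240);
        System.out.println("examined=" + result[0] + " returned=" + result[1]);
        // prints examined=10240 returned=240: the scan walks all 10,240
        // rows to return just 240 matches, with no Sort node to blame.
    }
}
```

This is exactly why clearing document priorities on job stop (so stale rows carry the end-of-index sentinel) and adding the `docpriority < nullDocPriority` condition both attack the same symptom: they keep the scan from wading through rows that can never match.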