Subject: Re: (Continuous) crawl performance
From: Karl Wright <daddywri@gmail.com>
To: dev <dev@manifoldcf.apache.org>
Date: Sun, 2 Nov 2014 21:58:24 -0500

Ok, in order for everything to work as planned, you will also need the patch
for CONNECTORS-1091 applied. In fact, this latter problem is serious enough
that it may warrant a 1.7.2 release, along with the CONNECTORS-1090 fixes.

Karl

On Sun, Nov 2, 2014 at 8:51 PM, Karl Wright wrote:

> Unfortunately, I've found that this change triggers another bug. Please
> hold off until I have a fix for that one as well.
>
> Karl
>
>
> On Sun, Nov 2, 2014 at 9:34 AM, Karl Wright wrote:
>
>> Hi Aeham,
>>
>> The more I look at this, the more I'm convinced that there must be a bug
>> elsewhere that permits jobqueue records with non-null document priorities
>> to hang around.
>>
>> If you look at this JobManager method:
>>
>>   public void finishJobStops(long timestamp, ArrayList modifiedJobs)
>>     throws ManifoldCFException
>>
>> ... which is used whenever a job is stopped for any reason, you will note
>> that the logic in question is *supposed* to clear the document priority
>> field for all documents belonging to that job. If that works as it is
>> supposed to, then there should be *no* documents in the docpriority index
>> with a document priority which belong to inactive or paused jobs.
>>
>> So the only way that this index can be slow is if, perchance, PostgreSQL
>> puts "null" values for a field FIRST in the index.
>> But:
>>
>> >>>>>>
>>   /** Clear all document priorities for a job */
>>   public void clearDocPriorities(Long jobID)
>>     throws ManifoldCFException
>>   {
>>     HashMap map = new HashMap();
>>     map.put(prioritySetField,null);
>>     map.put(docPriorityField,nullDocPriority);
>>     ArrayList list = new ArrayList();
>>     String query = buildConjunctionClause(list,new ClauseDescription[]{
>>       new UnitaryClause(jobIDField,jobID)});
>>     performUpdate(map,"WHERE "+query,list,null);
>>     noteModifications(0,1,0);
>>   }
>> <<<<<<
>>
>> ... where "nullDocPriority" is declared as:
>>
>> >>>>>>
>>   public static final double noDocPriorityValue = 1e9;
>>   public static final Double nullDocPriority =
>>     new Double(noDocPriorityValue + 1.0);
>> <<<<<<
>>
>> ... which is a value picked to be last, always. (It was necessary to
>> pick a specific value because MySQL did not put nulls at the end, but
>> rather at the beginning.)
>>
>> So now I am very curious as to what is *really* at the front of this
>> index -- so please execute the query I sent and let me know.
>>
>> It occurs to me that there is another way for the query to be slow,
>> however -- which is if there are insufficient matching rows to meet the
>> LIMIT size. In this case, the entire index would be scanned. But this can
>> be readily addressed by simply putting in an appropriate WHERE condition:
>>
>> >>>>>>
>>   new UnitaryClause("t0."+jobQueue.docPriorityField,"<",nullDocPriority),
>> <<<<<<
>>
>> If the results of the query I asked you to run look fine, then we may
>> want to try this next. I've attached a second patch to CONNECTORS-1090.
>> Please apply both patches to see if that fixes your query execution time.
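The sentinel approach described above can be sketched in plain Java. This is an illustration only, not ManifoldCF code: the constant names mirror the declarations quoted above, but the list of priorities is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration (not ManifoldCF code): a large sentinel value keeps
// "cleared" documents at the end of an ascending docpriority sort,
// independent of how a particular database orders NULLs.
public class SentinelDemo {
    public static final double noDocPriorityValue = 1e9;
    public static final double nullDocPriority = noDocPriorityValue + 1.0;

    public static void main(String[] args) {
        // A mix of real priorities and sentinel ("cleared") priorities.
        List<Double> docPriorities = new ArrayList<>(
            Arrays.asList(nullDocPriority, 0.5, 12.0, nullDocPriority, 3.7));
        // Ascending sort, like ORDER BY docpriority on the index.
        Collections.sort(docPriorities);
        System.out.println(docPriorities);
        // Real priorities come first; sentinel rows sort last, so a clause
        // like "docpriority < nullDocPriority" lets a scan stop early.
    }
}
```

The same property is what makes the proposed `UnitaryClause("t0."+jobQueue.docPriorityField,"<",nullDocPriority)` condition safe: it can never exclude a document that still has a real priority.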
>>
>> Thanks,
>> Karl
>>
>>
>> On Sat, Nov 1, 2014 at 8:09 AM, Karl Wright wrote:
>>
>>> Hi Aeham,
>>>
>>> To figure out what is going on, it would probably be sufficient to
>>> execute the following queries and send me the results:
>>>
>>>   EXPLAIN SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>     t0.failtime,t0.failcount,t0.priorityset
>>>   FROM jobqueue t0
>>>   WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
>>>   ORDER BY t0.docpriority LIMIT 20;
>>>
>>>   SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>     t0.failtime,t0.failcount,t0.priorityset
>>>   FROM jobqueue t0
>>>   WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
>>>   ORDER BY t0.docpriority LIMIT 20;
>>>
>>> These should both read directly from the index in question; we can see
>>> immediately what kinds of documents it is having to skip to get to the
>>> active ones.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Fri, Oct 31, 2014 at 2:42 PM, Karl Wright wrote:
>>>
>>>> Hi Aeham,
>>>>
>>>> I've looked at this at some level of detail now, and have some
>>>> conclusions.
>>>>
>>>> (1) The first query you provided is in fact using the proper index for
>>>> ordering; it's just not obvious from the plan. The way you can tell is
>>>> if there is an explicit sort/limit. The primary jobqueue index scan is
>>>> happening within an anti-join and a semi-join; you can tell this by
>>>> cutting out all the EXISTS clauses, getting a reduced plan that way,
>>>> and comparing.
>>>> For instance, here's an explain of the simple query:
>>>>
>>>> >>>>>>
>>>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>>   t0.failtime,t0.failcount,t0.priorityset
>>>> FROM jobqueue t0 ORDER BY t0.docpriority LIMIT 240;
>>>>
>>>>                            QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Limit  (cost=0.29..38.20 rows=240 width=161)
>>>>    ->  Index Scan using i1413214448001 on jobqueue t0
>>>>          (cost=0.29..2312.39 rows=14634 width=161)
>>>> (2 rows)
>>>> <<<<<<
>>>>
>>>> Applying filtering criteria:
>>>>
>>>> >>>>>>
>>>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>>   t0.failtime,t0.failcount,t0.priorityset
>>>> FROM jobqueue t0
>>>> WHERE (t0.status='P' OR t0.status='G') AND t0.checkaction='R'
>>>>   AND t0.checktime<=2414771559577
>>>> ORDER BY t0.docpriority LIMIT 240;
>>>>
>>>>                            QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Limit  (cost=0.29..68.98 rows=240 width=161)
>>>>    ->  Index Scan using i1413214448001 on jobqueue t0
>>>>          (cost=0.29..2389.31 rows=8347 width=161)
>>>>        Index Cond: ((checkaction = 'R'::bpchar) AND
>>>>          (checktime <= 2414771559577::bigint))
>>>>        Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>>> (4 rows)
>>>> <<<<<<
>>>>
>>>> Now, this is a bit strange, because there is a filter on the status, so
>>>> the index isn't helping there, and yet the index includes the status:
>>>>
>>>>   "i1413214448001" btree (docpriority, status, checkaction, checktime)
>>>>
>>>> But maybe this is a statistical optimization? I don't know.
>>>>
>>>> Anyhow, as soon as a NOT EXISTS clause is added, you see an anti-join:
>>>>
>>>> >>>>>>
>>>> dbname=# explain SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,
>>>>   t0.failtime,t0.failcount,t0.priorityset
>>>> FROM jobqueue t0
>>>> WHERE (t0.status='P' OR t0.status='G') AND t0.checkaction='R'
>>>>   AND t0.checktime<=2414771559577
>>>>   AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4
>>>>     WHERE t0.id=t3.owner AND t3.eventname=t4.name)
>>>> ORDER BY t0.docpriority LIMIT 240;
>>>>
>>>>                            QUERY PLAN
>>>> ----------------------------------------------------------------------
>>>>  Limit  (cost=0.57..231.40 rows=240 width=161)
>>>>    ->  Nested Loop Anti Join  (cost=0.57..7957.24 rows=8273 width=161)
>>>>          ->  Index Scan using i1413214448001 on jobqueue t0
>>>>                (cost=0.29..2389.31 rows=8347 width=161)
>>>>              Index Cond: ((checkaction = 'R'::bpchar) AND
>>>>                (checktime <= 2414771559577::bigint))
>>>>              Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>>>          ->  Nested Loop  (cost=0.29..0.66 rows=1 width=8)
>>>>                ->  Index Scan using i1413214448007 on prereqevents t3
>>>>                      (cost=0.14..0.17 rows=1 width=524)
>>>>                    Index Cond: (t0.id = owner)
>>>>                ->  Index Only Scan using events_pkey on events t4
>>>>                      (cost=0.14..0.48 rows=1 width=516)
>>>>                    Index Cond: (name = (t3.eventname)::text)
>>>> (10 rows)
>>>> <<<<<<
>>>>
>>>> The index scan is still there and is apparently providing the sorting,
>>>> which is still good. It's just inside an anti-join.
>>>>
>>>> So you can take this further and add the two remaining EXISTS/NOT
>>>> EXISTS clauses, and then your first plan makes sense.
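For readers unfamiliar with the plan node: a nested-loop anti-join is how the executor evaluates NOT EXISTS, emitting an outer row only when no inner row matches it. A minimal Java sketch of the semantics (illustrative only; the method name, the use of a hash set for the inner side, and the sample IDs are all assumptions, not ManifoldCF or PostgreSQL code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustration: the relational semantics of NOT EXISTS as an anti-join.
public class AntiJoinDemo {
    // Keep each jobqueue id only if NO prereqevents row owns it.
    public static List<Long> antiJoin(List<Long> jobqueueIds,
                                      Set<Long> prereqOwners) {
        List<Long> out = new ArrayList<>();
        for (Long id : jobqueueIds) {        // outer scan (jobqueue t0)
            if (!prereqOwners.contains(id))  // inner probe (prereqevents t3)
                out.add(id);                 // emit only when no match exists
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> ids = List.of(1L, 2L, 3L, 4L);
        Set<Long> blocked = Set.of(2L, 4L);
        System.out.println(antiJoin(ids, blocked)); // [1, 3]
    }
}
```

The key point for the plans above is that the anti-join consumes the outer rows in the order the index scan produces them, which is why the docpriority ordering survives without a separate Sort node.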
>>>> But your proposed *second* plan is clearly *not* using the docpriority
>>>> index for sorting, since it sorts by itself later on:
>>>>
>>>> >>>>>>
>>>>  Limit  (cost=122440.72..122441.42 rows=280 width=177)
>>>>    (actual time=1010.722..1010.722 rows=0 loops=1)
>>>>    ->  Sort  (cost=122440.72..122451.13 rows=4164 width=177)
>>>>          (actual time=1010.719..1010.719 rows=0 loops=1)
>>>>        Sort Key: t0.docpriority
>>>>        Sort Method: quicksort  Memory: 25kB
>>>> <<<<<<
>>>>
>>>> So what we need to understand now is why the first query is slow. It's
>>>> reading from the correct index, but is nevertheless taking a long time,
>>>> possibly because there are a lot of records with status 'P' or 'G' that
>>>> belong to jobs that are not in state 'a' or 'A', which it has to go
>>>> through before it assembles 240 results. Can you confirm this picture?
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Oct 31, 2014 at 1:09 PM, Aeham Abushwashi
>>>> <aeham.abushwashi@exonar.com> wrote:
>>>>
>>>>> That's great news, thanks Karl! On my part, I've been digging deeper
>>>>> into the execution plans of the two queries, and I've made a similar
>>>>> observation of the ORDER BY skewing the original query's plan
>>>>> significantly. However, I have yet to figure out why the same ORDER BY
>>>>> clause doesn't have a similar impact on the modified query that uses
>>>>> JOIN. I'll look into this a bit more over the weekend.
>>>>>
>>>>> Best regards,
>>>>> Aeham
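The failure mode discussed in this thread, an ordered index scan that is nevertheless slow, can be sketched as follows. This is illustrative Java with made-up row counts and a simplified two-state status model; it is not ManifoldCF or PostgreSQL code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration: an index scan in priority order must still examine every
// non-matching row it passes before a LIMIT's worth of matches is found.
public class LimitScanDemo {
    public record Row(double docPriority, char status) {}

    // Mimics "WHERE status IN ('P','G') ORDER BY docpriority LIMIT n"
    // over an index already sorted by docPriority.
    // Returns {rowsExamined, rowsReturned}.
    public static int[] scan(List<Row> index, int limit) {
        int examined = 0, returned = 0;
        for (Row r : index) {
            examined++;
            if (r.status() == 'P' || r.status() == 'G') {
                if (++returned == limit) break;
            }
        }
        return new int[]{examined, returned};
    }

    public static void main(String[] args) {
        List<Row> index = new ArrayList<>();
        // 10,000 stale rows ('C') that sort ahead of 240 pending rows ('P').
        for (int i = 0; i < 10_000; i++) index.add(new Row(i, 'C'));
        for (int i = 0; i < 240; i++) index.add(new Row(10_000 + i, 'P'));
        int[] result = scan(index, 240);
        System.out.println("examined=" + result[0] + " returned=" + result[1]);
        // prints examined=10240 returned=240: the scan walks all 10,240
        // rows to return just 240 matches, with no Sort node to blame.
    }
}
```

This is exactly why clearing document priorities on job stop (so stale rows carry the end-of-index sentinel) and adding the `docpriority < nullDocPriority` condition both attack the same symptom: they keep the scan from wading through rows that can never match.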