Subject: Re: (Continuous) crawl performance
From: Karl Wright <daddywri@gmail.com>
To: dev@manifoldcf.apache.org
Date: Thu, 30 Oct 2014 21:04:52 -0400

Hi Aeham,

The stuffer query is designed to make use of the following index, and indeed to read rows in sorted order from that index:

  IndexDescription docpriorityIndex = new IndexDescription(false,
    new String[]{docPriorityField, statusField, checkActionField, checkTimeField});

All that the joins do is effectively filter the results, based on job status and on whether a given document is actually present in, and being worked on by, another job. The second case is rare, so any plan that emphasizes that jobqueue interaction is plain wrong. I could see the first join being meaningful in some cases, but only if you had lots of inactive jobs, each with lots of documents. Please let me know if that is the situation here.
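(For reference, the PostgreSQL-level shape of that index is roughly as follows. This is only an illustrative sketch: ManifoldCF creates the index itself under a generated name - like the i139... names visible in the plans quoted below - and the column names here are inferred from the queries.)

  -- Illustrative only; do not create this by hand, ManifoldCF manages its own schema.
  -- Column order matters: the stuffer query wants to read rows in docpriority order.
  CREATE INDEX jobqueue_docpriority_idx
    ON jobqueue (docpriority, status, checkaction, checktime);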
It is a mystery why PostgreSQL does not always use this index. In my experience, the main cause is PostgreSQL's cost optimizer making huge distinctions between there being zero of something versus one of something. You can tell readily enough whether this is what is happening: get a plan for the stuffer query, and if it is NOT reading from the index I mentioned, run ANALYZE on the jobqueue table and then get the plan again. If the second plan is fine, then this previously noted PostgreSQL problem is to blame.
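(A minimal sketch of that check, run from psql against the crawler database. The query below is a simplified form of the stuffer query - the full statement, with its EXISTS clauses and literal parameter values, appears in the plans quoted further down - and what matters is whether the plan reads from the docpriority index rather than sorting.)

  -- Get a plan for (a simplified form of) the stuffer query.
  EXPLAIN ANALYZE
    SELECT t0.id, t0.docpriority, t0.jobid, t0.dochash, t0.docid
    FROM jobqueue t0
    WHERE (t0.status='P' OR t0.status='G')
      AND t0.checkaction='R' AND t0.checktime <= 1414683318802
    ORDER BY t0.docpriority ASC LIMIT 280;

  -- Refresh the planner statistics for the table, then compare the plan again.
  ANALYZE jobqueue;

  -- Optional: see when statistics were last gathered for jobqueue.
  SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables WHERE relname = 'jobqueue';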
I'm afraid that your notion of breaking up the query and doing it in Java is extremely problematic, because then we would be forced to read huge resultsets into memory. If you demonstrate that an ANALYZE needs to be done more frequently, on the other hand, there is a configuration parameter for that.

Please let me know.

Karl


On Thu, Oct 30, 2014 at 8:28 PM, Aeham Abushwashi <aeham.abushwashi@exonar.com> wrote:

> Hi Karl,
>
> I do have that change. Paul Boichat, who was involved in the conversation
> leading up to CONNECTORS-1027, applied the fix and tested it in our
> environment.
>
> I see your point that query #1 still has the IN (...) clause indicating
> that it hasn't benefited from the CONNECTORS-1027 fix. However, in my
> execution plan that part of the query is never executed anyway - see below
> for details. In addition, query #2 is clean yet it hits the database hard.
>
> ======================
>
> Execution Plan for Query #1:
>
> EXPLAIN ANALYZE SELECT t0.id,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0
> WHERE (t0.status='P' OR t0.status='G') AND t0.checkaction='R' AND t0.checktime<=1414683318802
>   AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid AND t1.priority=5)
>   AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid)
>   AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
> ORDER BY t0.docpriority ASC LIMIT 280;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..85910.03 rows=280 width=178) (actual time=15152.258..15152.258 rows=0 loops=1)
>   -> Nested Loop Anti Join (cost=0.00..3187261.91 rows=10388 width=178) (actual time=15152.255..15152.255 rows=0 loops=1)
>        Join Filter: ((t2.jobid <> t0.jobid) AND ((t2.dochash)::text = (t0.dochash)::text))
>        -> Nested Loop Anti Join (cost=0.00..3186114.69 rows=10388 width=178) (actual time=15152.253..15152.253 rows=0 loops=1)
>             -> Nested Loop Semi Join (cost=0.00..3177053.22 rows=10388 width=178) (actual time=15152.252..15152.252 rows=0 loops=1)
>                  -> Index Scan using i1392985450172 on jobqueue t0 (cost=0.00..3141756.78 rows=124662 width=178) (actual time=686.159..4046.926 rows=3346768 loops=1)
>                       Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414683318802::bigint))
>                       Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>                  -> Index Scan using jobs_pkey on jobs t1 (cost=0.00..0.28 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=3346768)
>                       Index Cond: (id = t0.jobid)
>                       Filter: ((priority = 5) AND ((status = 'A'::bpchar) OR (status = 'a'::bpchar)))
>                       Rows Removed by Filter: 1
>             -> Nested Loop (cost=0.00..0.86 rows=1 width=8) (never executed)
>                  -> Index Scan using i1392985450178 on prereqevents t3 (cost=0.00..0.27 rows=1 width=524) (never executed)
>                       Index Cond: (t0.id = owner)
>                  -> Index Only Scan using events_pkey on events t4 (cost=0.00..0.58 rows=1 width=516) (never executed)
>                       Index Cond: (name = (t3.eventname)::text)
>                       Heap Fetches: 0
>        -> Materialize (cost=0.00..56.48 rows=6 width=49) (never executed)
>             -> Index Scan using i1392985450173 on jobqueue t2 (cost=0.00..56.45 rows=6 width=49) (never executed)
>                  Index Cond: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
> Total runtime: 15152.487 ms
>
>
> Execution Plan for Query #2:
>
> EXPLAIN ANALYZE SELECT docpriority,jobid,dochash,docid
> FROM jobqueue t0
> WHERE (status='P' OR status='G') AND checkaction='R' AND checktime<=1414685283327
>   AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid)
> ORDER BY docpriority ASC LIMIT 1;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..304.78 rows=1 width=144) (actual time=20620.458..20620.458 rows=0 loops=1)
>   -> Nested Loop Semi Join (cost=0.00..3029548.13 rows=9940 width=144) (actual time=20620.456..20620.456 rows=0 loops=1)
>        -> Index Scan using i1392985450172 on jobqueue t0 (cost=0.00..2996093.69 rows=119282 width=144) (actual time=44.118..5665.201 rows=3346768 loops=1)
>             Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414685283327::bigint))
>             Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>        -> Index Scan using jobs_pkey on jobs t1 (cost=0.00..0.28 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=3346768)
>             Index Cond: (id = t0.jobid)
>             Filter: ((status = 'A'::bpchar) OR (status = 'a'::bpchar))
>             Rows Removed by Filter: 1
> Total runtime: 20620.642 ms
>
> ======================
>
>
> Best regards,
> Aeham
>
> On 30 October 2014 23:45, Karl Wright wrote:
>
> > The ticket, in case you are interested, is CONNECTORS-1027.
> >
> > Karl
> >
> >
> > On Thu, Oct 30, 2014 at 7:42 PM, Karl Wright wrote:
> >
> > > Hi Aeham,
> > >
> > > The problem is that you are not using MCF 1.7.1, which has a fix for the
> > > stuffer thread query that makes it run quickly on PostgreSQL 9.1 and above.
> > >
> > > Thanks,
> > > Karl
> > >
> > >
> > > On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi <aeham.abushwashi@exonar.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I have a multi-node zookeeper-based test environment with ~29M documents
> > >> (~36M jobqueue records and ~29M ingeststatus records) and jobs in varying
> > >> states of being (running vs complete vs stopped vs paused). Some jobs are
> > >> continuous and a few are not.
> > >>
> > >> During a quiet time - when no documents are actively ingested - I see very
> > >> high CPU utilisation on the Postgresql database server. This is primarily a
> > >> result of two queries executed by the Stuffer Thread; namely the main query
> > >> in JobManager#fetchAndProcessDocuments and the doc priority query in
> > >> JobManager#getNextDocuments.
> > >>
> > >> ==========================
> > >>
> > >> Query 1:
> > >> SELECT t0.id,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> > >> FROM jobqueue t0 WHERE (t0.status=? OR t0.status=?) AND t0.checkaction=?
> > >> AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=?
> > >> OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?) AND NOT EXISTS(SELECT
> > >> 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN (?,?,?,?,?,?)
> > >> AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4
> > >> WHERE t0.id=t3.owner AND t3.eventname=t4.name)
> > >> ORDER BY t0.docpriority ASC LIMIT 280
> > >>
> > >> Query 1 example params:
> > >> [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d]
> > >>
> > >> ++++++++
> > >>
> > >> Query 2:
> > >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE (status=?
> > >> OR status=?) AND checkaction=? AND checktime<=? AND EXISTS(SELECT 'x' FROM
> > >> jobs t1 WHERE (t1.status=? OR t1.status=?) AND t1.id=t0.jobid) ORDER BY
> > >> docpriority ASC LIMIT 1
> > >>
> > >> Query 2 example params:
> > >> [P, G, R, 1414685283327, A, a]
> > >>
> > >> ==========================
> > >>
> > >> There is one continuous job in a running state, which has ~50K documents
> > >> in a pending-purgatory ('G') state that are not due for processing for
> > >> another couple of days. There are a few other jobs which are stopped and
> > >> have a few million documents in a pending-purgatory state. I don't expect
> > >> Manifold to find any documents eligible for processing in this case, and it
> > >> doesn't, but it takes time and (more seriously) a big chunk of PostgreSQL
> > >> CPU to find the answer: one full CPU core for ~15 secs per MCF instance.
> > >> This wouldn't be a big deal if the queries were executed infrequently.
> > >> Unfortunately, they are run in a loop with a 2 second pause between
> > >> iterations. I'd argue that it would be relatively easy to run into this
> > >> issue with a bunch of large continuous jobs scheduled to run at varying
> > >> times of the day.
> > >>
> > >> With my limited knowledge of the inner workings of the Stuffer Thread,
> > >> it's not obvious to me why query 2 needs to run IF query 1 returns no hits
> > >> at all. If it's not needed, then short-circuiting it in this case would
> > >> help.
> > >>
> > >> Increasing the sleep time between iterations of the Stuffer Thread, in
> > >> response to there being little or no work for a number of consecutive
> > >> iterations, could be another easy win.
> > >>
> > >> A more profound change would be to optimise the joins between the jobs
> > >> table and the jobqueue table. One way to do that would be to move some of
> > >> the logic into the Java code, to first identify active jobs and then query
> > >> the jobqueue table just for those. I understand this would require merging
> > >> and sorting priorities in the Java code, which is less than ideal. I
> > >> continue to analyse the execution plans of the two queries and experiment
> > >> with slight variants to test performance. I'll report back any important
> > >> findings.
> > >>
> > >> Any thoughts or advice are welcome. Also, if you have experience running
> > >> large continuous crawls, I'd be interested to hear from you, particularly
> > >> about performance and especially the performance of SQL queries.
> > >>
> > >> Best regards,
> > >> Aeham