Subject: Re: (Continuous) crawl performance
From: Karl Wright <daddywri@gmail.com>
To: dev@manifoldcf.apache.org
Date: Thu, 30 Oct 2014 21:04:52 -0400

Hi Aeham,

The stuffer query is designed to make use of the following index, and indeed to read rows in sorted order from that index:

  IndexDescription docpriorityIndex = new IndexDescription(false,
    new String[]{docPriorityField, statusField, checkActionField, checkTimeField});

All that the joins do is effectively filter the results, based on job status and on whether a given document is actually present in, and being worked on by, another job. The second case is rare, so any plan that emphasizes that jobqueue interaction is plain wrong. I could see the first join being meaningful in some cases, but only if you had lots of inactive jobs, each with lots of documents. Please let me know if that is the situation here.
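(For reference, the PostgreSQL-level shape of that index is roughly as follows. This is only an illustrative sketch: ManifoldCF creates the index itself under a generated name - like the i139... names visible in the plans quoted below - and the column names here are inferred from the queries.)

  -- Illustrative only; do not create this by hand, ManifoldCF manages its own schema.
  -- Column order matters: the stuffer query wants to read rows in docpriority order.
  CREATE INDEX jobqueue_docpriority_idx
    ON jobqueue (docpriority, status, checkaction, checktime);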
It is a mystery why PostgreSQL does not always use this index. In my experience, the main cause is PostgreSQL's cost optimizer making huge distinctions between there being zero of something versus one of something. You can tell readily enough whether this is what is happening: get a plan for the stuffer query, and if it is NOT reading from the index I mentioned, run ANALYZE on the jobqueue table and then get the plan again. If the second plan is fine, then this previously noted PostgreSQL problem is to blame.
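(A minimal sketch of that check, run from psql against the crawler database. The query below is a simplified form of the stuffer query - the full statement, with its EXISTS clauses and literal parameter values, appears in the plans quoted further down - and what matters is whether the plan reads from the docpriority index rather than sorting.)

  -- Get a plan for (a simplified form of) the stuffer query.
  EXPLAIN ANALYZE
    SELECT t0.id, t0.docpriority, t0.jobid, t0.dochash, t0.docid
    FROM jobqueue t0
    WHERE (t0.status='P' OR t0.status='G')
      AND t0.checkaction='R' AND t0.checktime <= 1414683318802
    ORDER BY t0.docpriority ASC LIMIT 280;

  -- Refresh the planner statistics for the table, then compare the plan again.
  ANALYZE jobqueue;

  -- Optional: see when statistics were last gathered for jobqueue.
  SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables WHERE relname = 'jobqueue';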
I'm afraid that your notion of breaking up the query and doing it in Java is extremely problematic, because then we would be forced to read huge resultsets into memory. If you demonstrate that an ANALYZE needs to be done more frequently, on the other hand, there is a configuration parameter for that.

Please let me know.

Karl


On Thu, Oct 30, 2014 at 8:28 PM, Aeham Abushwashi <aeham.abushwashi@exonar.com> wrote:

> Hi Karl,
>
> I do have that change. Paul Boichat, who was involved in the conversation
> leading up to CONNECTORS-1027, applied the fix and tested it in our
> environment.
>
> I see your point that query #1 still has the IN (...) clause indicating
> that it hasn't benefited from the CONNECTORS-1027 fix. However, in my
> execution plan that part of the query is never executed anyway - see below
> for details. In addition, query #2 is clean yet it hits the database hard.
>
> ======================
>
> Execution Plan for Query #1:
>
> EXPLAIN ANALYZE SELECT t0.id,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0
> WHERE (t0.status='P' OR t0.status='G') AND t0.checkaction='R' AND t0.checktime<=1414683318802
>   AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid AND t1.priority=5)
>   AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid)
>   AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
> ORDER BY t0.docpriority ASC LIMIT 280;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..85910.03 rows=280 width=178) (actual time=15152.258..15152.258 rows=0 loops=1)
>   -> Nested Loop Anti Join (cost=0.00..3187261.91 rows=10388 width=178) (actual time=15152.255..15152.255 rows=0 loops=1)
>        Join Filter: ((t2.jobid <> t0.jobid) AND ((t2.dochash)::text = (t0.dochash)::text))
>        -> Nested Loop Anti Join (cost=0.00..3186114.69 rows=10388 width=178) (actual time=15152.253..15152.253 rows=0 loops=1)
>             -> Nested Loop Semi Join (cost=0.00..3177053.22 rows=10388 width=178) (actual time=15152.252..15152.252 rows=0 loops=1)
>                  -> Index Scan using i1392985450172 on jobqueue t0 (cost=0.00..3141756.78 rows=124662 width=178) (actual time=686.159..4046.926 rows=3346768 loops=1)
>                       Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414683318802::bigint))
>                       Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>                  -> Index Scan using jobs_pkey on jobs t1 (cost=0.00..0.28 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=3346768)
>                       Index Cond: (id = t0.jobid)
>                       Filter: ((priority = 5) AND ((status = 'A'::bpchar) OR (status = 'a'::bpchar)))
>                       Rows Removed by Filter: 1
>             -> Nested Loop (cost=0.00..0.86 rows=1 width=8) (never executed)
>                  -> Index Scan using i1392985450178 on prereqevents t3 (cost=0.00..0.27 rows=1 width=524) (never executed)
>                       Index Cond: (t0.id = owner)
>                  -> Index Only Scan using events_pkey on events t4 (cost=0.00..0.58 rows=1 width=516) (never executed)
>                       Index Cond: (name = (t3.eventname)::text)
>                       Heap Fetches: 0
>        -> Materialize (cost=0.00..56.48 rows=6 width=49) (never executed)
>             -> Index Scan using i1392985450173 on jobqueue t2 (cost=0.00..56.45 rows=6 width=49) (never executed)
>                  Index Cond: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
> Total runtime: 15152.487 ms
>
>
> Execution Plan for Query #2:
>
> EXPLAIN ANALYZE SELECT docpriority,jobid,dochash,docid
> FROM jobqueue t0
> WHERE (status='P' OR status='G') AND checkaction='R' AND checktime<=1414685283327
>   AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid)
> ORDER BY docpriority ASC LIMIT 1;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..304.78 rows=1 width=144) (actual time=20620.458..20620.458 rows=0 loops=1)
>   -> Nested Loop Semi Join (cost=0.00..3029548.13 rows=9940 width=144) (actual time=20620.456..20620.456 rows=0 loops=1)
>        -> Index Scan using i1392985450172 on jobqueue t0 (cost=0.00..2996093.69 rows=119282 width=144) (actual time=44.118..5665.201 rows=3346768 loops=1)
>             Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414685283327::bigint))
>             Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>        -> Index Scan using jobs_pkey on jobs t1 (cost=0.00..0.28 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=3346768)
>             Index Cond: (id = t0.jobid)
>             Filter: ((status = 'A'::bpchar) OR (status = 'a'::bpchar))
>             Rows Removed by Filter: 1
> Total runtime: 20620.642 ms
>
> ======================
>
>
> Best regards,
> Aeham
>
> On 30 October 2014 23:45, Karl Wright wrote:
>
> > The ticket, in case you are interested, is CONNECTORS-1027.
> >
> > Karl
> >
> >
> > On Thu, Oct 30, 2014 at 7:42 PM, Karl Wright wrote:
> >
> > > Hi Aeham,
> > >
> > > The problem is that you are not using MCF 1.7.1, which has a fix for the
> > > stuffer thread query that makes it run quickly on PostgreSQL 9.1 and above.
> > >
> > > Thanks,
> > > Karl
> > >
> > >
> > > On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi <aeham.abushwashi@exonar.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I have a multi-node zookeeper-based test environment with ~29M documents
> > >> (~36M jobqueue records and ~29M ingeststatus records) and jobs in varying
> > >> states of being (running vs complete vs stopped vs paused). Some jobs are
> > >> continuous and a few are not.
> > >>
> > >> During a quiet time - when no documents are actively ingested - I see very
> > >> high CPU utilisation on the Postgresql database server. This is primarily a
> > >> result of two queries executed by the Stuffer Thread; namely the main query
> > >> in JobManager#fetchAndProcessDocuments and the doc priority query in
> > >> JobManager#getNextDocuments.
> > >>
> > >> ==========================
> > >>
> > >> Query 1:
> > >> SELECT t0.id,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> > >> FROM jobqueue t0 WHERE (t0.status=? OR t0.status=?) AND t0.checkaction=?
> > >> AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=?
> > >> OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?) AND NOT EXISTS(SELECT
> > >> 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN (?,?,?,?,?,?)
> > >> AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4
> > >> WHERE t0.id=t3.owner AND t3.eventname=t4.name)
> > >> ORDER BY t0.docpriority ASC LIMIT 280
> > >>
> > >> Query 1 example params:
> > >> [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d]
> > >>
> > >> ++++++++
> > >>
> > >> Query 2:
> > >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE (status=?
> > >> OR status=?) AND checkaction=? AND checktime<=? AND EXISTS(SELECT 'x' FROM
> > >> jobs t1 WHERE (t1.status=? OR t1.status=?) AND t1.id=t0.jobid) ORDER BY
> > >> docpriority ASC LIMIT 1
> > >>
> > >> Query 2 example params:
> > >> [P, G, R, 1414685283327, A, a]
> > >>
> > >> ==========================
> > >>
> > >> There is one continuous job in a running state, which has ~50K documents
> > >> in a pending-purgatory ('G') state that are not due for processing for
> > >> another couple of days. There are a few other jobs which are stopped and
> > >> have a few million documents in a pending-purgatory state. I don't expect
> > >> Manifold to find any documents eligible for processing in this case, and it
> > >> doesn't, but it takes time and (more seriously) a big chunk of PostgreSQL
> > >> CPU to find the answer: one full CPU core for ~15 secs per MCF instance.
> > >> This wouldn't be a big deal if the queries were executed infrequently.
> > >> Unfortunately, they are run in a loop with a 2 second pause between
> > >> iterations. I'd argue that it would be relatively easy to run into this
> > >> issue with a bunch of large continuous jobs scheduled to run at varying
> > >> times of the day.
> > >>
> > >> With my limited knowledge of the inner workings of the Stuffer Thread,
> > >> it's not obvious to me why query 2 needs to run IF query 1 returns no hits
> > >> at all. If it's not needed, then short-circuiting it in this case would
> > >> help.
> > >>
> > >> Increasing the sleep time between iterations of the Stuffer Thread, in
> > >> response to there being little or no work for a number of consecutive
> > >> iterations, could be another easy win.
> > >>
> > >> A more profound change would be to optimise the joins between the jobs
> > >> table and the jobqueue table. One way to do that would be to move some of
> > >> the logic into the Java code, to first identify active jobs and then query
> > >> the jobqueue table just for those. I understand this would require merging
> > >> and sorting priorities in the Java code, which is less than ideal. I
> > >> continue to analyse the execution plans of the two queries and experiment
> > >> with slight variants to test performance. I'll report back any important
> > >> findings.
> > >>
> > >> Any thoughts or advice are welcome. Also, if you have experience running
> > >> large continuous crawls, I'd be interested to hear from you, particularly
> > >> about performance and especially the performance of SQL queries.
> > >>
> > >> Best regards,
> > >> Aeham