manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scaling in MCF
Date Mon, 15 Dec 2014 00:57:29 GMT
Hi Aeham,

FWIW, it's pretty difficult reading the dump you provided.  If you could
include only the pertinent columns, that would help a lot.

>>>>>>
My guess is that as the stuffer thread picks up items and updates their
status and process id, it can get blocked (indirectly?) by the
reprioritisation query issued when a job is started. This causes the stuffer
thread to stall and subsequently no documents are processed by any node in
the cluster.
<<<<<<

Starting a job now must mark all active jobqueue records as needing
document priorities.  This necessarily takes quite some time, independently
of actually computing the document priorities, which happens later.  Since
the query that signals that prioritization is needed is not done in chunks,
nothing else gets past it either.
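
For illustration only, here is a minimal sketch of what doing the marking
"in chunks" might look like. It is not MCF code. It assumes a simplified
jobqueue table, uses Python's built-in sqlite3 as a stand-in for PostgreSQL,
and the names mark_needing_priority, NULL_PRIORITY and chunk_size are
hypothetical. It also assumes that the value written to docpriority no longer
matches the WHERE clause, so each pass picks up a fresh batch. Note that
chunking only shortens how long each statement holds its locks; it does not
remove the requirement that prioritization consider all active documents.

```python
import sqlite3

# Hypothetical sentinel meaning "no priority assigned yet" (an assumption).
NULL_PRIORITY = 1e9


def mark_needing_priority(conn, chunk_size=1000):
    """Flag rows as needing priority in bounded batches.

    Each batch is committed on its own, so locks are released between
    batches instead of being held for one table-wide UPDATE.
    Returns the total number of rows flagged.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE jobqueue SET docpriority=?, needpriority='T' "
            "WHERE id IN (SELECT id FROM jobqueue WHERE docpriority<? "
            "ORDER BY id LIMIT ?)",
            (NULL_PRIORITY, NULL_PRIORITY, chunk_size),
        )
        conn.commit()  # end the transaction so other writers can proceed
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


if __name__ == "__main__":
    # Simplified stand-in schema; the real jobqueue table has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobqueue ("
        "id INTEGER PRIMARY KEY, docpriority REAL, needpriority TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobqueue (docpriority, needpriority) VALUES (?, 'F')",
        [(float(i),) for i in range(2500)],
    )
    conn.commit()
    print(mark_needing_priority(conn, chunk_size=1000))
```

The trade-off is that between batches some records are flagged and others are
not, so anything that reads needpriority has to tolerate that intermediate
state.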

I knew this was likely to happen when I reorganized the code around
document prioritization based on earlier conversations we've had.  If you
recall, I expressed concern that any activity that requires large numbers
of jobqueue records to be written would in fact eventually be the factor
that limits MCF's ability to scale.  But I have found no solution to the
problem, since document prioritization must be done with all active
documents in mind.

Thanks,
Karl




On Sun, Dec 14, 2014 at 7:40 PM, Aeham Abushwashi <
aeham.abushwashi@exonar.com> wrote:
>
> Hi Karl,
>
> I have some analysis to share wrt job starting performance...
>
> After starting a handful of new jobs, my 3-node MCF cluster (running dev_1x
> and already populated with ~10M jobqueue records) appeared to have stalled. The
> stuffer threads on two nodes were waiting on the stuffer lock. The stuffer
> thread on the third node was blocked on the execution of the sql query in
> JobQueue#updateActiveRecord.
>
> My guess is that as the stuffer thread picks up items and updates their
> status and process id, it can get blocked (indirectly?) by the
> reprioritisation query issued when a job is started. This causes the stuffer
> thread to stall and subsequently no documents are processed by any node in
> the cluster.
>
> Below is a dump of the pg_stat_activity table. Note the values of the
> 'waiting' column (true/false). The second row corresponds to the query
> invoked by JobQueue#updateActiveRecord
>
> Columns constant across all rows: datid = 400109, datname = crawlerperf,
> client_hostname = (empty). Rows are in the original order, keyed by pid.
>
>   pid  | usesysid | usename  | application_name | client_addr | client_port |         backend_start
> -------+----------+----------+------------------+-------------+-------------+-------------------------------
>  16001 |    16384 | slurp    |                  | 10.250.0.23 |       52790 | 2014-12-14 23:19:25.909125+00
>  17020 |    16384 | slurp    |                  | 10.250.0.23 |       52813 | 2014-12-14 23:44:31.286417+00
>  16744 |    16384 | slurp    |                  | 10.250.0.23 |       52807 | 2014-12-14 23:37:22.181826+00
>  17022 |    16384 | slurp    |                  | 10.250.0.23 |       52815 | 2014-12-14 23:44:31.402114+00
>  14258 |    16384 | slurp    |                  | 10.250.0.33 |       55885 | 2014-12-14 22:43:56.316824+00
>  13832 |    16384 | slurp    |                  | 10.250.0.33 |       55882 | 2014-12-14 22:30:47.752513+00
>  17745 |    16384 | slurp    |                  | 10.250.0.33 |       55901 | 2014-12-14 23:56:01.490378+00
>  16992 |    16384 | slurp    |                  | 10.250.0.43 |       51521 | 2014-12-14 23:43:53.198025+00
>  14907 |    16384 | slurp    |                  | 10.250.0.43 |       51462 | 2014-12-14 22:56:20.025004+00
>  18028 |    16384 | slurp    |                  | 10.250.0.43 |       51535 | 2014-12-15 00:03:37.741002+00
>  15976 |    16384 | slurp    |                  | 10.250.0.43 |       51490 | 2014-12-14 23:18:51.369753+00
>  18175 |       10 | postgres |                  |             |             | 2014-12-15 00:07:55.204579+00
>  17632 |       10 | postgres | psql             |             |          -1 | 2014-12-14 23:52:55.506248+00
>
>   pid  |          xact_start           |          query_start          |         state_change          | waiting | state
> -------+-------------------------------+-------------------------------+-------------------------------+---------+--------
>  16001 | 2014-12-14 23:51:29.440873+00 | 2014-12-14 23:51:29.44226+00  | 2014-12-14 23:51:29.44226+00  | t       | active
>  17020 | 2014-12-15 00:09:02.179227+00 | 2014-12-15 00:09:15.161376+00 | 2014-12-15 00:09:15.161376+00 | t       | active
>  16744 |                               | 2014-12-15 00:16:51.936374+00 | 2014-12-15 00:16:51.936374+00 | f       | idle
>  17022 |                               | 2014-12-15 00:16:52.176358+00 | 2014-12-15 00:16:52.176402+00 | f       | idle
>  14258 | 2014-12-15 00:03:43.584173+00 | 2014-12-15 00:03:43.593023+00 | 2014-12-15 00:03:43.593023+00 | t       | active
>  13832 |                               | 2014-12-15 00:16:48.157249+00 | 2014-12-15 00:16:48.157249+00 | f       | idle
>  17745 | 2014-12-15 00:16:54.550487+00 | 2014-12-15 00:16:54.550776+00 | 2014-12-15 00:16:54.550777+00 | f       | active
>  16992 | 2014-12-15 00:09:02.097583+00 | 2014-12-15 00:09:02.107445+00 | 2014-12-15 00:09:02.107445+00 | f       | active
>  14907 | 2014-12-15 00:09:03.795408+00 | 2014-12-15 00:09:03.870265+00 | 2014-12-15 00:09:03.870266+00 | t       | active
>  18028 | 2014-12-15 00:13:12.2401+00   | 2014-12-15 00:13:12.254646+00 | 2014-12-15 00:13:12.254647+00 | t       | active
>  15976 |                               | 2014-12-15 00:16:55.490487+00 | 2014-12-15 00:16:55.490511+00 | f       | idle
>  18175 | 2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | f       | active
>  17632 | 2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | f       | active
>
>   pid  | query
> -------+------------------------------------------------------------------------------------
>  16001 | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>  17020 | UPDATE jobqueue SET status=$1,processid=$2 WHERE id=$3
>  16744 | SELECT id FROM jobs WHERE status=$1
>  17022 | SELECT * FROM agents
>  14258 | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>  13832 | SELECT * FROM agents
>  17745 | SELECT id,dochash,docid,jobid FROM jobqueue WHERE needpriority=$1 LIMIT 1000
>  16992 | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>  14907 | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>  18028 | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>  15976 | SELECT id FROM jobs WHERE status=$1
>  18175 | autovacuum: VACUUM public.jobqueue
>  17632 | SELECT * FROM pg_stat_activity WHERE datname = 'crawlerperf' AND query <> 'COMMIT' ORDER BY client_addr, query_start;
> (13 rows)
>
>
> Cheers,
> Aeham
>
