manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scaling in MCF
Date Mon, 15 Dec 2014 11:53:43 GMT
Hi Aeham,

At this time, MCF cannot mark only a subset of documents for
reprioritization unless it knows in advance that each document
prioritization bin is confined to a single job.  But prioritization bins
span jobs, and that's where the problem lies.  Adding or removing active
documents requires reprioritization by definition, unless there is a way to
quickly determine which document bins are in fact represented in any given
job.  Determining that is a relatively slow process, but we could try it at
some point if you like.
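
To illustrate the bin check described above, here is a minimal sketch in
Python. It is not MCF code: the (job_id, bin_name) pairs and both function
names are hypothetical, and it assumes the set of active documents can be
enumerated along with their bins, which is the slow part.

```python
from collections import defaultdict


def bins_by_job(active_docs):
    """Map each job id to the set of prioritization bins its documents use.

    active_docs is an iterable of (job_id, bin_name) pairs (hypothetical shape).
    """
    result = defaultdict(set)
    for job_id, bin_name in active_docs:
        result[job_id].add(bin_name)
    return result


def jobs_sharing_bins(active_docs, changed_job):
    """Return the jobs that share at least one bin with changed_job.

    Only these jobs would need their documents reprioritized when
    changed_job gains or loses active documents.
    """
    per_job = bins_by_job(active_docs)
    changed_bins = per_job.get(changed_job, set())
    return {job for job, bins in per_job.items() if bins & changed_bins}


docs = [
    ("job1", "a.example.com"),
    ("job1", "b.example.com"),
    ("job2", "b.example.com"),
    ("job3", "c.example.com"),
]
affected = jobs_sharing_bins(docs, "job1")
```

In this toy data job1 and job2 share a bin, so both are affected, while
job3 could be left alone.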

Karl


On Mon, Dec 15, 2014 at 5:29 AM, Aeham Abushwashi <
aeham.abushwashi@exonar.com> wrote:
>
> Below is a cut-down version of the pg_stat_activity dump...
>
> Could the scope of the docpriority update be limited somehow (based on
> needpriority?) to only those rows that need it? If a bunch of jobs are
> started back to back (which I would say is a reasonably common use case,
> especially for continuous crawls), there will be a huge amount of repeated,
> and therefore redundant, docpriority updates. Granted, the choice of field
> for limiting the update scope may require an additional SQL index.
>
>
>           xact_start           |          query_start          |         state_change          | waiting | state  | query
> -------------------------------+-------------------------------+-------------------------------+---------+--------+-----------------------------------------------------------------------------------------------------------------------
>  2014-12-14 23:51:29.440873+00 | 2014-12-14 23:51:29.44226+00  | 2014-12-14 23:51:29.44226+00  | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>  2014-12-15 00:09:02.179227+00 | 2014-12-15 00:09:15.161376+00 | 2014-12-15 00:09:15.161376+00 | t       | active | UPDATE jobqueue SET status=$1,processid=$2 WHERE id=$3
>                                | 2014-12-15 00:16:51.936374+00 | 2014-12-15 00:16:51.936374+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
>                                | 2014-12-15 00:16:52.176358+00 | 2014-12-15 00:16:52.176402+00 | f       | idle   | SELECT * FROM agents
>  2014-12-15 00:03:43.584173+00 | 2014-12-15 00:03:43.593023+00 | 2014-12-15 00:03:43.593023+00 | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>                                | 2014-12-15 00:16:48.157249+00 | 2014-12-15 00:16:48.157249+00 | f       | idle   | SELECT * FROM agents
>  2014-12-15 00:16:54.550487+00 | 2014-12-15 00:16:54.550776+00 | 2014-12-15 00:16:54.550777+00 | f       | active | SELECT id,dochash,docid,jobid FROM jobqueue WHERE needpriority=$1 LIMIT 1000
>  2014-12-15 00:09:02.097583+00 | 2014-12-15 00:09:02.107445+00 | 2014-12-15 00:09:02.107445+00 | f       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>  2014-12-15 00:09:03.795408+00 | 2014-12-15 00:09:03.870265+00 | 2014-12-15 00:09:03.870266+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>  2014-12-15 00:13:12.2401+00   | 2014-12-15 00:13:12.254646+00 | 2014-12-15 00:13:12.254647+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>                                | 2014-12-15 00:16:55.490487+00 | 2014-12-15 00:16:55.490511+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
>  2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | f       | active | autovacuum: VACUUM public.jobqueue
>  2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | f       | active | SELECT * FROM pg_stat_activity WHERE datname = 'crawlerperf' AND query <> 'COMMIT' ORDER BY client_addr, query_start;
>
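
For reference, the scoping idea raised in the quoted message can be tried
on a toy table. The sketch below uses Python's sqlite3 module rather than
PostgreSQL. The sentinel value, the 'T'/'F' encoding of needpriority, the
three-column table, and the sample rows are all assumptions made for
illustration and are not taken from the MCF schema. Whether a row can be
flagged while still holding a real docpriority in practice is not
established here.

```python
import sqlite3

# Hypothetical sentinel meaning "no priority assigned".
NULL_PRIORITY = 1_000_000_000.0


def make_db():
    """Create an in-memory jobqueue with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobqueue ("
        "id INTEGER PRIMARY KEY, docpriority REAL, needpriority TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobqueue VALUES (?, ?, ?)",
        [
            (1, 5.0, "F"),
            (2, 7.5, "F"),
            (3, 3.0, "T"),            # already flagged (assumed state)
            (4, NULL_PRIORITY, "T"),  # already reset
        ],
    )
    return conn


def reset_priorities(conn, scoped):
    """Run the docpriority reset and return how many rows it touched.

    With scoped=True, rows already flagged with needpriority='T' are skipped.
    """
    sql = "UPDATE jobqueue SET docpriority=?, needpriority=? WHERE docpriority<?"
    params = [NULL_PRIORITY, "T", NULL_PRIORITY]
    if scoped:
        sql += " AND needpriority<>?"
        params.append("T")
    return conn.execute(sql, params).rowcount


unscoped_rows = reset_priorities(make_db(), scoped=False)
scoped_rows = reset_priorities(make_db(), scoped=True)
```

On this data the scoped statement touches one row fewer, but it also
leaves the old docpriority on the already-flagged row, so the two
statements are not equivalent.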
