manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scaling in MCF
Date Mon, 15 Dec 2014 12:22:24 GMT
Hi Aeham,

I've created ticket CONNECTORS-1122 to track any ideas that people have for
making job starts faster under the new document reprioritization situation.

Karl


On Mon, Dec 15, 2014 at 6:53 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi Aeham,
>
> Unless MCF knew in advance that each document prioritization bin was
> confined to a single job, it is not possible to mark only part of the
> documents for reprioritization at this time.  But prioritization bins go
> cross-job, and that's where the problem lies.  Adding or removing active
> documents requires reprioritization by definition, unless there were a
> way to quickly determine which document bins are in fact represented in
> any given job.  That's a relatively slow process, but we could try that at
> some point if you like.
>
> Karl
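
The idea of determining which document bins are represented in a given job can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not how ManifoldCF stores or computes bins: the `QUEUE` data, the bin names, and both helper functions are hypothetical. It shows why bins that cross jobs widen the set of documents affected when one job starts.

```python
# Hypothetical stand-in for queued documents: (doc_id, job_id, bin_names).
# The document IDs, job names and bin names are invented for illustration.
QUEUE = [
    ("d1", "jobA", {"host1.example.com"}),
    ("d2", "jobA", {"host2.example.com"}),
    ("d3", "jobB", {"host2.example.com"}),  # shares a bin with jobA
    ("d4", "jobB", {"host3.example.com"}),
    ("d5", "jobC", {"host4.example.com"}),
]


def bins_in_job(queue, job_id):
    """Collect every bin represented by the documents of one job."""
    bins = set()
    for _doc_id, job, doc_bins in queue:
        if job == job_id:
            bins |= doc_bins
    return bins


def docs_needing_reprioritization(queue, job_id):
    """Return documents, in any job, that share a bin with the given job."""
    affected = bins_in_job(queue, job_id)
    return sorted(doc_id for doc_id, _job, doc_bins in queue if doc_bins & affected)


if __name__ == "__main__":
    # jobA shares host2.example.com with jobB, so d3 is affected as well.
    print(docs_needing_reprioritization(QUEUE, "jobA"))
    # jobC's only bin is confined to jobC, so only d5 is affected.
    print(docs_needing_reprioritization(QUEUE, "jobC"))
```

In this model the scan over the whole queue is the expensive step, which matches the remark that finding the represented bins is relatively slow.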
>
>
> On Mon, Dec 15, 2014 at 5:29 AM, Aeham Abushwashi <
> aeham.abushwashi@exonar.com> wrote:
>>
>> Below is a cut-down version of the pg_stat_activity dump...
>>
>> Could the scope of the docpriority update be limited somehow (based on
>> needpriority?) to only those rows that need it? If a bunch of jobs are
>> started back to back (which I would say is a reasonably common use case,
>> especially for continuous crawls), there will be a huge amount of
>> repeated, and therefore redundant, docpriority updates. Granted, the
>> choice of field for limiting the update scope may require an additional
>> SQL index.
>>
>>
>>           xact_start           |          query_start          |         state_change          | waiting | state  | query
>> -------------------------------+-------------------------------+-------------------------------+---------+--------+------------------------------------------------------------------------
>>  2014-12-14 23:51:29.440873+00 | 2014-12-14 23:51:29.44226+00  | 2014-12-14 23:51:29.44226+00  | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>>  2014-12-15 00:09:02.179227+00 | 2014-12-15 00:09:15.161376+00 | 2014-12-15 00:09:15.161376+00 | t       | active | UPDATE jobqueue SET status=$1,processid=$2 WHERE id=$3
>>                                | 2014-12-15 00:16:51.936374+00 | 2014-12-15 00:16:51.936374+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
>>                                | 2014-12-15 00:16:52.176358+00 | 2014-12-15 00:16:52.176402+00 | f       | idle   | SELECT * FROM agents
>>  2014-12-15 00:03:43.584173+00 | 2014-12-15 00:03:43.593023+00 | 2014-12-15 00:03:43.593023+00 | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>>                                | 2014-12-15 00:16:48.157249+00 | 2014-12-15 00:16:48.157249+00 | f       | idle   | SELECT * FROM agents
>>  2014-12-15 00:16:54.550487+00 | 2014-12-15 00:16:54.550776+00 | 2014-12-15 00:16:54.550777+00 | f       | active | SELECT id,dochash,docid,jobid FROM jobqueue WHERE needpriority=$1 LIMIT 1000
>>  2014-12-15 00:09:02.097583+00 | 2014-12-15 00:09:02.107445+00 | 2014-12-15 00:09:02.107445+00 | f       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>>  2014-12-15 00:09:03.795408+00 | 2014-12-15 00:09:03.870265+00 | 2014-12-15 00:09:03.870266+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>>  2014-12-15 00:13:12.2401+00   | 2014-12-15 00:13:12.254646+00 | 2014-12-15 00:13:12.254647+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>>                                | 2014-12-15 00:16:55.490487+00 | 2014-12-15 00:16:55.490511+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
>>  2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | f       | active | autovacuum: VACUUM public.jobqueue
>>  2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | f       | active | SELECT * FROM pg_stat_activity WHERE datname = 'crawlerperf' AND query <> 'COMMIT' ORDER BY client_addr, query_start;
>>
>
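
The suggestion to limit the docpriority update to rows that need it can be tried out on a toy table. The sketch below uses Python's built-in `sqlite3` rather than PostgreSQL, and several things in it are assumptions rather than ManifoldCF facts: the sentinel `NULL_PRIORITY`, the flag values `"T"`/`"F"`, the reduced three-column `jobqueue` schema, the index name, and the premise that a row can already be flagged in `needpriority` while still holding an old priority. It only compares how many rows each `WHERE` clause touches.

```python
import sqlite3

NULL_PRIORITY = 1e9      # assumed "no priority assigned" sentinel
NEEDS, DONE = "T", "F"   # assumed needpriority flag values


def make_db():
    """Build a tiny in-memory jobqueue with two flagged and two unflagged rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobqueue ("
        "id INTEGER PRIMARY KEY, docpriority REAL, needpriority TEXT)"
    )
    # The additional index a needpriority-based filter might call for.
    conn.execute("CREATE INDEX jobqueue_needpriority ON jobqueue (needpriority)")
    conn.executemany(
        "INSERT INTO jobqueue VALUES (?, ?, ?)",
        [(1, 0.5, DONE), (2, 0.7, DONE), (3, 0.2, NEEDS), (4, 0.9, NEEDS)],
    )
    conn.commit()
    return conn


def broad_update(conn):
    """Same shape as the UPDATE in the dump: filter on docpriority only."""
    cur = conn.execute(
        "UPDATE jobqueue SET docpriority=?, needpriority=? WHERE docpriority<?",
        (NULL_PRIORITY, NEEDS, NULL_PRIORITY),
    )
    return cur.rowcount


def scoped_update(conn):
    """Hypothetical variant: also skip rows already flagged for reprioritization."""
    cur = conn.execute(
        "UPDATE jobqueue SET docpriority=?, needpriority=? "
        "WHERE docpriority<? AND needpriority<>?",
        (NULL_PRIORITY, NEEDS, NULL_PRIORITY, NEEDS),
    )
    return cur.rowcount


if __name__ == "__main__":
    print("rows touched, broad :", broad_update(make_db()))
    print("rows touched, scoped:", scoped_update(make_db()))
```

Whether the narrower filter would be safe in ManifoldCF itself depends on how `docpriority` and `needpriority` are kept in sync, which this sketch does not model.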
