manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Scaling in MCF
Date Mon, 15 Dec 2014 12:59:24 GMT
I attached a patch which turns off reprioritization for job starts and job
resumes.  This will mean that jobs with bins that overlap other jobs will
not make any progress until the documents from those other jobs are
processed, but that may be acceptable to you.

Karl


On Mon, Dec 15, 2014 at 7:22 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi Aeham,
>
> I've created ticket CONNECTORS-1122 to track any ideas that people have for
> making job starts faster under the new document reprioritization scheme.
>
> Karl
>
>
> On Mon, Dec 15, 2014 at 6:53 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Hi Aeham,
>>
>> Unless MCF were to know in advance that each document prioritization bin
>> was confined to a single job, it is not possible to mark only part of the
>> documents for reprioritization at this time.  But prioritization bins go
>> cross-job, and that's where the problem lies.  Adding or removing active
>> documents requires reprioritization by definition, unless there were a
>> way to quickly determine which document bins were in fact represented in
>> any given job.  Working that out is a relatively slow process, but we could
>> try it at some point if you like.
>>
>> Karl
>>
>>
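
Karl's point above about prioritization bins spanning jobs can be made concrete with a small sketch. This is not MCF code. It assumes a hypothetical in-memory queue in which each document carries a job id and a set of bin names (the `hostN` bin names, `QUEUE`, and both function names are invented for illustration). The full scan in `bins_by_job` stands in for the step described as relatively slow.

```python
from collections import defaultdict

# Hypothetical model of the queue: (doc_id, job_id, bins).
# This is an illustration only, not MCF's actual schema.
QUEUE = [
    ("d1", "jobA", {"host1"}),
    ("d2", "jobA", {"host2"}),
    ("d3", "jobB", {"host2"}),
    ("d4", "jobB", {"host3"}),
    ("d5", "jobC", {"host4"}),
]


def bins_by_job(queue):
    """Map each job id to the set of bins its documents fall into."""
    result = defaultdict(set)
    for _doc_id, job_id, bins in queue:
        result[job_id] |= bins
    return dict(result)


def docs_needing_reprioritization(queue, started_job):
    """Return ids of documents sharing at least one bin with the started job."""
    affected_bins = bins_by_job(queue).get(started_job, set())
    return sorted(
        doc_id for doc_id, _job_id, bins in queue if bins & affected_bins
    )


if __name__ == "__main__":
    # d2 belongs to jobA but shares bin "host2" with jobB.
    print(docs_needing_reprioritization(QUEUE, "jobB"))
```

Starting `jobB` in this toy model selects `d2`, `d3`, and `d4`: `d2` belongs to `jobA` but shares a bin with `jobB`, which is why the affected documents cannot be found by filtering on job id alone.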
>> On Mon, Dec 15, 2014 at 5:29 AM, Aeham Abushwashi <aeham.abushwashi@exonar.com> wrote:
>>>
>>> Below is a cutdown version of the pg_stat_activity dump...
>>>
>>> Could the scope of the docpriority update be limited somehow (based on
>>> needpriority?) to only those rows that need it? If a bunch of jobs are
>>> started back to back (which I would say is a reasonably common use case,
>>> especially for continuous crawls), there will be a huge number of repeated,
>>> and therefore redundant, docpriority updates. Granted, the choice of field
>>> for limiting the update scope may require an additional SQL index.
>>>
>>>
>>>           xact_start           |          query_start          |         state_change          | waiting | state  | query
>>> -------------------------------+-------------------------------+-------------------------------+---------+--------+-------------------------------------------------------------------------------------------------------------------------
>>>  2014-12-14 23:51:29.440873+00 | 2014-12-14 23:51:29.44226+00  | 2014-12-14 23:51:29.44226+00  | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>>>  2014-12-15 00:09:02.179227+00 | 2014-12-15 00:09:15.161376+00 | 2014-12-15 00:09:15.161376+00 | t       | active | UPDATE jobqueue SET status=$1,processid=$2 WHERE id=$3
>>>                                | 2014-12-15 00:16:51.936374+00 | 2014-12-15 00:16:51.936374+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
>>>                                | 2014-12-15 00:16:52.176358+00 | 2014-12-15 00:16:52.176402+00 | f       | idle   | SELECT * FROM agents
>>>  2014-12-15 00:03:43.584173+00 | 2014-12-15 00:03:43.593023+00 | 2014-12-15 00:03:43.593023+00 | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>>>                                | 2014-12-15 00:16:48.157249+00 | 2014-12-15 00:16:48.157249+00 | f       | idle   | SELECT * FROM agents
>>>  2014-12-15 00:16:54.550487+00 | 2014-12-15 00:16:54.550776+00 | 2014-12-15 00:16:54.550777+00 | f       | active | SELECT id,dochash,docid,jobid FROM jobqueue WHERE needpriority=$1 LIMIT 1000
>>>  2014-12-15 00:09:02.097583+00 | 2014-12-15 00:09:02.107445+00 | 2014-12-15 00:09:02.107445+00 | f       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>>>  2014-12-15 00:09:03.795408+00 | 2014-12-15 00:09:03.870265+00 | 2014-12-15 00:09:03.870266+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>>>  2014-12-15 00:13:12.2401+00   | 2014-12-15 00:13:12.254646+00 | 2014-12-15 00:13:12.254647+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>>>                                | 2014-12-15 00:16:55.490487+00 | 2014-12-15 00:16:55.490511+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
>>>  2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | f       | active | autovacuum: VACUUM public.jobqueue
>>>  2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | f       | active | SELECT * FROM pg_stat_activity WHERE datname = 'crawlerperf' AND query <> 'COMMIT' ORDER BY client_addr, query_start;
>>>
>>
