Date: Mon, 15 Dec 2014 06:53:43 -0500
Subject: Re: Scaling in MCF
From: Karl Wright
To: dev

Hi Aeham,

MCF cannot currently mark only a subset of documents for reprioritization
unless it knows in advance that each document prioritization bin is confined
to a single job. But prioritization bins go cross-job, and that's where the
problem lies. Adding or removing active documents requires reprioritization
by definition, unless there were a way to quickly determine which document
bins are in fact represented in any given job. Working that out is a
relatively slow process, but we could try it at some point if you like.

Karl


On Mon, Dec 15, 2014 at 5:29 AM, Aeham Abushwashi <aeham.abushwashi@exonar.com> wrote:
>
> Below is a cutdown version of the pg_stat_activity dump...
>
> Could the scope of the docpriority update be limited somehow (based on
> needpriority?) to only those rows that need it? If a bunch of jobs are
> started back to back (which I would say is a reasonably common use case,
> especially for continuous crawls), there will be a huge amount of repeated,
> and therefore redundant, docpriority updates. Granted, the choice of
> field for limiting the update scope may require an additional SQL index.
>
>
> xact_start                    | query_start                   | state_change                  | waiting | state  | query
> ------------------------------+-------------------------------+-------------------------------+---------+--------+----------------------------------------------------------------------------------------------------------------------
> 2014-12-14 23:51:29.440873+00 | 2014-12-14 23:51:29.44226+00  | 2014-12-14 23:51:29.44226+00  | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
> 2014-12-15 00:09:02.179227+00 | 2014-12-15 00:09:15.161376+00 | 2014-12-15 00:09:15.161376+00 | t       | active | UPDATE jobqueue SET status=$1,processid=$2 WHERE id=$3
>                               | 2014-12-15 00:16:51.936374+00 | 2014-12-15 00:16:51.936374+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
>                               | 2014-12-15 00:16:52.176358+00 | 2014-12-15 00:16:52.176402+00 | f       | idle   | SELECT * FROM agents
> 2014-12-15 00:03:43.584173+00 | 2014-12-15 00:03:43.593023+00 | 2014-12-15 00:03:43.593023+00 | t       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
>                               | 2014-12-15 00:16:48.157249+00 | 2014-12-15 00:16:48.157249+00 | f       | idle   | SELECT * FROM agents
> 2014-12-15 00:16:54.550487+00 | 2014-12-15 00:16:54.550776+00 | 2014-12-15 00:16:54.550777+00 | f       | active | SELECT id,dochash,docid,jobid FROM jobqueue WHERE needpriority=$1 LIMIT 1000
> 2014-12-15 00:09:02.097583+00 | 2014-12-15 00:09:02.107445+00 | 2014-12-15 00:09:02.107445+00 | f       | active | UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3
> 2014-12-15 00:09:03.795408+00 | 2014-12-15 00:09:03.870265+00 | 2014-12-15 00:09:03.870266+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
> 2014-12-15 00:13:12.2401+00   | 2014-12-15 00:13:12.254646+00 | 2014-12-15 00:13:12.254647+00 | t       | active | SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE
>                               | 2014-12-15 00:16:55.490487+00 | 2014-12-15 00:16:55.490511+00 | f       | idle   | SELECT id FROM jobs WHERE status=$1
> 2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | 2014-12-15 00:07:55.403813+00 | f       | active | autovacuum: VACUUM public.jobqueue
> 2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | 2014-12-15 00:16:56.690037+00 | f       | active | SELECT * FROM pg_stat_activity WHERE datname = 'crawlerperf' AND query <> 'COMMIT' ORDER BY client_addr, query_start;
>
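
One way to read the dump above is to tally the sessions flagged as waiting, grouped by query text. The sketch below is only an illustration: the rows are transcribed by hand from the quoted pg_stat_activity output, and the names `SESSIONS` and `tally_waiting` are hypothetical and not part of MCF.

```python
from collections import Counter

# Query texts copied from the quoted pg_stat_activity dump.
DOCPRIORITY_UPDATE = "UPDATE jobqueue SET docpriority=$1,needpriority=$2 WHERE docpriority<$3"
STATUS_UPDATE = "UPDATE jobqueue SET status=$1,processid=$2 WHERE id=$3"
SELECT_FOR_UPDATE = (
    "SELECT id,status,checktime FROM jobqueue WHERE dochash=$1 AND jobid=$2 FOR UPDATE"
)

# (waiting, state, query) per session, in the order the rows appear in the dump.
# Hypothetical hand transcription; nothing here queries a live database.
SESSIONS = [
    ("t", "active", DOCPRIORITY_UPDATE),
    ("t", "active", STATUS_UPDATE),
    ("f", "idle", "SELECT id FROM jobs WHERE status=$1"),
    ("f", "idle", "SELECT * FROM agents"),
    ("t", "active", DOCPRIORITY_UPDATE),
    ("f", "idle", "SELECT * FROM agents"),
    ("f", "active", "SELECT id,dochash,docid,jobid FROM jobqueue WHERE needpriority=$1 LIMIT 1000"),
    ("f", "active", DOCPRIORITY_UPDATE),
    ("t", "active", SELECT_FOR_UPDATE),
    ("t", "active", SELECT_FOR_UPDATE),
    ("f", "idle", "SELECT id FROM jobs WHERE status=$1"),
    ("f", "active", "autovacuum: VACUUM public.jobqueue"),
    ("f", "active", "SELECT * FROM pg_stat_activity ..."),
]


def tally_waiting(sessions):
    """Count, per query text, the sessions whose waiting flag is 't'."""
    return Counter(query for waiting, _state, query in sessions if waiting == "t")


waiting = tally_waiting(SESSIONS)
for query, count in waiting.most_common():
    print(count, query)
```

On these rows, three sessions are running the same jobqueue-wide docpriority update and two of them are flagged as waiting, which is consistent with the redundancy concern raised in the quoted message.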