Subject: Re: What should happen when a distributed agent dies?
From: Brett Porter
Date: Wed, 30 Sep 2009 17:35:29 +1000
To: dev@continuum.apache.org

So I went through the cluster of issues you opened and put them in
1.3.5, then realised that we had already kind of settled on the list
of things for 1.3.5 :)

Should we:
1) keep them where they are?
2) push them to 1.3.6?
3) push them to 1.4.0?
4) or are they already addressed by Marica's other fix?

Cheers,
Brett

On 29/09/2009, at 3:03 AM, Wendy Smoak wrote:

> I've been working with Distributed Builds lately, and I've found that
> it works if everything is perfect, but if something goes wrong it has
> a hard time coping with the problem, and it doesn't recover.
>
> For example, it's a given that at some point, an agent is going to die
> without being properly removed first.
>
> Currently if this happens, the Queues page breaks (error/stack trace)
> and you can't edit or delete the offending agent to disable or get rid
> of it.
>
> The agent is also still shown as 'enabled' on the Distributed Agents
> page even though it's not responding.
>
> What should happen in this case?
>
> I'm all for having the system automatically disable any agent that is
> not behaving properly. At first, the admin may have to manually
> re-enable it. In the future we might come up with a way for it to
> auto-recover.
>
> Thoughts?
>
> --
> Wendy