Message-ID: <498865291.1250653514816.JavaMail.jira@brutus>
Date: Tue, 18 Aug 2009 20:45:14 -0700 (PDT)
From: "Ari Rabkin (JIRA)"
To: chukwa-dev@hadoop.apache.org
Reply-To: chukwa-dev@hadoop.apache.org
Subject: [jira] Commented: (CHUKWA-369) proposed reliability mechanism
In-Reply-To: <2099955847.1249412175025.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744862#action_12744862 ]

Ari Rabkin
commented on CHUKWA-369:
-----------------------------------

OK. I've now modified AsyncAckSender so that it can take a separate list of collectors that should be used for checking file lengths. But I just realized there are two deeper problems with my approach.

1) Suppose that an Ack doesn't arrive. What then? The code to rewind adaptors to the last checkpoint and resume hasn't been written yet. But I think it's pretty straightforward.

2) It's possible that an agent writes chunks 1, 2 and 3 to collector A, then fails over to collector B and writes chunks 4 and 5. Suppose we get Acks for 1, 2, 4 and 5. The right thing to do is to apply the acks for 1 and 2, hold the acks for 4 and 5, and then, if the timeout occurs, restart from 3. But right now, we just assume that an ack for chunk n+1 implies that chunks 0-n have all committed. This isn't really right.

There are two plausible fixes. The first is to automatically reset each running adaptor whenever we switch collectors. This makes (2) very easy to solve, at the expense of making dynamic load-balancing harder. The second is to use timeouts, and to really confront (2) head-on.

> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: delayedAcks.patch
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite, since we don't handle collector crashes. Here's a proposed reliability mechanism.
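
For illustration only, here is a minimal sketch of the ack handling described in point (2) of the comment: apply acks only for a contiguous prefix of chunks, hold acks that arrive beyond a gap, and restart from the first unacknowledged chunk on timeout. It is written in Python rather than Chukwa's Java, and AckTracker, on_ack and on_timeout are hypothetical names that do not come from the Chukwa code base or from delayedAcks.patch. Discarding the held acks on timeout is also an assumption; the comment only says to restart from the missing chunk.

```python
class AckTracker:
    """Hypothetical helper: decides which chunks are safe to treat as committed."""

    def __init__(self, first_chunk=1):
        # Lowest chunk number not yet known to be committed.
        self.next_uncommitted = first_chunk
        # Acked chunks that sit behind a gap and cannot be applied yet.
        self.held = set()

    def on_ack(self, chunk):
        """Record an ack and return the list of chunks that just became committed."""
        if chunk < self.next_uncommitted:
            return []  # duplicate or stale ack, nothing new to apply
        self.held.add(chunk)
        applied = []
        # Advance only through a contiguous run of acked chunks.
        while self.next_uncommitted in self.held:
            self.held.remove(self.next_uncommitted)
            applied.append(self.next_uncommitted)
            self.next_uncommitted += 1
        return applied

    def on_timeout(self):
        """Drop held acks (an assumption) and return the chunk to restart from."""
        self.held.clear()
        return self.next_uncommitted


if __name__ == "__main__":
    # Scenario from the comment: chunks 1-3 went to collector A, 4-5 to collector B,
    # and the ack for chunk 3 never arrives.
    tracker = AckTracker(first_chunk=1)
    for chunk in (1, 2, 4, 5):
        print(chunk, tracker.on_ack(chunk))
    print("restart from", tracker.on_timeout())
```

In this sketch the acks for 1 and 2 are applied immediately, the acks for 4 and 5 are held, and the timeout reports chunk 3 as the restart point. The first proposed fix (resetting adaptors on every collector switch) would make the held set unnecessary, since no ack could ever arrive from beyond a gap.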