Mailing-List: contact chukwa-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: chukwa-dev@hadoop.apache.org
Message-ID: <498865291.1250653514816.JavaMail.jira@brutus>
Date: Tue, 18 Aug 2009 20:45:14 -0700 (PDT)
From: "Ari Rabkin (JIRA)" <jira@apache.org>
To: chukwa-dev@hadoop.apache.org
Subject: [jira] Commented: (CHUKWA-369) proposed reliability mechanism
In-Reply-To: <2099955847.1249412175025.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744862#action_12744862 ] 

Ari Rabkin commented on CHUKWA-369:
-----------------------------------

OK.  I've now modified AsyncAckSender so that it can take a separate list of collectors to use for checking file lengths.

But I just realized there are two deeper problems with my approach. 

1) Suppose that an ack doesn't arrive. What then?  The code to rewind adaptors to the last checkpoint and resume hasn't been written yet, but I think it's pretty straightforward.
2) An agent may write chunks 1, 2, and 3 to collector A, then fail over to collector B and write chunks 4 and 5.  Suppose we get acks for 1, 2, 4, and 5. The right thing to do is to apply the acks for 1 and 2, hold the acks for 4 and 5, and, if the timeout occurs, restart from 3.  But right now we just assume that an ack for chunk n+1 implies that chunks 0 through n have all committed, which isn't right when acks can arrive with gaps.

There are two plausible fixes. The first is to automatically reset each running adaptor whenever we switch collectors. This makes (2) very easy to solve, at the expense of making dynamic load-balancing harder.  The second is to use timeouts and confront (2) head-on.
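
To make the second option concrete, here is a rough sketch of the bookkeeping that (2) seems to call for: apply acks only while they extend the contiguous prefix, hold anything past a gap, and on timeout restart from the first missing chunk. AckTracker and its methods are hypothetical names for illustration and are not in the patch. The sketch assumes chunks are numbered 1, 2, 3, ... per adaptor, and it assumes held acks can be dropped on timeout because the rewound adaptor will resend those chunks.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Chukwa: tracks acks for chunks numbered 1, 2, 3, ... */
class AckTracker {
    // Highest n such that chunks 1..n have all been acked (0 means none yet).
    private long committed = 0;
    // Acks received for chunks beyond a gap; held until the gap is filled.
    private final Set<Long> held = new HashSet<>();

    /** Records an ack and returns the new committed high-water mark. */
    long onAck(long chunk) {
        if (chunk > committed) {
            held.add(chunk);
        }
        // Apply held acks only while they extend the contiguous prefix.
        while (held.remove(committed + 1)) {
            committed++;
        }
        return committed;
    }

    /**
     * Called when the ack timeout fires. Drops held acks (an assumption:
     * the rewound adaptor resends those chunks) and returns the chunk
     * number to restart from.
     */
    long onTimeout() {
        held.clear();
        return committed + 1;
    }

    long committed() { return committed; }

    int heldCount() { return held.size(); }
}
```

With the scenario above, feeding acks for 1, 2, 4, and 5 leaves committed() at 2 with two acks held, and onTimeout() returns 3. If the ack for 3 shows up before the timeout instead, the held acks are applied and committed() jumps to 5.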

> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>
>         Attachments: delayedAcks.patch
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite, since we don't handle collector crashes.  Here's a proposed reliability mechanism.
