chukwa-dev mailing list archives

From "Ari Rabkin (JIRA)" <>
Subject [jira] Commented: (CHUKWA-369) proposed reliability mechanism
Date Wed, 19 Aug 2009 03:45:14 GMT


Ari Rabkin commented on CHUKWA-369:

OK.  I've now modified AsyncAckSender so that it can take a separate list of collectors
to use for checking file lengths.

But I just realized there are two deeper problems with my approach. 

1) Suppose that an Ack doesn't arrive. What then?  The code to rewind adaptors to the last
checkpoint and resume hasn't been written yet.  But I think it's pretty straightforward.
2) It's possible that an agent writes chunks 1, 2 and 3 to collector A, and then fails over
to collector B and writes chunks 4 and 5.  Suppose we get Acks for 1, 2, 4 and 5. The right thing
to do is to apply the acks for 1 and 2, hold the acks for 4 and 5, and then, if the timeout
occurs, restart from 3.  But right now, we just assume that an ack for chunk n+1 implies that
chunks 0 through n have all committed. That assumption fails in the failover case above.
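
A minimal sketch of the "apply the contiguous acks, hold the rest" logic from (2). This is not the code in the patch: the class name AckWindow and its methods are hypothetical, and chunks are assumed to carry consecutive sequence numbers as in the 1 to 5 example above.

```java
import java.util.TreeSet;

/** Hypothetical helper: tracks acks that may arrive with gaps. */
class AckWindow {
    // Highest chunk n such that every chunk up to and including n is acked.
    private long committed;
    // Acks received for chunks that lie beyond a gap; held, not yet applied.
    private final TreeSet<Long> held = new TreeSet<>();

    AckWindow(long lastCommitted) {
        this.committed = lastCommitted;
    }

    /** Record an ack and advance the committed prefix as far as possible. */
    void ack(long chunk) {
        if (chunk <= committed) {
            return; // duplicate or stale ack
        }
        held.add(chunk);
        // Apply held acks only while they extend the prefix without a gap.
        while (!held.isEmpty() && held.first() == committed + 1) {
            committed = held.pollFirst();
        }
    }

    /** Last chunk known to be committed with no gaps before it. */
    long committedThrough() {
        return committed;
    }

    /** Chunk to resend from if the timeout fires. */
    long restartFrom() {
        return committed + 1;
    }

    /** True while some acks are being held behind a missing one. */
    boolean hasGap() {
        return !held.isEmpty();
    }
}
```

With acks for 1, 2, 4 and 5, committedThrough() is 2, the acks for 4 and 5 stay held, and restartFrom() is 3. If the ack for 3 arrives before the timeout, the held acks are applied and the gap closes.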

There are two plausible fixes. The first is to automatically reset each running adaptor whenever
we switch collectors. This makes (2) very easy to solve, at the expense of making dynamic
load-balancing harder.  The second is to use timeouts, and to really confront (2) head-on.
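
The first fix could look roughly like the sketch below, under stated assumptions: the class FailoverReset, its method names, and the use of a per-adaptor numeric position are all hypothetical and are not taken from the Chukwa code. On a collector switch, every adaptor is rewound to its last acked position, so the new collector never sees a stream with a hole in it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: rewind every adaptor when the agent switches collectors. */
class FailoverReset {
    // adaptor name -> last position acked by a collector
    private final Map<String, Long> committed = new HashMap<>();
    // adaptor name -> last position sent, acked or not
    private final Map<String, Long> sent = new HashMap<>();

    void onSend(String adaptor, long position) {
        sent.put(adaptor, position);
    }

    void onAck(String adaptor, long position) {
        // Keep the largest acked position seen so far.
        committed.merge(adaptor, position, Long::max);
    }

    /**
     * Called when the agent fails over to another collector.
     * Returns, for each running adaptor, the position to resume from.
     */
    Map<String, Long> onCollectorSwitch() {
        Map<String, Long> resume = new HashMap<>();
        for (Map.Entry<String, Long> e : sent.entrySet()) {
            long lastAcked = committed.getOrDefault(e.getKey(), 0L);
            e.setValue(lastAcked);          // forget unacked sends
            resume.put(e.getKey(), lastAcked);
        }
        return resume;
    }
}
```

The cost mentioned above shows up directly: every switch resends all unacked data, so moving agents between collectors for load balancing becomes more expensive.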

> proposed reliability mechanism
> ------------------------------
>                 Key: CHUKWA-369
>                 URL:
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.3.0
>         Attachments: delayedAcks.patch
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite,
> since we don't handle collector crashes.  Here's a proposed reliability mechanism.

