chukwa-dev mailing list archives

From "Ari Rabkin (JIRA)" <>
Subject [jira] Commented: (CHUKWA-369) proposed reliability mechanism
Date Wed, 05 Aug 2009 20:31:14 GMT


Ari Rabkin commented on CHUKWA-369:

@Jerome: I don't see how bailing out to avoid disk-full solves the problem of collectors crashing.
 The failure scenario I'm worried about is that LocalWriter writes the data, and then the
collector dies in a non-recoverable way.  The data on disk is now useless, and the Right Thing
is for the agent to retransmit to a different collector.
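To make the failure scenario concrete, here is a rough sketch of the agent-side behaviour described above. It is an illustration only, not Chukwa code: the Collector and Agent classes, the sequence numbers, and the confirm() and fail_over() methods are all hypothetical names, and a collector crash is modelled simply as losing whatever was buffered locally.

```python
class Collector:
    """Hypothetical collector that buffers chunks locally before they reach HDFS."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.buffered = {}  # seq -> chunk, written locally but not yet in HDFS

    def post(self, seq, chunk):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        self.buffered[seq] = chunk

    def crash(self):
        # Non-recoverable crash: locally written data is treated as lost.
        self.alive = False
        self.buffered.clear()


class Agent:
    """Hypothetical agent that keeps every chunk until it is confirmed durable."""

    def __init__(self, collectors):
        self.collectors = collectors
        self.current = 0
        self.pending = {}  # seq -> chunk, sent but not yet confirmed

    def send(self, seq, chunk):
        self.pending[seq] = chunk  # remember the chunk before handing it off
        self.collectors[self.current].post(seq, chunk)

    def confirm(self, seqs):
        # Called once the agent learns these chunks are safely stored.
        for seq in seqs:
            self.pending.pop(seq, None)

    def fail_over(self):
        # Move to the next collector and retransmit everything unconfirmed.
        self.current = (self.current + 1) % len(self.collectors)
        target = self.collectors[self.current]
        for seq in sorted(self.pending):
            target.post(seq, self.pending[seq])
        return target.name
```

The point of the sketch is that the agent forgets a chunk only on confirmation, never on a successful send, so a collector that accepts data and then dies cannot cause loss.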

Certainly, HDFS improvements would help reduce this problem. But I think I can implement my
proposal here in a week or so -- and that gets us reliability even with previous versions
of the filesystem.

@Eric: The point about overloading the name node is a fair one. Let me propose the following:

- Instead of querying HDFS directly, agents should do a GET request to a collector.  The collector
has to do only a single list every few minutes, and cache the results, to satisfy all the
agents. This radically cuts down on traffic to the namenode.  Plus, it isolates the Chukwa DFS from
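
A minimal sketch of the collector-side caching idea, under stated assumptions: CachedListing, list_fn, and the 300-second refresh interval are hypothetical, and list_fn stands in for whatever call actually lists the files in HDFS. The GET handler that would serve the cached result to agents is left out.

```python
import time


class CachedListing:
    """Hypothetical cache: refresh the file listing at most once per ttl_seconds."""

    def __init__(self, list_fn, ttl_seconds=300.0, clock=time.monotonic):
        self._list_fn = list_fn      # assumed stand-in for the real HDFS listing call
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._cached = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if stale:
            # Only this branch touches the namenode; every other request is served
            # from memory, however many agents ask.
            self._cached = list(self._list_fn())
            self._fetched_at = now
        return self._cached
```

With this shape the load on the namenode depends on the number of collectors and the refresh interval, not on the number of agents.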

> proposed reliability mechanism
> ------------------------------
>                 Key: CHUKWA-369
>                 URL:
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite,
> since we don't handle collector crashes.  Here's a proposed reliability mechanism.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
