chukwa-dev mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: a more fault-tolerant collector
Date Tue, 12 Oct 2010 17:25:14 GMT
We had problems with a single DN the other day and all collectors
ultimately died after trying N failed attempts. I believe at least one
of the failures was during a commit.

I think backpressure sounds like the right approach, but it seems like
there would be some practical challenges, particularly around async
writes or commits to HDFS. Ari, how does this behavior work currently,
and would this be difficult to handle?

In oahc.datacollection.writer.SeqFileWriter there's an exception block
with this in it:

// We don't want to loose anything
log.fatal("IOException when trying to write a chunk, Collector is going to exit!", e);
DaemonWatcher.bailout(-1);
isRunning = false;
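To make the backpressure idea concrete, here is a rough sketch of the alternative: instead of calling DaemonWatcher.bailout(-1) on an IOException, the writer could surface the failure to the collector, which returns an error to the agent; the agent then treats it like a failed post and retries. All names below (WriteResult, writeChunk, postUntilCommitted) are illustrative, not actual Chukwa APIs, and the bounded backoff an agent would need is elided.

```java
public class BackpressureSketch {
    enum WriteResult { COMMITTED, RETRYABLE_FAILURE }

    // Writer side: report the failure instead of exiting the process.
    static WriteResult writeChunk(boolean hdfsUp) {
        try {
            if (!hdfsUp) {
                throw new java.io.IOException("HDFS unavailable");
            }
            return WriteResult.COMMITTED;
        } catch (java.io.IOException e) {
            // We still don't want to lose anything, but we also don't
            // want to kill the collector: let the agent retry instead.
            return WriteResult.RETRYABLE_FAILURE;
        }
    }

    // Agent side: re-post the chunk until the write commits.
    static int postUntilCommitted(boolean[] hdfsAvailability) {
        int attempts = 0;
        for (boolean up : hdfsAvailability) {
            attempts++;
            if (writeChunk(up) == WriteResult.COMMITTED) {
                return attempts;
            }
        }
        return attempts; // gave up after exhausting the schedule
    }

    public static void main(String[] args) {
        // HDFS is down for two attempts, then recovers.
        int attempts = postUntilCommitted(new boolean[]{false, false, true});
        System.out.println("committed after " + attempts + " attempts");
    }
}
```

The key design change is that an HDFS outage becomes a retryable condition visible to the agent rather than a fatal collector error, so data sits in the agent's queue until the write path recovers.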



On Tue, Oct 12, 2010 at 9:57 AM, Eric Yang <eyang@yahoo-inc.com> wrote:
> I thought that is what it is currently doing, with one twist: the commit and
> response is async. Collector exits if the file system is unavailable for an
> extended period of time. If it is not doing what's described above, then we
> definitely should fix it.
>
> Regards,
> Eric
>
>
> On 10/11/10 10:49 PM, "Ariel Rabkin" <asrabkin@gmail.com> wrote:
>
> Howdy.
>
> This is an answer to a question Bill asked me recently: can we
> redesign the Collector process to behave better if the filesystem is
> unavailable?
>
> I think we can do this by backpressure. If the write fails, the
> collector should return an error to the agent. And the agent should
> treat it like a post failure, and retry.  Thoughts?
>
> --Ar
>
> --
> Ari Rabkin asrabkin@gmail.com
> UC Berkeley Computer Science Department
>
>
