incubator-chukwa-user mailing list archives

From Ariel Rabkin
Subject Re: chukwa suitability for collecting data from multiple datanodes to HDFS
Date Tue, 30 Jun 2009 08:35:05 GMT

We're in the final stages of rolling a 0.2 release.  You're probably
better off playing with trunk, though; I think the two are about
equally reliable, but there are a few bug fixes that are trunk-only.
And if you're doing any coding, it should be against trunk.

One warning: documentation is still sparse. Don't hesitate to email us
if you have any questions.  Anything that's confusing to you, we need
to document or fix.

On Tue, Jun 30, 2009 at 1:09 AM, Ken White wrote:

> Thank you for the pointers! I will take a look at the code and see if anything
> can be done to improve reliability.

Quite a bit can be done.  There are basically four outstanding issues I
know of that prevent us from saying "we have a really robust

- CHUKWA-284 [patch available] fixes one of them.
- The second is that agents assume that any time a collector returns
an OK to a post, the data is the collector's problem; but collectors
assume that they can then lose data.
- The third, related issue is that the collector itself has only a
loose notion of when data is stable, because HDFS hasn't got a real
- The last is that we don't do duplicate detection where you want us
to. We handle duplicates at the very end of the pipeline, where we
update the database that drives the visualizations. But you probably
want to do it at the archiving stage. I just opened CHUKWA-338 for

> If nothing else, I could be running two instances of chukwa on different nodes
> and write the important data to both of them.

Hrm.  That's actually both harder and easier than you think.
Harder, because Collectors have pretty strong expectations for what
their input looks like: it needs to be a list of Chukwa chunks in the
same format used by the ChukwaHttpSender. Which I don't think has any
documentation more accurate than the code. And because the duplicate
detection won't do the right thing in that scenario.  (See above about

Easier, because I think you can actually reuse ChukwaHttpSender.
Probably, the thing to do is to create a new class of Chukwa Connector
that contains multiple senders.
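
To make the "multiple senders" idea concrete, here is a rough sketch of
the fan-out part only. Everything in it is hypothetical: BatchSender
stands in for whatever ChukwaHttpSender actually exposes (I'm not
reproducing its real signature here), FanOutSender is an invented name,
and the chunk type is left generic. Treat it as an illustration of the
shape, not as code that drops into trunk.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "something that posts a batch of chunks to
// one collector". This is NOT the real ChukwaHttpSender interface.
interface BatchSender<C> {
    void send(List<C> batch) throws IOException;
}

// Hypothetical fan-out wrapper: forwards every batch to all wrapped senders.
final class FanOutSender<C> implements BatchSender<C> {
    private final List<BatchSender<C>> delegates;

    FanOutSender(List<BatchSender<C>> delegates) {
        if (delegates.isEmpty()) {
            throw new IllegalArgumentException("need at least one sender");
        }
        // Defensive copy so later changes to the caller's list have no effect.
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public void send(List<C> batch) throws IOException {
        IOException firstFailure = null;
        // Try every delegate, even if an earlier one failed.
        for (BatchSender<C> delegate : delegates) {
            try {
                delegate.send(batch);
            } catch (IOException e) {
                if (firstFailure == null) {
                    firstFailure = e;
                } else {
                    firstFailure.addSuppressed(e);
                }
            }
        }
        // Report failure if any delegate failed, so the caller can retry.
        if (firstFailure != null) {
            throw firstFailure;
        }
    }
}
```

Note the design choice: the batch counts as sent only if every delegate
accepted it. If the caller retries after a partial failure, the
collectors that already succeeded will see the same chunks twice, which
is exactly where the duplicate-detection caveat above starts to matter.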


Ari Rabkin
UC Berkeley Computer Science Department
