incubator-chukwa-user mailing list archives

From Eric Yang <>
Subject Re: Begin a discussion about Chukwa as a top level project
Date Fri, 09 Apr 2010 20:31:03 GMT
For building an end-to-end data collection, analysis, and visualization
system for distributed systems, there are many components that are not in the
Hadoop ecosystem.  For example, post-processed data could be loaded into
HBase, Voldemort, MySQL, Hive, or Pig+Zebra.  Each component targets a
different type of use case.  The Chukwa collector could run as Jetty web
servers or be replaced with Tomcat.  With so many swappable components, it may
make more sense to have Chukwa as a top-level project.  This is the reason
that I vote for TLP.


On 4/9/10 11:36 AM, "Ariel Rabkin" <> wrote:

> Howdy.
> Thinking harder about the long-term roadmap sounds like a good idea. I
> think I agree with Jerome's overall philosophy. Thinking of Chukwa as
> a toolkit for putting together monitoring deployments sounds right as
> the overall vision.  And I think it's fine to have one [or more]
> "default configurations" that suit a large fraction of the user base.
> The original adaptor-agent-collector-HDFS pipeline was designed to
> cope with large numbers of small sources; logfiles, metrics, and so
> forth. But it's become increasingly clear to me that a big fraction of
> our user base wants to use Chukwa to move and process much higher-rate
> sources, like clickthrough logs. And in that case, the first steps of
> the pipeline aren't really right.   We do have things like the
> backfilling tool to help cope with this. But it'd be nice to make
> everything clean and neat, rather than a special case.
> --Ari
> On Fri, Apr 9, 2010 at 11:24 AM, Jerome Boulon <> wrote:
>> I would like to take advantage of this to see a long-term roadmap for Chukwa
>> before anything else.
>> I personally started Chukwa 2 years ago to explore and push the Hadoop
>> limits, moving from a pure batch system to something in between online
>> analytics and hourly/daily analytics, but still on top of Hadoop.
>> My personal goal is to have a robust data collection pipeline and a robust
>> processing pipeline on top of the Hadoop ecosystem.
>> I personally don't need any UI, just a robust and efficient backend that can
>> get the job done.
>> I would like to be able to natively talk to any data store
>> (Hive/Zebra/HBase/Voldemort/...) if it makes sense from a user perspective,
>> but I don't want Chukwa to be yet another NoSQL project.
>> Also, the more I use Chukwa in different places to solve different
>> problems, the more I think that Chukwa should be an SDK instead of trying to
>> be an end-to-end system.
>> People have different agendas and requirements: some need to use Avro, some
>> Thrift, some Pig or Hive, etc.
>> Having a one-size-fits-all system seems good, but I can see some issues in
>> trying to have an end-to-end system for everyone. Just an example: I'm
>> running a modified version of Chukwa in production since I need to load the
>> Demux output into Hive. This will make some people happy, but some will not
>> be able to use Hive, so what should I do? Commit my changes that will break
>> the current workflow and lose some of our users?
>> On the other hand, some are just using the data collection pipeline and not
>> the Demux, pushing the data to another store directly from ChukwaCollector.
>> I can see some valuable components here to stream directly to MySQL, HBase,
>> or Voldemort. Or, like others, you may want to optimize the data store for
>> online display only. All of them have valid points and cases, but if we are
>> not an SDK but a product that you install, then those choices will make some
>> people unhappy.
>> I'm not saying that I don't want a product in the end; maybe it's just a
>> refactoring to split the Chukwa tree into components and then have different
>> pipelines/assemblies.
>> I think that we should first clearly state why we are, or are not, working
>> on or just using Chukwa, and I hope that based on this discussion the choice
>> will be easier.
