incubator-chukwa-user mailing list archives

From Corbin Hoenes <>
Subject Re: SocketTeeWriter
Date Tue, 11 May 2010 18:24:39 GMT

Thanks, you guys are spot on with your analysis of our demux issue: right now we have a single
data type.  We can probably split that into two different types later, but that still won't help
much until the partitioning is pluggable or at least configurable, as CHUKWA-481 describes.
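To make the bottleneck concrete, here is a toy sketch (not Chukwa's actual demux code; the record shape and both partition functions are hypothetical) of why a single data type funnels everything to one reducer, and how a pluggable partitioner along the lines of CHUKWA-481 could spread the load:

```python
# Toy illustration (not Chukwa code): with one data type, partitioning
# by type alone sends every record to a single reducer -- the "long tail"
# Eric mentions. A pluggable partitioner (what CHUKWA-481 asks for) could
# mix in a per-record key so the same type spreads across all reducers.

def partition_by_type(record, num_reducers):
    """One data type -> one partition: a single reducer gets everything."""
    return hash(record["type"]) % num_reducers

def partition_by_type_and_key(record, num_reducers):
    """Hypothetical pluggable scheme: add a per-record shard key."""
    return (hash(record["type"]) + record["key"]) % num_reducers

records = [{"type": "ApacheLog", "key": i} for i in range(1000)]
print(len({partition_by_type(r, 10) for r in records}))          # → 1
print(len({partition_by_type_and_key(r, 10) for r in records}))  # → 10
```

With one data type, the first scheme uses 1 of 10 reducers; the second keeps all 10 busy.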

My questions about the Tee are more about the low-latency requirements of building more real-time
feeds of our data.  My initial thought was that if I could get data out of Hadoop at 5- or
10-minute intervals, that might be "good enough" for this, so I was interested in speeding
up demux a bit.  But now I think the right approach is to use the Tee to get the data
into a different system that creates these feeds, and let Hadoop handle the large-scale analysis.

The Tee seems perfect...I'll have to try it out, and I'm hoping to get feedback from people who are
using it this way.  Sounds like Ari does.
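For anyone curious what "using the Tee like this" looks like, here is a minimal, self-contained sketch of the pattern: a client subscribes to a collector-side socket and consumes a live copy of the stream, while the normal write-to-HDFS/demux path is untouched. The fake server and the "RAW all" command line are placeholders, not necessarily SocketTeeWriter's real wire protocol; check the Chukwa docs for the actual port and command syntax.

```python
# Sketch of the tee-consumption pattern (placeholder protocol, not the
# exact SocketTeeWriter wire format): subscribe over TCP, read an ack,
# then consume newline-delimited chunks as they stream in.
import socket
import threading

def start_fake_tee(chunks):
    """Stand-in for a collector's tee port: accepts one client, acks the
    subscribe command, streams the chunks, then closes."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))           # ephemeral port avoids conflicts
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        conn.recv(1024)                  # read the client's command line
        conn.sendall(b"OK\n")            # acknowledge the subscription
        for c in chunks:
            conn.sendall(c + b"\n")      # stream each chunk as a line
        conn.close()
        srv.close()

    t = threading.Thread(target=serve)
    t.start()
    return srv.getsockname()[1], t

def consume_tee(port, command):
    """Subscribe to the tee and collect the streamed lines."""
    sock = socket.create_connection(("127.0.0.1", port))
    sock.sendall(command + b"\n")
    f = sock.makefile("rb")
    assert f.readline() == b"OK\n"       # server acks before streaming
    lines = [line.rstrip(b"\n") for line in f]
    sock.close()
    return lines

port, server = start_fake_tee([b"chunk-1", b"chunk-2"])
received = consume_tee(port, b"RAW all")
server.join()
print(received)                          # → [b'chunk-1', b'chunk-2']
```

In a real deployment the consumer would push each chunk into whatever low-latency store serves the feeds, leaving the HDFS sink untouched.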

On May 11, 2010, at 12:03 PM, Eric Yang wrote:

> Corbin,
> Multiple collectors will improve the mapper processing speed, but the
> reducer is still the long tail of the demux processing. It sounds like you
> have a large amount of a single data type.  It will definitely speed up your
> processing once CHUKWA-481 is addressed.
> Regards,
> Eric 
> On 5/10/10 7:34 PM, "Corbin Hoenes" <> wrote:
>> We are processing Apache log files.  The current scale is 70-80GB per
>> day...but we'd like to have a story for scaling up to more. Just checking
>> my collector logs, it appears the data rate still ranges from 600KB-1.2 MB.
>> This is all from one collector.  Does your setup use multiple collectors?  My
>> thought is that multiple collectors could be used to scale out once we reach a
>> data rate that causes issues for a single collector.
>> Any chance you know where that data rate is?
>> On May 10, 2010, at 5:37 PM, Ariel Rabkin wrote:
>>> That's how we use it at Berkeley, to process metrics from hundreds of
>>> machines; total data rate less than a megabyte per second, though.
>>> What scale of data are you looking at?
>>> The intent of SocketTee was for cases where you need some subset of the data now,
>>> while write-to-HDFS-and-process-with-Hadoop remains the default path.
>>> What sort of low-latency processing do you need?
>>> --Ari
>>> On Mon, May 10, 2010 at 4:28 PM, Corbin Hoenes <> wrote:
>>>> Has anyone used the "Tee" in a larger-scale deployment to try to get
>>>> real-time/low-latency data?  I'm interested in how feasible it would be to use
>>>> it to pipe data into another system that handles these low-latency requests and
>>>> leave the long-term analysis to Hadoop.
>>> -- 
>>> Ari Rabkin
>>> UC Berkeley Computer Science Department
