incubator-chukwa-user mailing list archives

From Jerome Boulon <>
Subject Re: SocketTeeWriter
Date Tue, 11 May 2010 18:37:04 GMT
Hey Corbin,

What kind of partitioner do you need?
I'm using one based on a hashing function of the key.
Let me know if that would work for you.
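
For illustration, a partitioner based on a hash of the key can be sketched as below. This is a minimal sketch under stated assumptions, not Chukwa or Hadoop code and not the partitioner referred to above: the function name `partition_for`, the CRC32 hash, and the sample keys are all hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is a stand-in hash (an assumption); it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # Hypothetical keys: a single data type spread over many source hosts.
    keys = [f"ApacheLog/host{i:03d}" for i in range(1000)]
    load = Counter(partition_for(k, 8) for k in keys)
    for partition in sorted(load):
        print(partition, load[partition])
```

The point of the sketch is that the spread across reducers depends on what goes into the key: if every record carries the same key, a hash partitioner still sends them all to one partition.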

Regarding the TeeWriter, I would also like to get feedback on it. Ari?


On 5/11/10 11:24 AM, "Corbin Hoenes" <> wrote:

> Eric,
> Thanks, you guys are spot on with your analysis of our demux issue. Right now
> we have a single data type. We can probably split that into two different
> types later, but that still won't help much until the partitioning is either
> pluggable or somewhat configurable, as CHUKWA-481 states.
> My questions about the Tee are more related to the low-latency requirements
> of creating more real-time feeds of our data. My initial thought was that if
> I could get data out of Hadoop in 10- or 5-minute intervals, it might be
> "good enough" for this, so I was interested in speeding up demux a bit. But
> now I think the right thing will be using the Tee and getting the data into a
> different system to create these feeds, and letting Hadoop handle only the
> large-scale analysis.
> The Tee seems perfect. I will have to try it out, and I am hoping to get
> feedback from people that are using it like this. Sounds like Ari does.
> On May 11, 2010, at 12:03 PM, Eric Yang wrote:
>> Corbin,
>> Multiple collectors will improve the mapper processing speed, but the
>> reducer is still the long tail of the demux processing. It sounds like you
>> have a large amount of the same type of data. Your processing will
>> definitely speed up once CHUKWA-481 is addressed.
>> Regards,
>> Eric 
>> On 5/10/10 7:34 PM, "Corbin Hoenes" <> wrote:
>>> We are processing Apache log files. The current scale is 70-80 GB per
>>> day, but we'd like to have a story for scaling up to more. Just checking
>>> my collector logs, it appears the data rate still ranges from 600 KB to
>>> 1.2 MB.
>>> This is all from one collector. Does your setup use multiple collectors?
>>> My thought is that multiple collectors could be used to scale out once we
>>> reach a data rate that causes issues for a single collector.
>>> Any chance you know where that data rate is?
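
As a rough cross-check of the figures quoted above, the daily volume can be converted to an average rate. This is a minimal sketch that assumes decimal units (1 GB = 1000 MB) and a load spread evenly over the day; the function name is hypothetical, and the message does not state the unit of time for the collector rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_mb_per_s(gb_per_day: float) -> float:
    """Average throughput in MB/s for a daily volume in GB (1 GB = 1000 MB)."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


for volume in (70, 80):
    print(f"{volume} GB/day is about {average_rate_mb_per_s(volume):.2f} MB/s")
# prints:
# 70 GB/day is about 0.81 MB/s
# 80 GB/day is about 0.93 MB/s
```

If the collector figures are per second, an average of roughly 0.8 to 0.9 MB/s sits inside the reported range.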
>>> On May 10, 2010, at 5:37 PM, Ariel Rabkin wrote:
>>>> That's how we use it at Berkeley, to process metrics from hundreds of
>>>> machines; total data rate less than a megabyte per second, though.
>>>> What scale of data are you looking at?
>>>> The intent of SocketTee is to cover the case where you need some subset of
>>>> the data now, while write-to-HDFS-and-process-with-Hadoop is still the
>>>> default path.
>>>> What sort of low-latency processing do you need?
>>>> --Ari
>>>> On Mon, May 10, 2010 at 4:28 PM, Corbin Hoenes <> wrote:
>>>>> Has anyone used the "Tee" in a larger-scale deployment to try to get
>>>>> real-time/low-latency data? I am interested in how feasible it would be
>>>>> to use it to pipe data into another system to handle these low-latency
>>>>> requests and leave the long-term analysis to Hadoop.
>>>> -- 
>>>> Ari Rabkin
>>>> UC Berkeley Computer Science Department
