incubator-chukwa-user mailing list archives

From: Ariel Rabkin <>
Subject: Re: SocketTeeWriter
Date: Tue, 11 May 2010 18:48:44 GMT
Not quite sure what you want to know. We've been using it
successfully. Total data rates aren't enormous; a few MB/sec per
collector, I think, but it's been benchmarked well past that. The
SocketTee was designed particularly for cases where some data loss is
OK. It won't buffer for later delivery; if you can't tolerate loss,
you have to wait until the HDFS copy is available.
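
For anyone who wants to try it, pulling data from the Tee looks roughly like
the sketch below. The port (9094), the "RAW <filter>" command line, the
one-line acknowledgement, and the 4-byte length prefix per record are
assumptions based on a default collector setup, not a spec; check your
collector configuration and Chukwa version before relying on them.

import java.io.DataInputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;

// Rough client for the SocketTeeWriter. Protocol details (port, command,
// ack line, length prefix) are assumptions; verify against your setup.
public class TeeClientSketch {
  public static void main(String[] args) throws Exception {
    try (Socket sock = new Socket("collector-host", 9094)) {
      PrintWriter cmd = new PrintWriter(
          new OutputStreamWriter(sock.getOutputStream(), "UTF-8"), true);
      // Request raw chunk bodies matching a filter; "datatype=Apache" is
      // only illustrative.
      cmd.println("RAW datatype=Apache");

      DataInputStream in = new DataInputStream(sock.getInputStream());
      skipAckLine(in); // consume the assumed one-line acknowledgement

      while (true) {                     // ends with EOFException on close
        int len = in.readInt();          // assumed 4-byte length prefix
        byte[] body = new byte[len];
        in.readFully(body);
        System.out.println(new String(body, "UTF-8"));
      }
    }
  }

  // Read and discard bytes up to and including the first newline.
  private static void skipAckLine(DataInputStream in) throws Exception {
    int b;
    while ((b = in.read()) != -1 && b != '\n') {
      // skip
    }
  }
}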


On Tue, May 11, 2010 at 11:37 AM, Jerome Boulon <> wrote:
> Hey Corbin,
> What kind of partitioner do you need?
> I'm using one based on a hashing function of the key.
> Let me know if that would work for you.
> Regarding the TeeWriter, I would also like to get feedback on it. Ari?
> /Jerome.
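
(For context, the hash-based partitioner Jerome mentions would look roughly
like the sketch below, written against the old mapred API. The class name,
packages, and key accessor are assumptions; adapt them to whatever key and
value types your demux job actually uses.)

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch: spread demux output across reducers by hashing the record key,
// so a single dominant data type no longer lands on one reducer.
public class HashingDemuxPartitioner
    implements Partitioner<ChukwaRecordKey, ChukwaRecord> {

  public void configure(JobConf job) {
    // No configuration needed for a plain hash.
  }

  public int getPartition(ChukwaRecordKey key, ChukwaRecord value,
                          int numReduceTasks) {
    // Mask the sign bit so the modulus never goes negative.
    return (key.getKey().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
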
> On 5/11/10 11:24 AM, "Corbin Hoenes" <> wrote:
>> Eric,
>> Thanks, you guys are spot on with your analysis of our demux issue: right now
>> we have a single data type. We can probably split that into two different
>> types later, but that still won't help much until the partitioning is either
>> pluggable or somewhat configurable, as CHUKWA-481 states.
>> My questions about the Tee are more about the low-latency requirements of
>> creating more real-time feeds of our data. My initial thought was that if
>> I could get data out of Hadoop at 5- or 10-minute intervals, it might be
>> "good enough" for this, so I was interested in speeding up demux a bit. But
>> now I think the right thing will be to use the Tee, get the data into a
>> different system to create these feeds, and let Hadoop handle only the
>> large-scale analysis.
>> The Tee seems perfect... I'll have to try it out; hoping to get feedback from
>> people who are using it like this. Sounds like Ari does.
>> On May 11, 2010, at 12:03 PM, Eric Yang wrote:
>>> Corbin,
>>> Multiple collectors will improve the mapper processing speed, but the
>>> reducer is still the long tail of the demux processing. It sounds like you
>>> have a large amount of a single data type. Your processing will definitely
>>> speed up once CHUKWA-481 is addressed.
>>> Regards,
>>> Eric
>>> On 5/10/10 7:34 PM, "Corbin Hoenes" <> wrote:
>>>> We are processing Apache log files. The current scale is 70-80 GB per
>>>> day, but we'd like to have a story for scaling up beyond that. Just
>>>> checking my collector logs, it appears the data rate still ranges from
>>>> 600 KB to 1.2 MB. This is all from one collector. Does your setup use
>>>> multiple collectors? My thought is that multiple collectors could be used
>>>> to scale out once we reach a data rate that causes issues for a single
>>>> collector. Any chance you know where that data rate is?
>>>> On May 10, 2010, at 5:37 PM, Ariel Rabkin wrote:
>>>>> That's how we use it at Berkeley, to process metrics from hundreds of
>>>>> machines; total data rate less than a megabyte per second, though.
>>>>> What scale of data are you looking at?
>>>>> The intent of SocketTee was for cases where you need some subset of the
>>>>> data now, while write-to-HDFS-and-process-with-Hadoop is still the
>>>>> default path.
>>>>> What sort of low-latency processing do you need?
>>>>> --Ari
>>>>> On Mon, May 10, 2010 at 4:28 PM, Corbin Hoenes <> wrote:
>>>>>> Has anyone used the "Tee" in a larger-scale deployment to try to get
>>>>>> real-time/low-latency data? Interested in how feasible it would be to
>>>>>> use it to pipe data into another system to handle these low-latency
>>>>>> requests and leave the long-term analysis to Hadoop.
>>>>> --
>>>>> Ari Rabkin
>>>>> UC Berkeley Computer Science Department

Ari Rabkin
UC Berkeley Computer Science Department
