incubator-chukwa-user mailing list archives

From Corbin Hoenes <>
Subject Re: SocketTeeWriter
Date Wed, 12 May 2010 20:11:20 GMT

I would like to take a look at your partitioner if possible to see if it'll work for us.
I am not sure what would be best to partition on.  I am thinking a hash of the
ChukwaArchiveKey.getTimePartition() would be a decent partitioner--but I'm still a noob
so not sure of the criteria for a good

Did you just modify ChukwaRecordPartitioner?
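
A minimal sketch of the time-partition hashing idea mentioned above, written without Hadoop on the classpath so it runs on its own. The class name `TimePartitionHashSketch` and the helper `partitionFor` are hypothetical; the `long` argument stands in for whatever ChukwaArchiveKey.getTimePartition() returns, which is an assumption here, and this is not the code in ChukwaRecordPartitioner.

```java
public class TimePartitionHashSketch {

    // Hypothetical helper: maps a time-partition value to a reducer index.
    // The long parameter stands in for ChukwaArchiveKey.getTimePartition() (assumed).
    static int partitionFor(long timePartition, int numReducers) {
        int hash = Long.hashCode(timePartition);
        // Math.floorMod keeps the result in [0, numReducers) even when hash is negative.
        return Math.floorMod(hash, numReducers);
    }

    public static void main(String[] args) {
        long hourMillis = 60L * 60L * 1000L;
        long base = 1273694400000L; // arbitrary example timestamp in milliseconds
        for (int i = 0; i < 4; i++) {
            long tp = base + i * hourMillis;
            System.out.println(tp + " -> reducer " + partitionFor(tp, 8));
        }
    }
}
```

One property worth noting: every record with the same time partition lands on the same reducer, so a scheme like this spreads load across time windows, not within a single window.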

On May 11, 2010, at 12:37 PM, Jerome Boulon wrote:

> Hey Corbin,
> What kind of partitioner do you need?
> I'm using one based on a hashing function of the key.
> Let me know if that would work for you?
> Regarding the TeeWriter, I would like to also get feedback on it, Ari?
> /Jerome.
> On 5/11/10 11:24 AM, "Corbin Hoenes" <> wrote:
>> Eric,
>> Thanks you guys are spot on with your analysis of our demux issue--right now
>> we have a single data type.  We can probably split that into two different
>> types later but still won't help much until the partitioning is either
>> pluggable or somewhat configurable as CHUKWA-481 states.
>> My questions about the Tee are more related to low latency requirements of
>> creating more realtime like feeds of our data.  My initial thought is that if
>> I could get data out of hadoop in 10 or 5 minute intervals that it might be
>> "good enough" for this so I was interested in speeding up demux a bit.  But
>> now I think the right thing will be using the Tee and getting the data into a
>> different system to create these feeds and let hadoop handle the large scale
>> analysis only.
>> The Tee seems perfect...will have to try it out hoping to get feedback from
>> people that are using it like this.  Sounds like Ari does.
>> On May 11, 2010, at 12:03 PM, Eric Yang wrote:
>>> Corbin,
>>> Multiple collectors will improve the mapper processing speed, but the
>>> reducer is still the long tail of the demux processing. It sounds like you
>>> have a large amount of the same type of data.  It will definitely speed up your
>>> processing once CHUKWA-481 is addressed.
>>> Regards,
>>> Eric 
>>> On 5/10/10 7:34 PM, "Corbin Hoenes" <> wrote:
>>>> We are processing Apache log files.  The current scale is 70-80GB per
>>>> day...but we'd like it to have a story for scaling up to more. Just checking
>>>> my collector logs, it appears the data rate still ranges from 600KB to 1.2MB.
>>>> This is all from one collector.  Does your setup use multiple collectors?
>>>> My thought is that multiple collectors could be used to scale out once we
>>>> reach a data rate that causes issues for a single collector.
>>>> Any chance you know where that data rate is?
>>>> On May 10, 2010, at 5:37 PM, Ariel Rabkin wrote:
>>>>> That's how we use it at Berkeley, to process metrics from hundreds of
>>>>> machines; total data rate less than a megabyte per second, though.
>>>>> What scale of data are you looking at?
>>>>> The intent of SocketTee was if you need some subset of the data now,
>>>>> while write-to-HDFS-and-process-with-Hadoop is still the default path.
>>>>> What sort of low-latency processing do you need?
>>>>> --Ari
>>>>> On Mon, May 10, 2010 at 4:28 PM, Corbin Hoenes <> wrote:
>>>>>> Has anyone used the "Tee" in a larger scale deployment to try to
>>>>>> real-time/low latency data?  Interested in how feasible it would be to use
>>>>>> it to pipe data into another system to handle these low latency requests
>>>>>> and leave the long term analysis to hadoop.
>>>>> -- 
>>>>> Ari Rabkin
>>>>> UC Berkeley Computer Science Department
