incubator-chukwa-user mailing list archives

From Ariel Rabkin <asrab...@gmail.com>
Subject Re: SocketTeeWriter
Date Wed, 12 May 2010 20:30:21 GMT
There are basically three requirements for a partitioner:

- It should be deterministic:  within a given job, the same key goes
to the same reducer every time.
- It should distribute your data into roughly equally sized bins.
Hash of time should work fine and dandy.
- It should group together data that the same reducer should see.

There's often a tension between #2 and #3.  The Chukwa default
emphasizes #3; it sounds like #2 is primary for you.
Your suggestion of partitioning by timestamp should work fine if you
don't need to compare across records in demux.
I think it should be straightforward for you to modify
ChukwaRecordPartitioner along those lines; there's a rough sketch
below. If you find a way to contribute code back, that'd be great, of
course.
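
To make the shape concrete, here's a rough, untested sketch. The class
name is made up, and I'm assuming the old org.apache.hadoop.mapred API
with ChukwaArchiveKey as the map output key; adapt it to wherever your
keys actually live:

    import org.apache.hadoop.chukwa.ChukwaArchiveKey;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical example, not the real ChukwaRecordPartitioner:
    // spread load by hashing the key's time partition (requirements
    // #1 and #2), at the cost of grouping (#3).
    public class TimeHashPartitioner<V extends Writable>
        implements Partitioner<ChukwaArchiveKey, V> {

      public void configure(JobConf conf) {
        // stateless; nothing to configure
      }

      public int getPartition(ChukwaArchiveKey key, V value,
                              int numReduceTasks) {
        long bucket = key.getTimePartition();
        // fold the long into an int, force it non-negative, then mod
        int hash = (int) (bucket ^ (bucket >>> 32));
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

You'd wire it in with JobConf.setPartitionerClass(). One caveat: if a
single demux run only spans a couple of time partitions, hashing the
time partition alone won't spread load well; mixing in another field,
such as the stream name, should help.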

--Ari

On Wed, May 12, 2010 at 1:11 PM, Corbin Hoenes <corbin@tynt.com> wrote:
> Jerome,
> I would like to take a look at your partitioner if possible to see if it'll
> work for us.   I am not sure what would be best to partition on.  I am
> thinking a hash of the ChukwaArchiveKey.getTimePartition() would be a decent
> partitioner--but I'm still a noob so not sure of the criteria for a good
> partitioner.
> Did you just modify ChukwaRecordPartitioner?
> On May 11, 2010, at 12:37 PM, Jerome Boulon wrote:
>
> Hey Corbin,
>
> What kind of partitioner do you need?
> I'm using one based on a hashing function of the key.
> Let me know if that would work for you.
>
> Regarding the TeeWriter, I would like to also get feedback on it, Ari?
>
> /Jerome.
>
> On 5/11/10 11:24 AM, "Corbin Hoenes" <corbin@tynt.com> wrote:
>
> Eric,
>
> Thanks, you guys are spot on with your analysis of our demux issue--right
> now we have a single data type. We can probably split that into two
> different types later, but that still won't help much until the
> partitioning is either pluggable or somewhat configurable, as CHUKWA-481
> states.
>
> My questions about the Tee are more related to the low-latency
> requirements of creating more realtime-like feeds of our data. My initial
> thought was that if I could get data out of Hadoop in 10 or 5 minute
> intervals, it might be "good enough" for this, so I was interested in
> speeding up demux a bit. But now I think the right thing will be using the
> Tee and getting the data into a different system to create these feeds,
> and letting Hadoop handle only the large-scale analysis.
>
> The Tee seems perfect...will have to try it out; hoping to get feedback
> from people that are using it like this. Sounds like Ari does.
>
> On May 11, 2010, at 12:03 PM, Eric Yang wrote:
>
> Corbin,
>
> Multiple collectors will improve the mapper processing speed, but the
> reducer is still the long tail of the demux processing. It sounds like you
> have a large amount of a single type of data. It will definitely speed up
> your processing once CHUKWA-481 is addressed.
>
> Regards,
> Eric
>
> On 5/10/10 7:34 PM, "Corbin Hoenes" <corbin@tynt.com> wrote:
>
> We are processing Apache log files. The current scale is 70-80GB per
> day...but we'd like to have a story for scaling up to more. Just checking
> my collector logs, it appears the data rate still ranges from 600KB-1.2MB.
> This is all from one collector. Does your setup use multiple collectors?
> My thought is that multiple collectors could be used to scale out once we
> reach a data rate that causes issues for a single collector.
> Any chance you know where that data rate is?
>
> On May 10, 2010, at 5:37 PM, Ariel Rabkin wrote:
>
> That's how we use it at Berkeley, to process metrics from hundreds of
> machines; total data rate less than a megabyte per second, though.
> What scale of data are you looking at?
>
> The intent of SocketTee was for when you need some subset of the data
> now, while write-to-HDFS-and-process-with-Hadoop is still the default
> path. What sort of low-latency processing do you need?
>
> --Ari
>
> On Mon, May 10, 2010 at 4:28 PM, Corbin Hoenes <corbin@tynt.com> wrote:
>
> Has anyone used the "Tee" in a larger-scale deployment to try to get
> real-time/low-latency data? Interested in how feasible it would be to use
> it to pipe data into another system to handle these low-latency requests
> and leave the long-term analysis to Hadoop.
>
> --
> Ari Rabkin asrabkin@gmail.com
> UC Berkeley Computer Science Department


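PS: for anyone wanting to script against the Tee, a client is just a
socket connection. A minimal sketch from memory follows; the hostname
and data type are made up, so double-check the port, filter syntax, and
framing against the SocketTeeWriter javadoc. As I recall, the tee port
defaults to 9094 (chukwaCollector.tee.port), filters use the same syntax
as the Dump command, and in RAW mode each matching chunk's payload comes
back prefixed with a 4-byte length:

    import java.io.DataInputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    public class TeeTail {
      public static void main(String[] args) throws Exception {
        // connect to the collector's tee port (default 9094)
        Socket sock = new Socket("collector.example.com", 9094);

        // ask for the raw payload of chunks matching a data type filter
        Writer out = new OutputStreamWriter(sock.getOutputStream());
        out.write("RAW datatype=MyDataType\n");
        out.flush();

        DataInputStream in = new DataInputStream(sock.getInputStream());
        // skip the one-line acknowledgment before binary data starts
        int b;
        while ((b = in.read()) != -1 && b != '\n') { }

        // stream chunks to stdout; a real client would hand them to
        // whatever low-latency system sits downstream
        while (true) {
          int len = in.readInt();      // 4-byte big-endian length prefix
          byte[] data = new byte[len];
          in.readFully(data);
          System.out.write(data);
          System.out.flush();
        }
      }
    }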

-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department
