From: Corbin Hoenes <corbin@tynt.com>
To: chukwa-user@hadoop.apache.org
Subject: Re: SocketTeeWriter
Date: Wed, 12 May 2010 14:11:20 -0600
Message-Id: <9D31A8D9-41DC-4BD5-9C00-4F4E21314A51@tynt.com>
Jerome,

I would like to take a look at your partitioner if possible, to see if it'll work for us. I am not sure what would be best to partition on. I am thinking a hash of the ChukwaArchiveKey.getTimePartition() would be a decent partitioner--but I'm still a noob, so I'm not sure of the criteria for a good partitioner. Did you just modify ChukwaRecordPartitioner?

On May 11, 2010, at 12:37 PM, Jerome Boulon wrote:

> Hey Corbin,
>
> What kind of partitioner do you need?
> I'm using one based on a hashing function of the key.
> Let me know if that would work for you?
>
> Regarding the TeeWriter, I would also like to get feedback on it, Ari?
>
> /Jerome.
>
> On 5/11/10 11:24 AM, "Corbin Hoenes" <corbin@tynt.com> wrote:
>
>> Eric,
>>
>> Thanks--you guys are spot on with your analysis of our demux issue: right now
>> we have a single data type. We can probably split that into two different
>> types later, but that still won't help much until the partitioning is either
>> pluggable or somewhat configurable, as CHUKWA-481 states.
>>
>> My questions about the Tee are more related to the low-latency requirements of
>> creating more realtime-like feeds of our data. My initial thought was that if
>> I could get data out of hadoop in 10 or 5 minute intervals, it might be
>> "good enough" for this, so I was interested in speeding up demux a bit. But
>> now I think the right thing will be using the Tee and getting the data into a
>> different system to create these feeds, and letting hadoop handle the
>> large-scale analysis only.
>>
>> The Tee seems perfect... will have to try it out; hoping to get feedback from
>> people that are using it like this. Sounds like Ari does.
>>
>> On May 11, 2010, at 12:03 PM, Eric Yang wrote:
>>
>>> Corbin,
>>>
>>> Multiple collectors will improve the mapper processing speed, but the
>>> reducer is still the long tail of the demux processing. It sounds like you
>>> have a large amount of the same type of data. It will definitely speed up
>>> your processing once CHUKWA-481 is addressed.
>>>
>>> Regards,
>>> Eric
>>>
>>> On 5/10/10 7:34 PM, "Corbin Hoenes" <corbin@tynt.com> wrote:
>>>
>>>> We are processing apache log files. The current scale is 70-80GB per
>>>> day... but we'd like to have a story for scaling up to more. Just checking
>>>> my collector logs, it appears the data rate still ranges from 600KB-1.2MB.
>>>> This is all from one collector. Does your setup use multiple collectors?
>>>> My thought is that multiple collectors could be used to scale out once we
>>>> reach a data rate that causes issues for a single collector.
>>>>
>>>> Any chance you know where that data rate is?
>>>>
>>>> On May 10, 2010, at 5:37 PM, Ariel Rabkin wrote:
>>>>
>>>>> That's how we use it at Berkeley, to process metrics from hundreds of
>>>>> machines; total data rate less than a megabyte per second, though.
>>>>> What scale of data are you looking at?
>>>>>
>>>>> The intent of SocketTee was if you need some subset of the data now,
>>>>> while write-to-HDFS-and-process-with-Hadoop is still the default path.
>>>>> What sort of low-latency processing do you need?
>>>>>
>>>>> --Ari
>>>>>
>>>>> On Mon, May 10, 2010 at 4:28 PM, Corbin Hoenes <corbin@tynt.com> wrote:
>>>>>> Has anyone used the "Tee" in a larger-scale deployment to try to get
>>>>>> real-time/low-latency data? Interested in how feasible it would be to
>>>>>> use it to pipe data into another system to handle these low-latency
>>>>>> requests and leave the long-term analysis to hadoop.
>>>>>>
>>>>>
>>>>> --
>>>>> Ari Rabkin asrabkin@gmail.com
>>>>> UC Berkeley Computer Science Department
>>>>
>>>
>>
>
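[Editor's note: the hash-of-time-partition idea discussed above can be sketched independently of the Hadoop API. This is only an illustration: `TimePartitionHasher` is a made-up name, not part of Chukwa, and a real implementation would extend Hadoop's `Partitioner` and call `ChukwaArchiveKey.getTimePartition()` on the map output key.]

```java
// Minimal sketch of a hash-based partitioner for demux, assuming the map
// output key exposes a time-partition value the way
// ChukwaArchiveKey.getTimePartition() does. A real version would implement
// Hadoop's Partitioner interface; this standalone class shows only the
// bucketing logic.
public class TimePartitionHasher {

    // Spread time partitions across reducers. Masking with
    // Integer.MAX_VALUE keeps the result non-negative (the same trick
    // Hadoop's HashPartitioner uses).
    public static int getPartition(long timePartition, int numReduceTasks) {
        int hash = Long.hashCode(timePartition);
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        long t = 1273694400000L; // an example time-partition value
        // All records in the same time bucket land on the same reducer.
        System.out.println(getPartition(t, 8) == getPartition(t, 8)); // prints "true"
    }
}
```

Note that hashing the time partition alone only helps if records are spread across many time buckets; if one bucket dominates, the reducer skew described in the thread (and in CHUKWA-481) would remain.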