Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Reply-To: user@hadoop.apache.org
MIME-Version: 1.0
Date: Sat, 20 Oct 2012 11:27:25 +0800
Subject: Re: Hadoop counter
From: Lin Ma
To: user@hadoop.apache.org, michael_segel@hotmail.com
Content-Type: multipart/alternative; boundary=20cf307ac79bffa6a204cc753244

--20cf307ac79bffa6a204cc753244
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Thanks for the detailed reply Mike, I learned a lot from the discussion.
- I just want to confirm with you: within the same job, when a specific task completes (and, from our discussion, its counter is aggregated in the JT after the task completes?), the other running tasks in the same job cannot get the updated counter value from the previously completed task? I am asking because I am wondering whether I can use a counter to share a global value between tasks.
- If so, what is the traditional use case for counters: only using the counter values after the whole job completes?

BTW: I would appreciate it if you could share a few use cases from your experience of how counters are used.

regards,
Lin

On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <michael_segel@hotmail.com> wrote:
Yeah, sorry...

I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating their own counter foo.
As the job runs, these counters will eventually be sent up to the JT. The job tracker would keep a separate counter for each task.

At the end, the final count is aggregated from the list of counters for foo.
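
The flow described above can be modelled in a few lines of plain Java. This is only a toy sketch under stated assumptions, not the Hadoop API: the class `CounterModel`, its methods, and the task ids are hypothetical names, and the "tracker" is just a map keyed by task id. It shows each task holding its own `foo`, with the job total computed only from the per-task list.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not the Hadoop API): the tracker keeps one counter map per task
// and only sums them when a job-level total is requested.
final class CounterModel {
    // taskId -> (counterName -> value); all names are illustrative only.
    private final Map<String, Map<String, Long>> perTask = new HashMap<>();

    // A task reports its own current value; it never sees other tasks' values.
    void report(String taskId, String counter, long value) {
        perTask.computeIfAbsent(taskId, k -> new HashMap<>()).put(counter, value);
    }

    // Job-level total: the sum of that counter over every task that reported it.
    long total(String counter) {
        long sum = 0;
        for (Map<String, Long> counters : perTask.values()) {
            sum += counters.getOrDefault(counter, 0L);
        }
        return sum;
    }

    public static void main(String[] args) {
        CounterModel jt = new CounterModel();
        jt.report("map_0", "foo", 3);
        jt.report("map_1", "foo", 4);
        jt.report("map_0", "foo", 5); // a later report replaces map_0's earlier value
        System.out.println(jt.total("foo")); // prints 9
    }
}
```

Note that nothing in this model gives `map_1` a way to read `map_0`'s value; only the tracker side can see both.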


I don't know how you can get a task to ask the Job Tracker for information on how things are going in other tasks. That is what I meant when I said that you couldn't get information about the other counters or even the status of the other tasks running in the same job.

I didn't see anything in the APIs that allowed for that type of flow... Of course, having said that... someone pops up with a way to do just that. ;-)


Does that clarify things?

-Mike

On Oct 19, 2012, at 11:56 AM, Lin Ma <linlma@gmail.com> wrote:

Hi Mike,

Sorry, I am a bit lost... You are thinking faster than me. :-P

From this statement of yours, "It would make sense that the JT maintains a unique counter for each task until the tasks complete." -- it seems tasks cannot see each other's counters, since the JT maintains a unique counter for each task;

From this comment of yours, "I meant that if a Task created and updated a counter, a different Task has access to that counter." -- it seems different tasks could share/access the same counter.

I would appreciate it if you could help clarify a bit.

regards,
Lin

On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <michael_segel@hotmail.com> wrote:

On Oct 19, 2012, at 11:27 AM, Lin Ma <linlma@gmail.com> wrote:

Hi Mike,

Thanks for the detailed reply. Two quick questions/comments,

1. For "task", you mean a specific mapper instance, or a specific reducer instance?

Either.

2. "However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous." -- do you mean that if a mapper is updating custom counter ABC, and another mapper is updating the same custom counter ABC, their counter values are updated independently by the different mappers, and will not be published (aggregated) externally until the job has completed successfully?

I meant that if a Task created and updated a counter, a different Task has access to that counter.

To give you an example, if I want to count the number of quality errors and then fail after X number of errors, I can't use Global counters to do this.
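
One way to picture this limitation: a task can only count the errors it has seen itself, so a job-wide limit can only be checked once the per-task counts have been added up. The sketch below is plain Java, not Hadoop code, and is an illustration under assumptions rather than a recommended design: `QualityGate`, `countErrors`, `jobExceedsLimit`, and the rule that an empty record is a "quality error" are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration (not the Hadoop API) of a local limit versus a job-wide limit.
final class QualityGate {
    // Counts bad records in one task's input. The task can fail early,
    // but only against its own local limit, because it sees only its own count.
    static long countErrors(List<String> records, long perTaskLimit) {
        long errors = 0;
        for (String r : records) {
            if (r.isEmpty()) { // hypothetical quality rule
                errors++;
                if (errors > perTaskLimit) {
                    throw new IllegalStateException("task exceeded its local error limit");
                }
            }
        }
        return errors;
    }

    // Job-wide check, possible only after every task has reported its final count.
    static boolean jobExceedsLimit(long[] perTaskErrors, long jobLimit) {
        long total = 0;
        for (long e : perTaskErrors) {
            total += e;
        }
        return total > jobLimit;
    }

    public static void main(String[] args) {
        long a = countErrors(Arrays.asList("x", "", "y"), 2); // 1 error
        long b = countErrors(Arrays.asList("", "", "z"), 2);  // 2 errors
        System.out.println(jobExceedsLimit(new long[] {a, b}, 2)); // prints true
    }
}
```

Neither task trips its local limit here, yet the job as a whole is over the limit, which is only visible after both counts are combined.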

regards,
Lin

On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <michael_segel@hotmail.com> wrote:
As I understand it... each Task has its own counters, and they are independently updated. As the tasks report back to the JT, they update the counter(s)' status. The JT then will aggregate them.

In terms of performance, Counters take up some memory in the JT, so while it's OK to use them, if you abuse them, you can run into issues.
As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (number of TTs) and the number of counters.

In terms of global accessibility... Maybe.
The reason I say maybe is that I'm not sure what you mean by globally accessible.
If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous. Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained.) Note, I haven't looked at the source code so I am probably wrong.
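
The parenthetical about failed tasks can be sketched as follows. This is a toy model of the guess made above, not something checked against the Hadoop source: `AttemptCounters` and the attempt ids are hypothetical, and a single counter is tracked for brevity. Counts are kept per attempt so that a failed attempt's partial count can be dropped before the retry reports.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not the Hadoop API): one counter value per task attempt.
final class AttemptCounters {
    // attemptId -> latest reported value of a single counter, e.g. "foo"
    private final Map<String, Long> byAttempt = new HashMap<>();

    void report(String attemptId, long value) {
        byAttempt.put(attemptId, value);
    }

    // On failure the attempt's partial count is discarded,
    // so the retried attempt is not double counted.
    void fail(String attemptId) {
        byAttempt.remove(attemptId);
    }

    long total() {
        long sum = 0;
        for (long v : byAttempt.values()) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        AttemptCounters jt = new AttemptCounters();
        jt.report("task_0_attempt_0", 7);  // partial progress before the failure
        jt.fail("task_0_attempt_0");       // its count is dropped
        jt.report("task_0_attempt_1", 10); // the retry counts from scratch
        jt.report("task_1_attempt_0", 5);
        System.out.println(jt.total());    // prints 15
    }
}
```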

HTH
Mike
On Oct 19, 2012, at 5:50 AM, Lin Ma <linlma@gmail.com> wrote:

Hi guys,

I have some quick questions regarding Hadoop counters,
  • Is a Hadoop counter (custom defined) globally accessible (for both read and write) to all Mappers and Reducers in a job?
  • What are the performance implications and best practices of using Hadoop counters? I am not sure whether using Hadoop counters too heavily will degrade the performance of the whole job?
regards,
Lin






--20cf307ac79bffa6a204cc753244--