From: Lin Ma <linlma@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 23 Oct 2012 15:12:25 +0800
Subject: Re: Hadoop counter

Thanks for the long discussion, Mike. Learned a lot from you.

regards,
Lin

On Tue, Oct 23, 2012 at 11:57 AM, Michael Segel wrote:

> Yup.
> The counters at the end of the job are the most accurate.
>
> On Oct 22, 2012, at 3:00 AM, Lin Ma wrote:
>
> Thanks for the help so much, Mike. I learned a lot from this discussion.
>
> So, the conclusion I take from the discussion is: since how and when the
> JT merges counters in the middle of a running job is undefined, internal
> behavior, it is more reliable to read counters after the whole job
> completes? Agree?
>
> regards,
> Lin
>
> On Sun, Oct 21, 2012 at 8:15 PM, Michael Segel wrote:
>
>> On Oct 21, 2012, at 1:45 AM, Lin Ma wrote:
>>
>> Thanks for the detailed reply, Mike. Yes, most of my confusion is
>> resolved. The last two questions (or comments) are to confirm that my
>> understanding is correct:
>>
>> - Is it a normal use case, or best practice, for a job to automatically
>> consume/read the counters of a previously completed job? I ask because
>> I am not sure whether the main use of counters is human reading and
>> manual analysis, rather than having another job consume the counters
>> automatically.
>>
>> Lin,
>> Every job has a set of counters to maintain job statistics.
>> This is specifically for human analysis and to help understand what
>> happened with your job.
>> It allows you to see how much data is read in by the job and how many
>> records were processed, measured against how long the job took to
>> complete. It also shows you how much data is written back out.
>>
>> In addition to this, a set of use cases for counters in Hadoop centers
>> on quality control. It's normal to chain jobs together to form a job
>> flow. A typical use case for Hadoop is to pull data from various
>> sources, combine them, and do some processing on them, resulting in a
>> data set that gets sent to another system for visualization.
>>
>> In this use case, there are usually data cleansing and validation jobs.
>> As they run, it's possible to track the number of defective records. At
>> the end of that specific job, from the ToolRunner, or whichever job
>> class you used to launch your job, you can get these aggregated
>> counters for the job and determine if the process passed or failed.
>> Based on this, you can exit your program with either a success or a
>> failure flag. Job flow control tools like Oozie can capture this and
>> then decide to continue, or to stop and alert an operator of an error.
>>
>> - I want to confirm my understanding is correct: when each task
>> completes, the JT aggregates/updates the global counter values from the
>> counter values reported by the completed task, but never exposes the
>> global counter values until the job completes? If that is correct, I
>> wonder why the JT does an aggregation each time a task completes,
>> rather than a one-time aggregation when the job completes. Is there a
>> design reason? Thanks.
>>
>> That's a good question. I haven't looked at the code, so I can't say
>> definitively when the JT performs its aggregation. However, while the
>> job is running, we can look at the job tracker web page(s) and see the
>> counter summary. This implies that there has to be some aggregation
>> occurring mid-flight. (It would be trivial to sum the list of counters
>> periodically to update the job statistics.) Note too that if the JT web
>> pages can show a counter, it's possible to write a monitoring tool that
>> watches the job while it is running and kills it mid-flight if a
>> certain counter threshold is met.
>>
>> That is to say, you could in theory write a monitoring process and
>> watch the counters. If, let's say, an error counter hits a
>> predetermined threshold, you could then issue a
>> 'hadoop job -kill <job-id>' command.
>>
>> regards,
>> Lin
>>
>> On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel wrote:
>>
>>> On Oct 19, 2012, at 10:27 PM, Lin Ma wrote:
>>>
>>> Thanks for the detailed reply, Mike. I learned a lot from the
>>> discussion.
>>>
>>> - I just want to confirm: supposing that in the same job a specific
>>> task has completed (and, from our discussion, its counters are
>>> aggregated in the JT after it completes?), the other tasks still
>>> running in the same job cannot get the updated counter value from the
>>> completed task? I am asking because I am wondering whether I can use a
>>> counter to share a global value between tasks.
>>>
>>> Yes, that is correct.
>>> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy
>>> way for a task to query the job tracker. This might have changed in
>>> YARN.
>>>
>>> - If so, what is the traditional use case for counters: only using the
>>> counter values after the whole job completes?
>>>
>>> Yes, the counters are used to provide data at the end of the job...
>>>
>>> BTW: I would appreciate it if you could share a few use cases from
>>> your experience about how counters are used.
>>>
>>> Well, you have your typical job data like the number of records
>>> processed, total number of bytes read, bytes written...
>>>
>>> But suppose you wanted to do some quality control on your input.
>>> So you need to keep track of the count of bad records.
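The per-task bad-record counting described above can be sketched in a small, runnable form. This is a hedged simulation: the validity rule and names are illustrative, and a plain long stands in for the real Hadoop call, which would be `context.getCounter(...).increment(1)` inside `Mapper.map`.

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of per-task bad-record counting. In a real mapper the
// increment would be context.getCounter(Quality.BAD_RECORDS).increment(1);
// a plain counter stands in here so the logic runs on its own.
public class BadRecordCount {

    // Hypothetical validity rule: a record needs at least 3 comma-separated fields.
    static boolean isValid(String record) {
        return record.split(",", -1).length >= 3;
    }

    // Counts bad records over one task's input split, like a Mapper.map loop.
    static long countBad(List<String> split) {
        long bad = 0;
        for (String record : split) {
            if (!isValid(record)) {
                bad++; // real code: context.getCounter(Quality.BAD_RECORDS).increment(1)
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("a,b,c", "broken", "x,y,z,w");
        System.out.println("bad records: " + countBad(split));
    }
}
```

Each task counts only its own split; the job tracker is what rolls the per-task values up.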
>>> If this job is part of a process, you may want to include business
>>> logic in your job to halt the job flow if X% of the records contain
>>> bad data.
>>>
>>> Or your process takes input records and, in processing them, sorts the
>>> records based on some characteristic, and you want to count those
>>> sorted records as you process them.
>>>
>>> For a more concrete example, the Illinois Tollway has these 'fast
>>> pass' lanes where cars equipped with RFID tags can have the tolls
>>> automatically deducted from their accounts rather than paying the toll
>>> manually each time.
>>>
>>> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes
>>> are cheaters, where they drive through the sensor and the sensor
>>> doesn't capture the RFID tag. (Note it's possible to have a false
>>> positive, where the car has an RFID chip but doesn't trip the sensor.)
>>> Pushing the data through a map/reduce job would require the use of
>>> counters.
>>>
>>> Does that help?
>>>
>>> -Mike
>>>
>>> regards,
>>> Lin
>>>
>>> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel wrote:
>>>
>>>> Yeah, sorry...
>>>>
>>>> I meant that if you were dynamically creating a counter foo in the
>>>> Mapper task, then each mapper would be creating its own counter foo.
>>>> As the job runs, these counters will eventually be sent up to the JT.
>>>> The job tracker would keep a separate counter for each task.
>>>>
>>>> At the end, the final count is aggregated from the list of counters
>>>> for foo.
>>>>
>>>> I don't know how you can get a task to ask the Job Tracker for
>>>> information on how things are going in other tasks. That is what I
>>>> meant when I said you couldn't get information about the other
>>>> counters, or even the status of the other tasks running in the same
>>>> job.
>>>>
>>>> I didn't see anything in the APIs that allowed for that type of
>>>> flow... Of course, having said that... someone will pop up with a way
>>>> to do just that. ;-)
>>>>
>>>> Does that clarify things?
>>>>
>>>> -Mike
>>>>
>>>> On Oct 19, 2012, at 11:56 AM, Lin Ma wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Sorry, I am a bit lost... as you are thinking faster than me. :-P
>>>>
>>>> From your statement "It would make sense that the JT maintains a
>>>> unique counter for each task until the tasks complete." -- it seems
>>>> each task cannot see counters from the others, since the JT maintains
>>>> a unique counter for each task;
>>>>
>>>> From your comment "I meant that if a Task created and updated a
>>>> counter, a different Task has access to that counter." -- it seems
>>>> different tasks could share/access the same counter.
>>>>
>>>> I would appreciate it if you could help clarify a bit.
>>>>
>>>> regards,
>>>> Lin
>>>>
>>>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel wrote:
>>>>
>>>>> On Oct 19, 2012, at 11:27 AM, Lin Ma wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for the detailed reply. Two quick questions/comments:
>>>>>
>>>>> 1. By "task", do you mean a specific mapper instance, or a specific
>>>>> reducer instance?
>>>>>
>>>>> Either.
>>>>>
>>>>> 2. "However, I do not believe that a separate Task could connect
>>>>> with the JT and see if the counter exists, or if it could get a
>>>>> value, or even an accurate value, since the updates are
>>>>> asynchronous." -- do you mean that if one mapper is updating custom
>>>>> counter ABC, and another mapper is updating the same custom counter
>>>>> ABC, their counter values are updated independently by the different
>>>>> mappers, and will not be published (aggregated) externally until the
>>>>> job completes successfully?
>>>>>
>>>>> I meant that if a Task created and updated a counter, a different
>>>>> Task has access to that counter.
>>>>>
>>>>> To give you an example: if I want to count the number of quality
>>>>> errors and then fail after X number of errors, I can't use global
>>>>> counters to do this.
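Because a task cannot see the aggregated total mid-job, the "fail after X errors" check has to live in the driver, after the job completes. A minimal, hedged sketch of that gate follows; the threshold semantics and names are illustrative, and in real driver code the two counts would come from `job.getCounters()` after `job.waitForCompletion(true)`.

```java
// Hedged sketch of the driver-side quality gate: after the job completes,
// read the aggregated counters and exit with success or failure so a flow
// tool like Oozie can decide whether to continue. The Job/Counters objects
// are replaced by plain longs so this compiles standalone.
public class QualityGate {

    // Decide the exit code the flow tool will see.
    // Assumed semantics: fail when bad/total exceeds maxBadRatio.
    static int exitCode(long badRecords, long totalRecords, double maxBadRatio) {
        if (totalRecords == 0) return 1;        // no input: treat as failure
        double ratio = (double) badRecords / totalRecords;
        return ratio > maxBadRatio ? 1 : 0;     // 0 = success, nonzero = halt flow
    }

    public static void main(String[] args) {
        // e.g. 42 bad out of 10,000 records with a 1% tolerance
        System.exit(exitCode(42, 10_000, 0.01));
    }
}
```

The driver's exit status is the hand-off point: the map/reduce code only counts, and the pass/fail decision happens once, against the final aggregated values.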
>>>>>
>>>>> regards,
>>>>> Lin
>>>>>
>>>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel wrote:
>>>>>
>>>>>> As I understand it... each Task has its own counters, and they are
>>>>>> independently updated. As the tasks report back to the JT, they
>>>>>> update the counters' status. The JT then aggregates them.
>>>>>>
>>>>>> In terms of performance, counters take up some memory in the JT, so
>>>>>> while it's OK to use them, if you abuse them you can run into
>>>>>> issues. As to limits... I guess that will depend on the amount of
>>>>>> memory on the JT machine, the size of the cluster (number of TTs),
>>>>>> and the number of counters.
>>>>>>
>>>>>> In terms of global accessibility... maybe.
>>>>>>
>>>>>> The reason I say maybe is that I'm not sure what you mean by
>>>>>> globally accessible.
>>>>>> If a task creates and implements a dynamic counter... I know that
>>>>>> it will eventually be reflected in the JT. However, I do not
>>>>>> believe that a separate Task could connect with the JT and see if
>>>>>> the counter exists, or get a value, or even an accurate value,
>>>>>> since the updates are asynchronous. Not to mention that I don't
>>>>>> believe the counters are aggregated until the job ends. It would
>>>>>> make sense that the JT maintains a unique counter for each task
>>>>>> until the tasks complete. (If a task fails, it would have to delete
>>>>>> the counters so that when the task is restarted the correct count
>>>>>> is maintained.) Note, I haven't looked at the source code, so I am
>>>>>> probably wrong.
>>>>>>
>>>>>> HTH,
>>>>>> Mike
>>>>>>
>>>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have some quick questions regarding Hadoop counters:
>>>>>>
>>>>>> - Is a Hadoop counter (custom defined) globally accessible (for
>>>>>> both read and write) by all Mappers and Reducers in a job?
>>>>>> - What are the performance implications and best practices of using
>>>>>> Hadoop counters? I am not sure whether using Hadoop counters too
>>>>>> heavily will degrade the performance of the whole job?
>>>>>>
>>>>>> regards,
>>>>>> Lin
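The bookkeeping surmised in the thread, with the JT keeping a separate counter per task, discarding a failed attempt's counters so a restart doesn't double-count, and summing the surviving values for the rollup, can be sketched as a small simulation. This is hedged: plain-Java stand-ins under those assumed semantics, not actual JobTracker code.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged simulation of the surmised JobTracker counter bookkeeping:
// one counter map per task attempt, dropped when the attempt fails,
// and a rollup summed over the surviving attempts.
public class CounterAggregation {

    // taskId -> (counterName -> value), as reported by each task attempt
    final Map<String, Map<String, Long>> perTask = new HashMap<>();

    // A task reports its latest counter value to the JT.
    void report(String taskId, String counter, long value) {
        perTask.computeIfAbsent(taskId, t -> new HashMap<>()).put(counter, value);
    }

    // A failed attempt's counters are discarded so the restart starts clean.
    void taskFailed(String taskId) {
        perTask.remove(taskId);
    }

    // "Trivial to sum the list of counters" -- the mid-flight or final rollup.
    long aggregate(String counter) {
        return perTask.values().stream()
                .mapToLong(c -> c.getOrDefault(counter, 0L))
                .sum();
    }

    public static void main(String[] args) {
        CounterAggregation jt = new CounterAggregation();
        jt.report("m_000000", "BAD_RECORDS", 3);
        jt.report("m_000001", "BAD_RECORDS", 5);
        jt.taskFailed("m_000001");               // attempt dies; its counters go away
        jt.report("m_000001", "BAD_RECORDS", 4); // restarted attempt reports again
        System.out.println(jt.aggregate("BAD_RECORDS")); // 7, not 12
    }
}
```

The simulation also shows why tasks counting independently is the simpler design: each attempt's contribution stays separable until the rollup, so failures and restarts never corrupt the final total.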