From: Michael Segel <michael_segel@hotmail.com>
To: user@hadoop.apache.org
Subject: Re: Hadoop counter
Date: Sun, 21 Oct 2012 07:15:54 -0500

On Oct 21, 2012, at 1:45 AM, Lin Ma <linlma@gmail.com> wrote:
Thanks for the detailed reply, Mike. Yes, most of my confusion is resolved. The last two questions (or comments) are to confirm that my understanding is correct:

- Is it a normal use case or best practice for a job to automatically consume/read the counters from a previously completed job? I ask because I am not sure whether the main use case for counters is human reading and manual analysis, rather than having another job consume the counters automatically.

Lin,
Every job has a set of counters that maintain job statistics.
These are primarily for human analysis, to help you understand what happened with your job.
They let you see how much data the job read in and how many records it processed, measured against how long the job took to complete. They also show you how much data was written back out.

In addition to this, a set of use cases for counters in Hadoop centers on quality control. It's normal to chain jobs together to form a job flow.
A typical use case for Hadoop is to pull data from various sources, combine them, and do some processing on them, resulting in a data set that gets sent to another system for visualization.

In this use case, there are usually data cleansing and validation jobs. As they run, it's possible to track the number of defective records. At the end of that specific job, from the ToolRunner, or whichever job class you used to launch your job, you can get the aggregated counters for the job and determine whether the process passed or failed. Based on this, you can exit your program with either a success or a failure flag. Job-flow control tools like Oozie can capture this and then decide to continue, or to stop and alert an operator of an error.
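Roughly, that driver-side check might look like this (a sketch, not tested: the QC enum, the job wiring, and the 5% threshold are all made up, and TaskCounter is the newer-API name for the built-in task counters):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CleansingDriver extends Configured implements Tool {
    // Hypothetical quality-control counter, incremented by the mappers.
    public enum QC { BAD_RECORDS }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "data-cleansing");
        job.setJarByClass(CleansingDriver.class);
        // ... set mapper/reducer/input/output here ...
        if (!job.waitForCompletion(true)) {
            return 1;  // the job itself failed
        }
        // Aggregated, job-wide counter values are available once the job is done.
        long bad   = job.getCounters().findCounter(QC.BAD_RECORDS).getValue();
        long total = job.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        // Fail the flow (e.g. for Oozie to catch) if more than 5% of the input was defective.
        return (total > 0 && bad * 100 / total > 5) ? 1 : 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CleansingDriver(), args));
    }
}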

- I want to confirm my understanding is correct: when each task completes, the JT will aggregate/update the global counter values from the counter values reported by the completed task, but never exposes the global counter values until the job completes? If that is correct, I am wondering why the JT does the aggregation each time a task completes, rather than doing a one-time aggregation when the job completes. Is there a design reason? Thanks.

That's a good question. I haven't looked at the code, so I can't say definitively when the JT performs its aggregation. However, while the job runs, we can look at the job tracker web page(s) and see the counter summary. That implies there has to be some aggregation occurring mid-flight. (It would be trivial to sum the list of counters periodically to update the job statistics.) Note too that if the JT web pages can show a counter, it's possible to write a monitoring tool that watches the job while it's running and kills the job mid-flight if a certain counter threshold is met.

That is to say, you could in theory write a monitoring process that watches the counters. If, let's say, an error counter hits a predetermined threshold, you could then issue a 'hadoop job -kill <job-id>' command.
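Something along these lines, using the old org.apache.hadoop.mapred client API (a sketch only; the "QC"/"BAD_RECORDS" names, the threshold, and the polling interval are made up):

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class CounterWatchdog {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName(args[0]));  // e.g. job_201210210001_0042
        while (job != null && !job.isComplete()) {
            Counters counters = job.getCounters();
            long errors = counters.findCounter("QC", "BAD_RECORDS").getCounter();
            if (errors > 10000) {   // hypothetical threshold
                job.killJob();      // same effect as 'hadoop job -kill <job-id>'
                break;
            }
            Thread.sleep(30000L);   // poll every 30 seconds
        }
    }
}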


regards,
Lin

On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <michael_segel@hotmail.com> wrote:

On Oct 19, 2012, at 10:27 PM, Lin Ma <linlma@gmail.com> wrote:

Thanks for the detailed reply, Mike. I learned a lot from the discussion.

- I just want to confirm with you: within the same job, when a specific task completes (and its counters are aggregated in the JT after the task completes, per our discussion?), the other running tasks in the same job cannot get the updated counter value from the completed task? I am asking because I am wondering whether I can use a counter to share a global value between tasks.

Yes, that is correct.
While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way for a task to query the job tracker. This might have changed in YARN.

- If so, what is the traditional use case for counters? Only using counter values after the whole job completes?

Yes, the counters are used to provide data at the end of the job...

BTW, I'd appreciate it if you could share a few use cases from your experience of how counters are used.

Well, you have your typical job data, like the number of records processed, total number of bytes read, bytes written...

But suppose you wanted to do some quality control on your input.
So you need to keep track of the count of bad records. If this job is part of a process, you may want to include business logic in your job to halt the job flow if X% of the records contain bad data.
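A bare-bones sketch of that bookkeeping in a mapper (new API; the CSV layout, the 3-field validity rule, and the QC enum are invented for illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper extends Mapper<LongWritable, Text, Text, Text> {
    public enum QC { BAD_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {  // hypothetical validity check
            // Each task keeps its own tally; the framework merges them job-wide.
            context.getCounter(QC.BAD_RECORDS).increment(1);
            return;               // drop the defective record
        }
        context.write(new Text(fields[0]), value);
    }
}

A driver like the earlier sketch can then compare QC.BAD_RECORDS against the total input records and halt the flow.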

Or your process takes input records and, in processing them, sorts the records based on some characteristic, and you want to count those sorted records as you process them.

For a more concrete example, the Illinois Tollway has these 'fast pass' lanes where cars equipped with RFID tags can have the tolls automatically deducted from their accounts rather than paying the toll manually each time.

Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are cheaters that drive through the sensor without the sensor capturing an RFID tag. (Note it's possible to have a false positive, where the car has an RFID chip but doesn't trip the sensor.) Pushing the data through a map/reduce job would require the use of counters.

Does that help?

-Mike

regards,
Lin

On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <michael_segel@hotmail.com> wrote:
Yeah, sorry... 

I meant that if you were dynamically creating a counter foo in the Mapper task, then each mapper would be creating its own counter foo.
As the job runs, these counters will eventually be sent up to the JT. The job tracker would keep a separate counter for each task.

At the end, the final count is aggregated from the list of counters for foo.
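In code, the dynamic creation is a single call per record; the "custom"/"foo" names are just illustrative strings (a sketch):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FooCountingMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // First use creates this task's own "foo"; the JT later merges the
        // per-task values into a single job-wide total.
        context.getCounter("custom", "foo").increment(1);
    }
}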


I don't know how you can get a task to ask the Job Tracker how things are going in other tasks. That is what I meant: you can't get information about the other counters, or even the status of the other tasks running in the same job.

I didn't see anything in the APIs that allows for that type of flow... Of course, having said that... someone will pop up with a way to do just that. ;-)


Does that clarify things?

-Mike


On Oct 19, 2012, at 11:56 AM, Lin Ma <linlma@gmail.com> wrote:

Hi Mike,

Sorry, I am a bit lost... as you are thinking faster than me. :-P

From this statement of yours, "It would make sense that the JT maintains a unique counter for each task until the tasks complete." -- it seems each task cannot see the counters of the others, since the JT maintains a unique counter for each task;

From this comment of yours, "I meant that if a Task created and updated a counter, a different Task has access to that counter." -- it seems different tasks could share/access the same counter.

I'd appreciate it if you could help clarify a bit.

regards,
Lin

On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <michael_segel@hotmail.com> wrote:

On Oct 19, 2012, at 11:27 AM, Lin Ma <linlma@gmail.com> wrote:

Hi Mike,

Thanks for the detailed reply. Two quick questions/comments:

1. For "task", you mean a = specific mapper instance, or a specific reducer = instance?

Either. 

2. "However, I do not believe that a separate Task could = connect with the JT and see if the counter exists or if it could get a value or even an=20 accurate value since the updates are asynchronous." -- do you mean if a = mapper is updating custom counter ABC, and another mapper is updating = the same customer counter ABC, their counter values are updated = independently by different mappers, and will not published (aggregated) = externally until job completed successfully?

I meant that if a Task created and updated a counter, a different Task has access to that counter.

To give you an example: if I want to count the number of quality errors and then fail after X number of errors, I can't use global counters to do this.
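One workaround (a sketch with invented names and threshold) is to fail fast per task: each task counts its own errors locally and aborts once its own split looks bad enough, since it can't read the job-wide aggregate:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FailFastMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final long LOCAL_ERROR_LIMIT = 1000;  // hypothetical per-task cap
    private long localErrors = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {  // invented error test
            context.getCounter("QC", "BAD_RECORDS").increment(1);
            if (++localErrors > LOCAL_ERROR_LIMIT) {
                // A task can only see its own tally, so the best it can do is
                // fail itself; repeated task failures then fail the whole job.
                throw new IOException("Too many bad records in this task's split");
            }
            return;
        }
        context.write(value, value);
    }
}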

regards,
Lin

On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <michael_segel@hotmail.com> wrote:
As I understand it... each Task has its own counters, and they are independently updated. As the tasks report back to the JT, they update the counters' status.
The JT will then aggregate them.

In terms of performance, counters take up some memory in the JT, so while it's OK to use them, if you abuse them you can run into issues.
As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (number of TTs), and the number of counters.
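On the limits point, there is also a hard cap on the number of distinct counters a job may create. The property below is the Hadoop 2.x name (mapreduce.job.counters.max, default 120); 1.x used mapreduce.job.counters.limit, so treat this as version-dependent (a sketch):

import org.apache.hadoop.conf.Configuration;

public class CounterLimit {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Raise the per-job counter cap; check your Hadoop version for the
        // exact property name and whether it is honored cluster- or job-side.
        conf.setInt("mapreduce.job.counters.max", 200);
    }
}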

In terms of global accessibility... Maybe.

The reason I say maybe is that I'm not sure what you mean by globally accessible.
If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous. Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained.) Note, I haven't looked at the source code, so I am probably wrong.

HTH
Mike
On Oct 19, 2012, at 5:50 AM, Lin Ma <linlma@gmail.com> wrote:

Hi guys,

I have some quick questions regarding Hadoop counters:

  • Is a Hadoop counter (custom defined) globally accessible (for both read and write) to all Mappers and Reducers in a job?
  • What are the performance implications and best practices of using Hadoop counters? I am not sure whether using Hadoop counters too heavily will degrade the performance of the whole job.
regards,
Lin