hadoop-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Hadoop counter
Date Sat, 20 Oct 2012 07:12:09 GMT

On Oct 19, 2012, at 10:27 PM, Lin Ma <linlma@gmail.com> wrote:

> Thanks for the detailed reply Mike, I learned a lot from the discussion.
> 
> - I just want to confirm with you that, supposing in the same job, when a specific task
> completes (and its counters are aggregated in the JT after the task completes, per our discussion?),
> the other tasks still running in the same job cannot get the updated counter value from that
> completed task? I am asking because I am wondering whether I can use a counter to share
> a global value between tasks.

Yes, that is correct. 
While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy way for a task to
query the job tracker. This might have changed in YARN.

> - If so, what is the traditional use case of counters: only using counter values after the
> whole job completes?
> 
Yes, the counters are used to provide data at the end of the job... 
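
Something like this, roughly (a sketch only; the class name is made up, and I'm assuming the new org.apache.hadoop.mapreduce API): the driver blocks until the job finishes, then walks every counter group, built-in and user-defined, and prints the aggregated totals.

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.CounterGroup;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class PrintJobCounters {
        // Dump every counter group once the job is done.  Until
        // waitForCompletion() returns, the aggregated values aren't reliable.
        public static void printAll(Job job) throws Exception {
            job.waitForCompletion(true);          // block until the job finishes
            Counters counters = job.getCounters();
            for (CounterGroup group : counters) {
                System.out.println(group.getDisplayName());
                for (Counter counter : group) {
                    System.out.println("  " + counter.getDisplayName()
                            + " = " + counter.getValue());
                }
            }
        }
    }
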
> BTW: I'd appreciate it if you could share a few use cases from your experience about how
> counters are used.
> 
Well you have your typical job data like the number of records processed, total number of
bytes read,  bytes written... 

But suppose you wanted to do some quality control on your input. 
So you need to keep track of the count of bad records. If this job is part of a larger process,
you may want to include business logic in your job to halt the job flow if X% of the records
contain bad data. 
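
To make that concrete, here's a rough sketch of what I have in mind (new-API style; the class names, the 5% threshold, and the "fewer than 3 fields means bad" rule are all invented for the example):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class QualityCheckJob {

        // User-defined counters; every task increments its own copy and the
        // framework aggregates them as the tasks report in.
        public enum Quality { TOTAL_RECORDS, BAD_RECORDS }

        public static class QcMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.getCounter(Quality.TOTAL_RECORDS).increment(1);
                String[] fields = value.toString().split(",");
                if (fields.length < 3) {                  // hypothetical validity rule
                    context.getCounter(Quality.BAD_RECORDS).increment(1);
                    return;                               // drop the bad record
                }
                context.write(value, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "quality-check");
            job.setJarByClass(QualityCheckJob.class);
            job.setMapperClass(QcMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean ok = job.waitForCompletion(true);

            // The business logic lives in the driver: the counters are only
            // trustworthy here, after the job has completed.
            long total = job.getCounters().findCounter(Quality.TOTAL_RECORDS).getValue();
            long bad   = job.getCounters().findCounter(Quality.BAD_RECORDS).getValue();
            if (total > 0 && bad * 100.0 / total > 5.0) {
                System.err.println("Too many bad records (" + bad + "/" + total
                        + "), halting the flow.");
                System.exit(1);
            }
            System.exit(ok ? 0 : 1);
        }
    }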

Or your process takes input records and, in processing them, sorts them based on
some characteristic, and you want to count those sorted records as you process them. 

For a more concrete example, the Illinois Tollway has these 'fast pass' lanes where cars equipped
with RFID tags can have the tolls automatically deducted from their accounts rather than paying
the toll manually each time. 

Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are cheaters who
drive through the sensor without it capturing an RFID tag. (Note it's possible that
you have a false positive where the car has an RFID chip but doesn't trip the sensor.) Pushing
the data through a map/reduce job would require the use of counters.
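
Something along these lines, very roughly (the field layout and names are invented). This one uses dynamically named counters so you get one per lane; keep the set of distinct lanes small, since every distinct counter costs the JT memory, as mentioned earlier.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TollLaneMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical CSV layout: laneId,timestamp,rfidTag (empty if not read)
            String[] fields = value.toString().split(",", -1);
            String laneId = fields[0];
            String rfidTag = fields.length > 2 ? fields[2] : "";

            context.getCounter("TollQuality", "VEHICLES").increment(1);
            if (rfidTag.isEmpty()) {
                // Possible cheater -- or a false positive where the tag didn't
                // trip the sensor.  One dynamically named counter per lane.
                context.getCounter("TollQuality", "NO_TAG_" + laneId).increment(1);
            }
            context.write(value, NullWritable.get());
        }
    }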

Does that help? 

-Mike

> regards,
> Lin
> 
> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <michael_segel@hotmail.com> wrote:
> Yeah, sorry... 
> 
> I meant that if you were dynamically creating a counter foo in the Mapper task, then
> each mapper would be creating its own counter foo. 
> As the job runs, these counters will eventually be sent up to the JT. The job tracker
> would keep a separate counter for each task. 
> 
> At the end, the final count is aggregated from the list of counters for foo. 
> 
> 
> I don't know how you can get a task to ask the Job Tracker how things
> are going in other tasks. That is what I meant when I said you couldn't get information about the
> other counters or even the status of the other tasks running in the same job. 
> 
> I didn't see anything in the APIs that allowed for that type of flow... Of course, having
> said that... someone will pop up with a way to do just that. ;-) 
> 
> 
> Does that clarify things? 
> 
> -Mike
> 
> 
> On Oct 19, 2012, at 11:56 AM, Lin Ma <linlma@gmail.com> wrote:
> 
>> Hi Mike,
>> 
>> Sorry I am a bit lost... As you are thinking faster than me. :-P
>> 
>> From your statement "It would make sense that the JT maintains a unique counter
>> for each task until the tasks complete." -- it seems each task cannot see counters from the
>> others, since the JT maintains a unique counter for each task;
>> 
>> From your comment "I meant that if a Task created and updated a counter, a different
>> Task has access to that counter." -- it seems different tasks could share/access the same
>> counter.
>> 
>> I'd appreciate it if you could help clarify a bit.
>> 
>> regards,
>> Lin
>> 
>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>> 
>> On Oct 19, 2012, at 11:27 AM, Lin Ma <linlma@gmail.com> wrote:
>> 
>>> Hi Mike,
>>> 
>>> Thanks for the detailed reply. Two quick questions/comments,
>>> 
>>> 1. For "task", you mean a specific mapper instance, or a specific reducer instance?
>> 
>> Either. 
>> 
>>> 2. "However, I do not believe that a separate Task could connect with the JT
and see if the counter exists or if it could get a value or even an accurate value since the
updates are asynchronous." -- do you mean if a mapper is updating custom counter ABC, and
another mapper is updating the same customer counter ABC, their counter values are updated
independently by different mappers, and will not published (aggregated) externally until job
completed successfully?
>>> 
>> I meant that if a Task created and updated a counter, a different Task has access
>> to that counter. 
>> 
>> To give you an example, if I want to count the number of quality errors and then
>> fail after X number of errors, I can't use global counters to do this.
>> 
>>> regards,
>>> Lin
>>> 
>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>> As I understand it... each Task has its own counters and they are independently updated.
>>> As they report back to the JT, they update the counter(s)' status.
>>> The JT then will aggregate them. 
>>> 
>>> In terms of performance, counters take up some memory in the JT, so while it's
>>> OK to use them, if you abuse them, you can run into issues. 
>>> As to limits... I guess that will depend on the amount of memory on the JT machine,
>>> the size of the cluster (number of TTs), and the number of counters. 
>>> 
>>> In terms of global accessibility... Maybe.
>>> 
>>> The reason I say maybe is that I'm not sure what you mean by globally accessible.
>>> If a task creates and implements a dynamic counter... I know that it will eventually
>>> be reflected in the JT. However, I do not believe that a separate Task could connect with
>>> the JT and see if the counter exists or if it could get a value or even an accurate value
>>> since the updates are asynchronous. Not to mention that I don't believe that the counters
>>> are aggregated until the job ends. It would make sense that the JT maintains a unique counter
>>> for each task until the tasks complete. (If a task fails, it would have to delete its counters
>>> so that when the task is restarted the correct count is maintained.) Note, I haven't looked
>>> at the source code so I am probably wrong. 
>>> 
>>> HTH
>>> Mike
>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <linlma@gmail.com> wrote:
>>> 
>>>> Hi guys,
>>>> 
>>>> I have some quick questions regarding Hadoop counters:
>>>> 
>>>> Are Hadoop counters (custom defined) globally accessible (for both read and
>>>> write) by all Mappers and Reducers in a job?
>>>> What are the performance implications and best practices of using Hadoop counters? If
>>>> counters are used too heavily, will there be a performance penalty for the whole
>>>> job?
>>>> regards,
>>>> Lin
>>> 
>>> 
>> 
>> 
> 
> 

