hadoop-user mailing list archives

From Lin Ma <lin...@gmail.com>
Subject Re: Hadoop counter
Date Sun, 21 Oct 2012 06:45:42 GMT
Thanks for the detailed reply, Mike. Yes, you have resolved most of my
confusion. The last two questions (or comments) are just to confirm that my
understanding is correct:

- Is it a normal use case or best practice for a job to automatically
consume/read the counters from a previously completed job? I ask because I
am not sure whether the main use of counters is human reading and manual
analysis, rather than having another job consume the counters
automatically.
- I want to confirm my understanding: each time a task completes, the JT
aggregates/updates the global counter values from the counter values
reported by the completed task, but never exposes the global counter values
until the job completes? If that is correct, I am wondering why the JT
aggregates each time a task completes, rather than doing a one-time
aggregation when the job completes. Is there a design reason for this?
Thanks.
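
A minimal sketch of the bookkeeping discussed in this thread, assuming the model Mike describes further down (the JT keeps a separate counter per task, drops the counters of a failed task, and sums them for the job-level value). This is a toy model in plain Java, not Hadoop's actual implementation; the class name, method names, task IDs and the BAD_RECORDS counter are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not Hadoop code) of a tracker that keeps one counter map per task. */
class CounterTrackerSketch {
    // taskId -> (counterName -> latest value reported by that task)
    private final Map<String, Map<String, Long>> perTask = new HashMap<>();

    /** A task reports its own running total; this replaces that task's earlier report. */
    void report(String taskId, String counter, long value) {
        perTask.computeIfAbsent(taskId, id -> new HashMap<>()).put(counter, value);
    }

    /** Drop a failed task's counters so that a restarted task does not double count. */
    void discard(String taskId) {
        perTask.remove(taskId);
    }

    /** Job-level value: the sum of the per-task values for one counter name. */
    long aggregate(String counter) {
        long total = 0;
        for (Map<String, Long> counters : perTask.values()) {
            total += counters.getOrDefault(counter, 0L);
        }
        return total;
    }

    public static void main(String[] args) {
        CounterTrackerSketch jt = new CounterTrackerSketch();
        jt.report("map_0", "BAD_RECORDS", 3);
        jt.report("map_1", "BAD_RECORDS", 5);
        jt.report("map_2", "BAD_RECORDS", 4);
        jt.discard("map_2"); // map_2 failed; its partial count is thrown away
        System.out.println(jt.aggregate("BAD_RECORDS")); // prints 8
    }
}
```

In this model, keeping the per-task values separate is what makes it cheap to discard a failed task's contribution; whether the real JT sums on every report or only at the end is exactly the open question above.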


On Sat, Oct 20, 2012 at 3:12 PM, Michael Segel <michael_segel@hotmail.com> wrote:

> On Oct 19, 2012, at 10:27 PM, Lin Ma <linlma@gmail.com> wrote:
> Thanks for the detailed reply Mike, I learned a lot from the discussion.
> - I just want to confirm with you that, supposing in the same job, when a
> specific task completed (and counter is aggregated in JT after the task
> completed from our discussion?), the other running task in the same job
> cannot get the updated counter value from the previous completed task? I am
> asking this because I am thinking whether I can use counter to share a
> global value between tasks.
> Yes that is correct.
> While I haven't looked at YARN (M/R 2.0), M/R 1.x doesn't have an easy
> way for a task to query the job tracker. This might have changed in YARN.
> - If so, what is the traditional use case of counter, only use counter
> values after the whole job completes?
> Yes the counters are used to provide data at the end of the job...
> BTW: I would appreciate it if you could share a few use cases from your
> experience of how counters are used.
> Well you have your typical job data like the number of records processed,
> total number of bytes read,  bytes written...
> But suppose you wanted to do some quality control on your input.
> So you need to keep track of the count of bad records.  If this job is
> part of a process, you may want to include business logic in your job to
> halt the job flow if X% of the records contain bad data.
> Or your process takes input records and, in processing them, sorts the
> records based on some characteristic, and you want to count those sorted
> records as you process them.
> For a more concrete example, the Illinois Tollway has these 'fast pass'
> lanes where cars equipped with RFID tags can have the tolls automatically
> deducted from their accounts rather than pay the toll manually each time.
> Suppose we wanted to determine how many cars in the 'Fast Pass' lanes are
> cheaters where they drive through the sensor and the sensor doesn't capture
> the RFID tag. (Note it's possible that you have a false positive where the
> car has an RFID chip but doesn't trip the sensor.) Pushing the data in a
> map/reduce job would require the use of counters.
> Does that help?
> -Mike
> regards,
> Lin
> On Sat, Oct 20, 2012 at 5:05 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>> Yeah, sorry...
>> I meant that if you were dynamically creating a counter foo in the Mapper
>> task, then each mapper would be creating their own counter foo.
>> As the job runs, these counters will eventually be sent up to the JT. The
>> job tracker would keep a separate counter for each task.
>> At the end, the final count is aggregated from the list of counters for
>> foo.
>> I don't know how you can get a task to ask information from the Job
>> Tracker on how things are going in other tasks.  That is what I meant that
>> you couldn't get information about the other counters or even the status of
>> the other tasks running in the same job.
>> I didn't see anything in the APIs that allowed for that type of flow...
>> Of course having said that... someone pops up with a way to do just that.
>> ;-)
>> Does that clarify things?
>> -Mike
>> On Oct 19, 2012, at 11:56 AM, Lin Ma <linlma@gmail.com> wrote:
>> Hi Mike,
>> Sorry, I am a bit lost... You are thinking faster than me. :-P
>> From this statement of yours, "It would make sense that the JT maintains a
>> unique counter for each task until the tasks complete." -- it seems each
>> task cannot see the counters of other tasks, since the JT maintains a
>> unique counter for each task;
>> From this comment of yours, "I meant that if a Task created and updated a
>> counter, a different Task has access to that counter. " -- it seems
>> different tasks could share/access the same counter.
>> I would appreciate it if you could help clarify a bit.
>> regards,
>> Lin
>> On Sat, Oct 20, 2012 at 12:42 AM, Michael Segel <
>> michael_segel@hotmail.com> wrote:
>>> On Oct 19, 2012, at 11:27 AM, Lin Ma <linlma@gmail.com> wrote:
>>> Hi Mike,
>>> Thanks for the detailed reply. Two quick questions/comments,
>>> 1. For "task", you mean a specific mapper instance, or a specific
>>> reducer instance?
>>> Either.
>>> 2. "However, I do not believe that a separate Task could connect with
>>> the JT and see if the counter exists or if it could get a value or even an
>>> accurate value since the updates are asynchronous." -- do you mean if a
>>> mapper is updating custom counter ABC, and another mapper is updating the
>>> same custom counter ABC, their counter values are updated independently
>>> by the different mappers, and will not be published (aggregated) externally
>>> until the job has completed successfully?
>>> I meant that if a Task created and updated a counter, a different Task
>>> has access to that counter.
>>> To give you an example, if I want to count the number of quality errors
>>> and then fail after X number of errors, I can't use Global counters to do
>>> this.
>>> regards,
>>> Lin
>>> On Fri, Oct 19, 2012 at 10:35 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>> As I understand it... each Task has its own counters, which are
>>>> updated independently. As the tasks report back to the JT, they update
>>>> the counter(s)' status.
>>>> The JT then will aggregate them.
>>>> In terms of performance, Counters take up some memory in the JT so
>>>> while it's OK to use them, if you abuse them, you can run into issues.
>>>> As to limits... I guess that will depend on the amount of memory on the
>>>> JT machine, the size of the cluster (Number of TT) and the number of
>>>> counters.
>>>> In terms of global accessibility... Maybe.
>>>> The reason I say maybe is that I'm not sure what you mean by
>>>> globally accessible.
>>>> If a task creates and implements a dynamic counter... I know that it
>>>> will eventually be reflected in the JT. However, I do not believe that a
>>>> separate Task could connect with the JT and see if the counter exists or
>>>> if it could get a value or even an accurate value since the updates are
>>>> asynchronous.  Not to mention that I don't believe that the counters are
>>>> aggregated until the job ends. It would make sense that the JT maintains
>>>> a unique counter for each task until the tasks complete. (If a task fails,
>>>> it would have to delete the counters so that when the task is restarted the
>>>> correct count is maintained. )  Note, I haven't looked at the source code
>>>> so I am probably wrong.
>>>> HTH
>>>> Mike
>>>> On Oct 19, 2012, at 5:50 AM, Lin Ma <linlma@gmail.com> wrote:
>>>> Hi guys,
>>>> I have some quick questions regarding Hadoop counters,
>>>>    - Is a Hadoop counter (custom defined) globally accessible (for both
>>>>    read and write) to all Mappers and Reducers in a job?
>>>>    - What are the performance implications and best practices of using
>>>>    Hadoop counters? I am not sure whether using Hadoop counters too heavily
>>>>    will degrade the performance of the whole job.
>>>> regards,
>>>> Lin
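
The quality-control use case Mike describes in this thread (halt the job flow if X% of the records are bad) can be sketched as a check the driver runs on the final counter values once the job has completed. This is only an illustration in plain Java: the class, the method and the counter names TOTAL_RECORDS and BAD_RECORDS are hypothetical, and treating an empty input as a failure is an assumption. In a real driver the values would presumably be read from the completed job's counters (for example via `Job.getCounters()` in the `org.apache.hadoop.mapreduce` API) rather than from a hand-built map.

```java
import java.util.HashMap;
import java.util.Map;

/** Driver-side gate (illustrative only) applied after a job has finished. */
class BadRecordGateSketch {
    /**
     * Returns true when the job flow may continue, i.e. the share of bad
     * records is at or below maxBadPercent. Counter names are hypothetical.
     */
    static boolean withinThreshold(Map<String, Long> finalCounters, double maxBadPercent) {
        long total = finalCounters.getOrDefault("TOTAL_RECORDS", 0L);
        long bad = finalCounters.getOrDefault("BAD_RECORDS", 0L);
        if (total == 0) {
            return false; // assumption: no input at all also stops the flow
        }
        double badPercent = 100.0 * bad / total;
        return badPercent <= maxBadPercent;
    }

    public static void main(String[] args) {
        // Stand-in for the counters of a completed job.
        Map<String, Long> counters = new HashMap<>();
        counters.put("TOTAL_RECORDS", 1000L);
        counters.put("BAD_RECORDS", 30L);
        // 3% bad records against a 5% limit, so the next job may start.
        System.out.println(withinThreshold(counters, 5.0) ? "continue" : "halt");
    }
}
```

Because the check uses only final values, it fits Mike's point that counters are read at the end of the job; it does not let one task stop the job mid-run based on what other tasks have counted.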
