hadoop-yarn-dev mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: question about cpu utilization
Date Wed, 08 May 2013 16:34:10 GMT
Yes, it all happens in parallel, even within a single task.
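A minimal sketch of the sort-buffer settings involved here, assuming Hadoop 2.x (YARN-era) property names; older releases use `io.sort.mb` and `io.sort.spill.percent` instead, and the values below are illustrative, not recommendations. Sizing the buffer to hold the entire map output is what lets a job sort/spill only once:

```xml
<!-- mapred-site.xml: enlarge the map-side sort buffer -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value> <!-- in-memory buffer for map output, in MB -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value> <!-- buffer fill fraction that triggers a background spill -->
</property>
```

If the buffer fills more than once, each overflow produces a spill file that must later be merged on disk, adding IO alongside the CPU cost of sorting.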

On 5/8/13 11:17 AM, "牛兆捷" <nzjemail@gmail.com> wrote:

>I forgot to say: to see the behavior of a single task, I run just one map
>task for a 1 GB input split (I set the block size to 1 GB).
>2013/5/9 Robert Evans <evans@yahoo-inc.com>
>> Deciding on the input split happens in the client.  Each map process
>> opens the input file and seeks to the appropriate offset in the file.
>> At that point it reads each entry one at a time and sends it to the map
>> task.  The output of the map task is placed in a buffer.  When the
>> buffer gets close to full, the data is sorted and spilled out to disk
>> in parallel with the map task still running.  It is hard to get CPU
>> time for the different parts because they are all happening in
>> parallel. If you have enough RAM to store the entire output in memory,
>> and you have configured your sort buffer to be able to hold it all,
>> then you will probably only sort/spill once.
>> --Bobby
>> On 5/8/13 10:25 AM, "牛兆捷" <nzjemail@gmail.com> wrote:
>> >I looked at the application container log to trace the map-reduce job.
>> >
>> >For a map task, I find there are mainly 3 phases: split input, sort,
>> >and spill out.
>> >I set enough memory to make sure the input can stay in memory.
>> >
>> >Initially, I thought the highest CPU utilization would appear in the
>> >sort phase, because the other two phases focus on IO; however, it
>> >doesn't behave as I thought. On the contrary, the CPU utilization
>> >during the other phases is higher.
>> >
>> >Does anyone know the reason?
>> >
>> >--
>> >Sincerely,
>> >Zhaojie
