hadoop-yarn-dev mailing list archives

From 牛兆捷 <nzjem...@gmail.com>
Subject Re: question about cpu utilization
Date Wed, 08 May 2013 16:48:16 GMT
Just for simplicity, I run only one map task on a small input, such as 256 MB,
and I set my io.sort.memory to more than 512 MB to make sure all the input can
stay in memory. I also checked the log to make sure that only one spill
happens, for the final flush.
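
As a rough check of the sizing above, the sketch below computes the point at which a sort buffer of a given size would start spilling, assuming the spill threshold is a fixed fraction of the buffer. The class and method names are hypothetical, the 0.80 fraction is an assumed value (it matches the documented default of mapreduce.map.sort.spill.percent, but check your version), and the calculation ignores any per-record bookkeeping that also occupies buffer space.

```java
final class SpillThreshold {
    private static final long MB = 1024L * 1024L;

    // Bytes of buffered map output at which a background spill would start,
    // assuming the threshold is simply bufferSize * spillFraction.
    static long spillThresholdBytes(int sortBufferMb, double spillFraction) {
        return (long) (sortBufferMb * MB * spillFraction);
    }

    // True if the whole map output stays below the threshold, so the only
    // spill would be the final flush when the map task finishes.
    static boolean fitsBeforeFirstSpill(long mapOutputBytes, int sortBufferMb,
                                        double spillFraction) {
        return mapOutputBytes <= spillThresholdBytes(sortBufferMb, spillFraction);
    }

    public static void main(String[] args) {
        long output = 256 * MB;   // assumed map output size
        int bufferMb = 512;       // assumed sort buffer size
        double fraction = 0.80;   // assumed spill fraction
        System.out.println("threshold bytes: " + spillThresholdBytes(bufferMb, fraction));
        System.out.println("single spill expected: "
                + fitsBeforeFirstSpill(output, bufferMb, fraction));
    }
}
```

The quantity that matters here is the size of the map output, which need not equal the size of the input split.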

So I think the different phases run one after another, but the CPU utilization
is not what I expected.
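
To illustrate how sort/spill work can overlap with map work, as Bobby describes in the quoted message below, here is a minimal sketch. It is not Hadoop code: the class, the method names, and the single-threaded executor standing in for the spill thread are all assumptions made for illustration. It also shows one way to attribute CPU time per thread with the standard ThreadMXBean, which may report -1 on JVMs where thread CPU timing is disabled.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SpillSketch {
    private static final ThreadMXBean MX = ManagementFactory.getThreadMXBean();

    // Sorts one handed-off buffer (standing in for sort + write to disk) and
    // returns the CPU nanoseconds consumed by the thread that ran the sort.
    static long sortAndMeasure(List<Integer> buffer) {
        long start = MX.getCurrentThreadCpuTime();
        Collections.sort(buffer);
        return MX.getCurrentThreadCpuTime() - start;
    }

    // Queues a full buffer on the background spill thread.
    private static Future<Long> handOff(ExecutorService spiller, List<Integer> full) {
        Callable<Long> task = () -> sortAndMeasure(full);
        return spiller.submit(task);
    }

    // Produces records on the calling thread, hands each full buffer to the
    // spill thread, and keeps producing while the spill thread sorts.
    // Returns {number of spills, producer CPU ns, spill-thread CPU ns}.
    static long[] run(int records, int bufferCapacity) {
        ExecutorService spiller = Executors.newSingleThreadExecutor();
        List<Future<Long>> pending = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>(bufferCapacity);
        Random rnd = new Random(42);
        long start = MX.getCurrentThreadCpuTime();
        try {
            for (int i = 0; i < records; i++) {
                buffer.add(rnd.nextInt());            // the "map output"
                if (buffer.size() == bufferCapacity) {
                    pending.add(handOff(spiller, buffer));
                    buffer = new ArrayList<>(bufferCapacity);
                }
            }
            if (!buffer.isEmpty()) {
                pending.add(handOff(spiller, buffer)); // final flush
            }
            long producerCpu = MX.getCurrentThreadCpuTime() - start;
            long spillCpu = 0;
            for (Future<Long> f : pending) {
                spillCpu += f.get();
            }
            return new long[] { pending.size(), producerCpu, spillCpu };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            spiller.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] r = run(2_000_000, 500_000);
        System.out.println("spills=" + r[0] + " producerCpuNs=" + r[1]
                + " spillCpuNs=" + r[2]);
    }
}
```

When the buffer capacity is at least the number of records, the sketch performs a single spill at the final flush, which is the configuration I describe above.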


2013/5/9 牛兆捷 <nzjemail@gmail.com>

> I have enough memory, so there will be only one sort and spill. Why would
> they happen in parallel?
>
>
> 2013/5/9 Robert Evans <evans@yahoo-inc.com>
>
>> Yes, it all happens in parallel, even on a single task.
>>
>> On 5/8/13 11:17 AM, "牛兆捷" <nzjemail@gmail.com> wrote:
>>
>> >I forgot to mention: to see the behavior of a single task, I ran just one
>> >map task on a 1 GB input split (I set the block size to 1 GB).
>> >
>> >
>> >2013/5/9 Robert Evans <evans@yahoo-inc.com>
>> >
>> >> Deciding on the input split happens in the client.  Each map process just
>> >> opens up the input file and seeks to the appropriate offset in the file.
>> >> At that point it reads each entry one at a time and sends it to the map
>> >> task.  The output of the map task is placed in a buffer.  When the buffer
>> >> gets close to full the data is sorted and spilled out to disk in parallel
>> >> with the map task still running.  It is hard to get CPU time for the
>> >> different parts because they are all happening in parallel. If you do have
>> >> enough RAM to store the entire output in memory and you have configured
>> >> your sort buffer to be able to hold it all then you will probably only
>> >> sort/spill once.
>> >>
>> >> --Bobby
>> >>
>> >> On 5/8/13 10:25 AM, "牛兆捷" <nzjemail@gmail.com> wrote:
>> >>
>> >> >I looked at the application container log to trace the map-reduce
>> >> >application.
>> >> >
>> >> >For a map task, I find there are mainly 3 phases: split input, sort, and
>> >> >spill out.
>> >> >I set enough memory to make sure the input can stay in memory.
>> >> >
>> >> >Initially, I thought the highest CPU utilization would appear in the sort
>> >> >phase, because the other two phases focus on I/O. However, it doesn't
>> >> >behave as I expected. On the contrary, the CPU utilization during the
>> >> >other phases is higher.
>> >> >
>> >> >Does anyone know the reason?
>> >> >
>> >> >--
>> >> >*Sincerely,*
>> >> >*Zhaojie*
>> >>
>> >>
>> >
>> >
>> >--
>> >*Sincerely,*
>> >*Zhaojie*
>>
>>
>
>
> --
> *Sincerely,*
> *Zhaojie*
>



-- 
*Sincerely,*
*Zhaojie*
