hadoop-yarn-dev mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: question about cpu utilization
Date Wed, 08 May 2013 19:30:34 GMT
Then I am really not sure what is happening.  Try profiling your task.

--Bobby
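
Because the map thread and the sort/spill thread overlap, wall-clock phases in the log do not map cleanly onto CPU cost. One way to see the split is per-thread CPU time. What follows is a minimal sketch under stated assumptions, not Hadoop code: `SpillCpuSketch`, `run` and `sortAndSpill` are hypothetical names, an in-memory `Arrays.sort` stands in for the real sort and disk write, and the buffer hands off only when completely full (Hadoop starts spilling earlier, at a configurable threshold). The only real API used is the JDK's `ThreadMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Hadoop code: a "map" thread fills a buffer while
// a background thread sorts each full buffer; CPU time is read per thread.
public class SpillCpuSketch {

    static final class Result {
        final int spills;          // number of sort-and-spill passes
        final long recordsSpilled; // total records handed to the spill thread
        final long mapCpuNanos;    // CPU time of the map thread, -1 if unsupported
        final long spillCpuNanos;  // CPU time of the spill thread, -1 if unsupported

        Result(int spills, long recordsSpilled, long mapCpuNanos, long spillCpuNanos) {
            this.spills = spills;
            this.recordsSpilled = recordsSpilled;
            this.mapCpuNanos = mapCpuNanos;
            this.spillCpuNanos = spillCpuNanos;
        }
    }

    // Sorts the first `count` entries and returns {count, cpuNanos}.
    static long[] sortAndSpill(int[] buf, int count, ThreadMXBean mx, boolean timed) {
        long start = timed ? mx.getCurrentThreadCpuTime() : 0L;
        Arrays.sort(buf, 0, count); // stands in for sort + write to disk
        long cpu = timed ? mx.getCurrentThreadCpuTime() - start : -1L;
        return new long[] {count, cpu};
    }

    // Assumes bufferCapacity > 0.
    static Result run(int records, int bufferCapacity) throws Exception {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean timed = mx.isCurrentThreadCpuTimeSupported();
        ExecutorService spillThread = Executors.newSingleThreadExecutor();
        List<Future<long[]>> pending = new ArrayList<>();
        try {
            long mapStart = timed ? mx.getCurrentThreadCpuTime() : 0L;
            int[] buffer = new int[bufferCapacity];
            int used = 0;
            long seed = 42L;
            for (int i = 0; i < records; i++) {
                // "Map" one record: produce a pseudo-random key.
                seed = seed * 6364136223846793005L + 1442695040888963407L;
                buffer[used++] = (int) (seed >>> 33);
                boolean last = (i == records - 1);
                if (used == bufferCapacity || last) {
                    // Hand the buffer to the spill thread and keep mapping.
                    final int[] full = buffer;
                    final int count = used;
                    Callable<long[]> task = () -> sortAndSpill(full, count, mx, timed);
                    pending.add(spillThread.submit(task));
                    buffer = new int[bufferCapacity];
                    used = 0;
                }
            }
            long mapCpu = timed ? mx.getCurrentThreadCpuTime() - mapStart : -1L;

            long spilled = 0;
            long spillCpu = timed ? 0L : -1L;
            for (Future<long[]> f : pending) {
                long[] r = f.get(); // wait for outstanding spills
                spilled += r[0];
                if (timed) spillCpu += r[1];
            }
            return new Result(pending.size(), spilled, mapCpu, spillCpu);
        } finally {
            spillThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Result small = run(1_000_000, 250_000);   // buffer fills four times
        Result large = run(1_000_000, 2_000_000); // everything fits: one spill
        System.out.println("small buffer: spills=" + small.spills
                + " mapCpuNs=" + small.mapCpuNanos + " spillCpuNs=" + small.spillCpuNanos);
        System.out.println("large buffer: spills=" + large.spills
                + " mapCpuNs=" + large.mapCpuNanos + " spillCpuNs=" + large.spillCpuNanos);
    }
}
```

Even in the single-spill case the sort still runs on its own thread, so comparing the two CPU figures is more informative than reading phase boundaries from the log. The absolute numbers depend on the machine and JVM.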

On 5/8/13 11:48 AM, "牛兆捷" <nzjemail@gmail.com> wrote:

>For simplicity, I run only one map task on an input of, say, 256 MB, and I
>set io.sort.mb to more than 512 MB so that all of the input can stay in
>memory. I also checked the log to make sure that only one spill happens,
>for the final flush.
>
>So I think the different phases run one after another, but the CPU
>utilization is not what I expected.
>
>
>2013/5/9 牛兆捷 <nzjemail@gmail.com>
>
>> I have enough memory, so there will be only one sort and spill. Why
>> would they happen in parallel?
>>
>>
>> 2013/5/9 Robert Evans <evans@yahoo-inc.com>
>>
>>> Yes, it all happens in parallel, even in a single task.
>>>
>>> On 5/8/13 11:17 AM, "牛兆捷" <nzjemail@gmail.com> wrote:
>>>
>>> >I forgot to say: to see the behavior of a single task, I run just one
>>> >map task on a 1 GB input split (I set the block size to 1 GB).
>>> >
>>> >
>>> >2013/5/9 Robert Evans <evans@yahoo-inc.com>
>>> >
>>> >> Deciding on the input split happens in the client.  Each map process
>>> >> just opens up the input file and seeks to the appropriate offset in
>>> >> the file.  At that point it reads each entry one at a time and sends
>>> >> it to the map task.  The output of the map task is placed in a buffer.
>>> >> When the buffer gets close to full, the data is sorted and spilled out
>>> >> to disk in parallel with the map task still running.  It is hard to
>>> >> get CPU time for the different parts because they are all happening in
>>> >> parallel.  If you do have enough RAM to store the entire output in
>>> >> memory, and you have configured your sort buffer to be able to hold it
>>> >> all, then you will probably only sort/spill once.
>>> >>
>>> >> --Bobby
>>> >>
>>> >> On 5/8/13 10:25 AM, "牛兆捷" <nzjemail@gmail.com> wrote:
>>> >>
>>> >> >I looked at the application container log to trace the MapReduce
>>> >> >application.
>>> >> >
>>> >> >For the map task, I find there are mainly 3 phases: splitting the
>>> >> >input, sorting, and spilling out.
>>> >> >I set enough memory to make sure the input can stay in memory.
>>> >> >
>>> >> >Initially, I thought the highest CPU utilization would appear in the
>>> >> >sort phase, because the other two phases focus on I/O. However, it
>>> >> >doesn't behave as I expected. On the contrary, the CPU utilization
>>> >> >during the other phases is higher.
>>> >> >
>>> >> >Does anyone know the reason?
>>> >> >
>>> >> >--
>>> >> >*Sincerely,*
>>> >> >*Zhaojie*
>>> >>
>>> >>
>>> >
>>> >
>>> >--
>>> >*Sincerely,*
>>> >*Zhaojie*
>>>
>>>
>>
>>
>> --
>> *Sincerely,*
>> *Zhaojie*
>>
>
>
>
>-- 
>*Sincerely,*
>*Zhaojie*

