From: Robert Evans <evans@yahoo-inc.com>
To: yarn-dev@hadoop.apache.org
Subject: Re: question about cpu utilization
Date: Wed, 8 May 2013 19:30:34 +0000

I am really not sure what is happening. Try profiling your task.

--Bobby

On 5/8/13 11:48 AM, "Zhaojie" wrote:

>Just for simplicity, I run only one map task on a small input (about
>256 MB), and I set io.sort.mb to more than 512 MB to make sure all of
>the map output can stay in memory. I also checked the log to make sure
>there is just one spill for flushing.
>
>So I think the different parts run one after another, but the CPU
>utilization is not what I expected.
>
>2013/5/9 Zhaojie
>
>> I have enough memory, so there will be only one sort and spill. Why
>> would they happen in parallel?
>>
>> 2013/5/9 Robert Evans
>>
>>> Yes, it all happens in parallel, even in a single task.
>>>
>>> On 5/8/13 11:17 AM, "Zhaojie" wrote:
>>>
>>>> I forgot to say: to see the behavior of a single task, I just run
>>>> one map task on a 1 GB input split (I set the block size to 1 GB).
>>>>
>>>> 2013/5/9 Robert Evans
>>>>
>>>>> Deciding on the input split happens in the client. Each map
>>>>> process just opens the input file and seeks to the appropriate
>>>>> offset in the file. At that point it reads each entry one at a
>>>>> time and sends it to the map task. The output of the map task is
>>>>> placed in a buffer. When the buffer gets close to full, the data
>>>>> is sorted and spilled out to disk in parallel with the map task
>>>>> still running. It is hard to get CPU time for the different parts
>>>>> because they are all happening in parallel. If you do have enough
>>>>> RAM to store the entire output in memory, and you have configured
>>>>> your sort buffer to be able to hold it all, then you will
>>>>> probably only sort/spill once.
>>>>>
>>>>> --Bobby
>>>>>
>>>>> On 5/8/13 10:25 AM, "Zhaojie" wrote:
>>>>>
>>>>>> I looked at the application container log to trace the MapReduce
>>>>>> application.
>>>>>>
>>>>>> For a map task, I find there are mainly three phases: split
>>>>>> input, sort, and spill out. I set enough memory to make sure the
>>>>>> input can stay in memory.
>>>>>>
>>>>>> Initially, I thought the highest CPU utilization would appear in
>>>>>> the sort phase, because the other two phases are focused on I/O.
>>>>>> However, it doesn't behave as I expected; on the contrary, the
>>>>>> CPU utilization during the other phases is higher.
>>>>>>
>>>>>> Does anyone know the reason?
>>>>>>
>>>>>> --
>>>>>> Sincerely,
>>>>>> Zhaojie
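
For anyone reproducing the single-spill setup discussed in this thread, the sort-buffer knobs can be passed as job-level overrides on the command line. This is a hedged sketch using the Hadoop 1.x-era property names that the thread itself uses; `my-job.jar` and `MyJob` are hypothetical placeholders, and newer releases rename `io.sort.mb` to `mapreduce.task.io.sort.mb`:

```shell
# Sketch (not a definitive invocation): give the map-side sort buffer
# enough room to hold all map output, so only the final flush spills.
# io.sort.mb is the pre-YARN property name; check the defaults for
# your version (mapreduce.task.io.sort.mb on current releases).
hadoop jar my-job.jar MyJob \
  -D io.sort.mb=512 \
  -D io.sort.spill.percent=0.95 \
  input/ output/

# Afterwards, compare the job counters: with a single spill, the
# "Spilled Records" counter should equal "Map output records".
```

Checking the counters is a quicker confirmation than grepping the task log for spill messages, since a background spill triggered by `io.sort.spill.percent` would inflate "Spilled Records" above the map output count.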
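
Bobby's suggestion to profile the task can be done with MapReduce's built-in per-task profiling, which attaches HPROF to selected task attempts. A minimal sketch, again assuming the Hadoop 1.x-era property names and a hypothetical job jar (newer releases use the `mapreduce.task.profile.*` names):

```shell
# Sketch: profile only the first map attempt of the job. The HPROF
# output (profile.out) lands alongside the task attempt's logs, and
# its cpu=samples section shows where the map task spends CPU time.
hadoop jar my-job.jar MyJob \
  -D mapred.task.profile=true \
  -D mapred.task.profile.maps=0-0 \
  input/ output/
```

Since the read, map, sort, and spill work all overlap inside one JVM, a sampling profile like this is about the only way to attribute CPU time to the individual phases.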