From: Robert Evans <evans@yahoo-inc.com>
To: yarn-dev@hadoop.apache.org
Subject: Re: question about cpu utilization
Date: Wed, 8 May 2013 19:30:34 +0000

I am really not sure what is happening. Try profiling your task.

--Bobby

On 5/8/13 11:48 AM, "Zhaojie" wrote:

>Just for simplicity, I run only one map task on a small input (about
>256 MB), and I set io.sort.mb to more than 512 MB to make sure all of
>the map output can stay in memory. I also checked the log to make sure
>there is just one spill for flushing.
>
>So I think the different parts run one after another, but the CPU
>utilization is not what I expected.
>
>2013/5/9 Zhaojie
>
>> I have enough memory, so there will be only one sort and spill. Why
>> would they happen in parallel?
>>
>> 2013/5/9 Robert Evans
>>
>>> Yes, it all happens in parallel, even in a single task.
>>>
>>> On 5/8/13 11:17 AM, "Zhaojie" wrote:
>>>
>>>> I forgot to say: to see the behavior of a single task, I just run
>>>> one map task on a 1 GB input split (I set the block size to 1 GB).
>>>>
>>>> 2013/5/9 Robert Evans
>>>>
>>>>> Deciding on the input split happens in the client. Each map
>>>>> process just opens the input file and seeks to the appropriate
>>>>> offset in the file. At that point it reads each entry one at a
>>>>> time and sends it to the map task. The output of the map task is
>>>>> placed in a buffer. When the buffer gets close to full, the data
>>>>> is sorted and spilled out to disk in parallel with the map task
>>>>> still running. It is hard to get CPU time for the different parts
>>>>> because they are all happening in parallel. If you do have enough
>>>>> RAM to store the entire output in memory, and you have configured
>>>>> your sort buffer to be able to hold it all, then you will
>>>>> probably only sort/spill once.
>>>>>
>>>>> --Bobby
>>>>>
>>>>> On 5/8/13 10:25 AM, "Zhaojie" wrote:
>>>>>
>>>>>> I looked at the application container log to trace the MapReduce
>>>>>> application.
>>>>>>
>>>>>> For a map task, I find there are mainly three phases: split
>>>>>> input, sort, and spill out. I set enough memory to make sure the
>>>>>> input can stay in memory.
>>>>>>
>>>>>> Initially, I thought the highest CPU utilization would appear in
>>>>>> the sort phase, because the other two phases are focused on I/O.
>>>>>> However, it doesn't behave as I expected; on the contrary, the
>>>>>> CPU utilization during the other phases is higher.
>>>>>>
>>>>>> Does anyone know the reason?
>>>>>>
>>>>>> --
>>>>>> Sincerely,
>>>>>> Zhaojie
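
For anyone reproducing the single-spill setup discussed in this thread, the sort-buffer knobs can be passed as job-level overrides on the command line. This is a hedged sketch using the Hadoop 1.x-era property names that the thread itself uses; `my-job.jar` and `MyJob` are hypothetical placeholders, and newer releases rename `io.sort.mb` to `mapreduce.task.io.sort.mb`:

```shell
# Sketch (not a definitive invocation): give the map-side sort buffer
# enough room to hold all map output, so only the final flush spills.
# io.sort.mb is the pre-YARN property name; check the defaults for
# your version (mapreduce.task.io.sort.mb on current releases).
hadoop jar my-job.jar MyJob \
  -D io.sort.mb=512 \
  -D io.sort.spill.percent=0.95 \
  input/ output/

# Afterwards, compare the job counters: with a single spill, the
# "Spilled Records" counter should equal "Map output records".
```

Checking the counters is a quicker confirmation than grepping the task log for spill messages, since a background spill triggered by `io.sort.spill.percent` would inflate "Spilled Records" above the map output count.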
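
Bobby's suggestion to profile the task can be done with MapReduce's built-in per-task profiling, which attaches HPROF to selected task attempts. A minimal sketch, again assuming the Hadoop 1.x-era property names and a hypothetical job jar (newer releases use the `mapreduce.task.profile.*` names):

```shell
# Sketch: profile only the first map attempt of the job. The HPROF
# output (profile.out) lands alongside the task attempt's logs, and
# its cpu=samples section shows where the map task spends CPU time.
hadoop jar my-job.jar MyJob \
  -D mapred.task.profile=true \
  -D mapred.task.profile.maps=0-0 \
  input/ output/
```

Since the read, map, sort, and spill work all overlap inside one JVM, a sampling profile like this is about the only way to attribute CPU time to the individual phases.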