Subject: Re: Why my tests shows Yarn is worse than MRv1 for terasort?
From: sam liu <samliuhadoop@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 7 Jun 2013 13:21:49 +0800

The terasort execution log shows that reduce spent about 5.5 mins from 33% to 35%, as below:
13/06/10 08:02:22 INFO mapreduce.Job:  map 100% reduce 31%
13/06/10 08:02:25 INFO mapreduce.Job:  map 100% reduce 32%
13/06/10 08:02:46 INFO mapreduce.Job:  map 100% reduce 33%
13/06/10 08:08:16 INFO mapreduce.Job:  map 100% reduce 35%
13/06/10 08:08:19 INFO mapreduce.Job:  map 100% reduce 40%
13/06/10 08:08:22 INFO mapreduce.Job:  map 100% reduce 43%
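For context on where that gap falls: the reduce progress counter in MapReduce covers three phases, roughly copy/shuffle up to 33%, sort/merge from 33% to 66%, and the actual reduce calls from 67% on, so a stall just past 33% points at the reduce-side merge rather than at the maps. A minimal sketch of the mapred-site.xml knobs usually examined for that phase (the property names are the stock MR2 ones; the values are illustrative only, not taken from this thread):

  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value><!-- default 5: fetch threads per reducer -->
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>50</value><!-- default 10: streams merged at once, so fewer merge passes -->
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
    <value>0.70</value><!-- share of the reducer heap used to hold fetched map output -->
  </property>
  <property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.80</value><!-- default 0.05: start reducers only after 80% of maps finish -->
  </property>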

Anyway, below are my configurations for your reference. Thanks!
(A) core-site.xml
only define 'fs.default.name' and 'hadoop.tmp.dir'

(B) hdfs-site.xml
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/dfs_name_dir</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/dfs_data_dir</value>
  </property>

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value><!-- 128MB -->
  </property>

  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>

  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>

(C) mapred-site.xml
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/mapreduce_temp</value>
    <description>No description</description>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/mapreduce_local_dir</value>
    <description>No description</description>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>

  <property>
    <name>mapreduce.tasktracker.outofband.heartbeat</name>
    <value>true</value>
  </property>

(D) yarn-site.xml
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>node1:18025</value>
    <description>host is the hostname of the resource manager and
    port is the port on which the NodeManagers contact the Resource Manager.
    </description>
  </property>

  <property>
    <description>The address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>node1:18088</value>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>node1:18030</value>
    <description>host is the hostname of the resourcemanager and port is the port
    on which the Applications in the cluster talk to the Resource Manager.
    </description>
  </property>

  <property>
    <name>yarn.resourcemanager.address</name>
    <value>node1:18040</value>
    <description>the host is the hostname of the ResourceManager and the port is the port on
    which the clients can talk to the Resource Manager.</description>
  </property>

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/yarn_nm_local_dir</value>
    <description>the local directories used by the nodemanager</description>
  </property>

  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:18050</value>
    <description>the nodemanagers bind to this port</description>
  </property>

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>the amount of memory on the NodeManager in MB</description>
  </property>

  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/yarn_nm_app-logs</value>
    <description>directory on hdfs where the application logs are moved to</description>
  </property>

  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/opt/hadoop-2.0.4-alpha/temp/hadoop/yarn_nm_log</value>
    <description>the directories used by Nodemanagers as log directories</description>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
    <description>shuffle service that needs to be set for Map Reduce to run</description>
  </property>
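One thing worth double-checking next to that aux-service entry: the 2.0.x cluster-setup docs pair it with an explicit handler class. A small sketch, assuming it is not already being picked up from yarn-default.xml on this build:

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>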

  <property>
    <name>yarn.resourcemanager.client.thread-count</name>
    <value>64</value>
  </property>

  <property>
    <name>yarn.nodemanager.resource.cpu-cores</name>
    <value>24</value>
  </property>

  <property>
    <name>yarn.nodemanager.vcores-pcores-ratio</name>
    <value>3</value>
  </property>

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>22000</value>
  </property>

  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
  </property>



2013/6/7 Harsh J <harsh@cloudera.com>
Not tuning configurations at all is wrong. YARN uses memory resource
based scheduling and hence MR2 would be requesting 1 GB minimum by
default, causing, on base configs, to max out at 8 (due to 8 GB NM
memory resource config) total containers. Do share your configs as at
this point none of us can tell what it is.

Obviously, it isn't our goal to make MR2 slower for users and to not care about such things :)
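To put rough numbers on the point above: with yarn.nodemanager.resource.memory-mb at its 8192 MB default and 1024 MB requested per map/reduce container, a node tops out at 8 concurrent containers, whereas the MRv1 cluster gets 8 map + 4 reduce slots per node from the tasktracker settings (which YARN ignores). A hedged sketch of how the per-container requests are usually sized against the NodeManager figure; the values are illustrative only:

  <!-- mapred-site.xml: per-container memory requests and matching JVM heaps.
       Illustrative: 22000 MB per NodeManager / 1536 MB per container allows
       roughly 14 concurrent containers per node. -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1200m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1200m</value>
  </property>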

On Fri, Jun 7, 2013 at 8:45 AM, sam liu <samliuhadoop@gmail.com> wrote:
> At the beginning, I just wanted to do a fast comparison of MRv1 and Yarn. But
> they have many differences, and to be fair for comparison I did not tune
> their configurations at all. So I got the above test results. After analyzing
> the test results, no doubt, I will configure them and do the comparison again.
>
> Do you have any idea on the current test result? I think, compared with MRv1,
> Yarn is better on the Map phase (teragen test), but worse on the Reduce
> phase (terasort test).
> And any detailed suggestions/comments/materials on Yarn performance tuning?
>
> Thanks!
>
>
> 2013/6/7 Marcos Luis Ortiz Valmaseda <marcosluis2186@gmail.com>
>>
>> Why not tune the configurations?
>> Both frameworks have many areas to tune:
>> - Combiners, Shuffle optimization, Block size, etc
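For the "Shuffle optimization" item, the usual first steps on the MR2 side are compressing map output and widening the map-side sort buffer; a small sketch (illustrative values; the Snappy codec assumes the native libraries are installed):

  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value><!-- default 100; must fit inside the map task heap -->
  </property>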
>>
>>
>>
>> 2013/6/6 sam liu <samliuhadoop@gmail.com>
>>>
>>> Hi Experts,
>>>
>>> We are thinking about whether to use Yarn or not in the near future, and
>>> I ran teragen/terasort on Yarn and MRv1 for comparison.
>>>
>>> My env is a three-node cluster, and each node has similar hardware: 2
>>> cpu (4 core), 32 GB mem. Both the Yarn and MRv1 clusters are set up on the same env. To
>>> be fair, I did not do any performance tuning on their configurations, but
>>> used the default configuration values.
>>>
>>> Before testing, I thought Yarn would be much better than MRv1, if they all
>>> use the default configuration, because Yarn is a better framework than MRv1.
>>> However, the test result shows some differences:
>>>
>>> MRv1: Hadoop-1.1.1
>>> Yarn: Hadoop-2.0.4
>>>
>>> (A) Teragen: generate 10 GB data:
>>> - MRv1: 193 sec
>>> - Yarn: 69 sec
>>> Yarn is 2.8 times better than MRv1
>>>
>>> (B) Terasort: sort 10 GB data:
>>> - MRv1: 451 sec
>>> - Yarn: 1136 sec
>>> Yarn is 2.5 times worse than MRv1
>>>
>>> After a fast analysis, I think the direct cause might be that Yarn is
>>> much faster than MRv1 in the Map phase, but much worse in the Reduce phase.
>>>
>>> Here I have two questions:
>>> - Why do my tests show Yarn is worse than MRv1 for terasort?
>>> - What's the strategy for tuning Yarn performance? Are there any materials?
>>>
>>> Thanks!
>>
>>
>>
>>
>> --
>> Marcos Ortiz Valmaseda
>> Product Manager at PDVSA
>> http://about.me/marcosortiz
>>
>



--
Harsh J
