Subject: Re: Spindle per Cores
From: ranjith raghunath <ranjith.raghunath1@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 12 Oct 2012 22:27:50 -0500

Thanks Michael.

On Oct 12, 2012 9:59 PM, "Michael Segel" wrote:
> I think what we are seeing is the ratio based on physical Xeon cores.
> So hyper-threading wouldn't make any change to the actual ratio.
> (1 disk per physical core would be 1 disk per 2 virtual cores.)
>
> Again YMMV, and of course thanks to this guy Moore who decided to write
> some weird laws... the ratio could change over time as the CPUs become
> more efficient and faster.
>
> On Oct 12, 2012, at 9:52 PM, ranjith raghunath <ranjith.raghunath1@gmail.com> wrote:
>
> Does hyperthreading affect this ratio?
> On Oct 12, 2012 9:36 PM, "Michael Segel" wrote:
>
>> First, the obvious caveat... YMMV
>>
>> Having said that.
>>
>> The key here is to take a look across the various jobs that you will run.
>> Some may be more CPU intensive, others more I/O intensive.
>>
>> If you monitor these jobs via Ganglia, when you have too few spindles you
>> should see the wait CPU rise on the machines in the cluster. That is to
>> say that you are putting an extra load on the systems because you're
>> waiting for the disks to catch up.
>>
>> If you increase the ratio of disks to CPU, you should see that load drop
>> as you are not wasting CPU cycles.
>>
>> Note that it's not just the number of spindles; the bus and the
>> controller cards can also affect the throughput of disk I/O.
>>
>> Now just IMHO, there was a discussion on some of the CPU recommendations.
>> To a point, it doesn't matter that much. You want to maximize the bang
>> for the buck you can get with your hardware purchase.
>>
>> Use the ratio as a buying guide. Fewer than a ratio of 1 disk per core,
>> and you're wasting the CPU that you bought.
>>
>> Going higher than a ratio of 1, like 1.5, and you may be buying too many
>> spindles and not see a performance gain that offsets your cost.
>>
>> Search for a happy medium and don't sweat the maximum performance that
>> you may get.
>>
>> HTH
>>
>> On Oct 12, 2012, at 4:19 PM, Jeffrey Buell <jbuell@vmware.com> wrote:
>>
>> > I've done some experiments along these lines. I'm using
>> > high-performance 15K RPM SAS drives instead of the more usual SATA
>> > drives, which should reduce the number of drives I need. I have dual
>> > 4-core processors at 3.6 GHz. These are more powerful than the average
>> > 4-core processor, which should increase the number of drives I need.
>> > Assuming these 2 effects cancel, my results should also apply to
>> > machines with SATA drives and average processors. Using 8 drives (1:1)
>> > gets good performance for teragen and terasort. Going to 12 drives
>> > (1.5 per core) increases terasort performance by 15%. That might not
>> > seem like much compared to increasing the number of drives by 50%, but
>> > a better comparison is that 4 extra drives increased the cost of each
>> > machine by only about 12%, so the extra drives are (barely) worth it.
>> > If you're more time sensitive than cost sensitive, then they're
>> > definitely worth it. The extra drives did not help teragen, apparently
>> > because both the CPU and the internal storage controller were close to
>> > saturation. So, of course, everything depends on the app. You're
>> > shooting for saturated CPUs and disk bandwidth. Check that the CPU is
>> > not saturated (after checking Hadoop tuning and optimizing the number
>> > of tasks). Check that you have enough memory for more tasks with room
>> > left over for a large buffer cache. Use 10 GbE networking or make sure
>> > the network has enough headroom. Check that the storage controller can
>> > handle more bandwidth. If all are true (that is, no other bottlenecks),
>> > consider adding more drives.
>> >
>> > Jeff
>> >
>> >> -----Original Message-----
>> >> From: Hank Cohen [mailto:hank.cohen@altior.com]
>> >> Sent: Friday, October 12, 2012 1:46 PM
>> >> To: user@hadoop.apache.org
>> >> Subject: RE: Spindle per Cores
>> >>
>> >> What empirical evidence is there for this rule of thumb?
>> >> In other words, what tests or metrics would indicate an optimal
>> >> spindle/core ratio, and how dependent is this on the nature of the
>> >> data and of the map/reduce computation?
>> >>
>> >> My understanding is that there are lots of clusters with more
>> >> spindles than cores. Specifically, typical 2U servers can hold 12
>> >> 3.5" disk drives, so lots of Hadoop clusters have dual 4-core
>> >> processors and 12 spindles. Would it be better to have 6-core
>> >> processors if you are loading up the boxes with 12 disks? And most
>> >> importantly, how would one know that the mix was optimal?
>> >>
>> >> Hank Cohen
>> >> Altior Inc.
>> >>
>> >> -----Original Message-----
>> >> From: Patai Sangbutsarakum [mailto:silvianhadoop@gmail.com]
>> >> Sent: Friday, October 12, 2012 10:46 AM
>> >> To: user@hadoop.apache.org
>> >> Subject: Spindle per Cores
>> >>
>> >> I have read around about the hardware recommendations for a Hadoop
>> >> cluster. One of them recommends a 1:1 ratio between spindles and
>> >> cores.
>> >>
>> >> Intel CPUs come with Hyper-Threading, which doubles the number of
>> >> cores on one physical CPU, e.g. 8 cores become 16 with
>> >> Hyper-Threading, which is where we start when calculating the number
>> >> of task slots per node.
>> >>
>> >> Once it comes to spindles, I believe I should count the 8 physical
>> >> cores and pick 8 disks in order to get a 1:1 ratio.
>> >>
>> >> Please suggest.
>> >> Patai
>> >>
>> >
>> >
>>
>
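Jeff's cost/benefit arithmetic upthread (a 15% terasort gain for roughly 12% extra machine cost when going from 8 to 12 drives) can be sketched as a quick calculation. The helper below is only an illustration of that arithmetic, not anything from the thread:

```python
# Numbers from Jeff's reply: going from 8 to 12 drives (+50% spindles)
# gave about +15% terasort throughput for about +12% machine cost.
def perf_per_dollar(relative_perf, relative_cost):
    """Relative performance per unit cost, normalized to a baseline config."""
    return relative_perf / relative_cost

base = perf_per_dollar(1.00, 1.00)      # 8 drives, 1 disk per physical core
upgraded = perf_per_dollar(1.15, 1.12)  # 12 drives, 1.5 disks per core

# The 12-drive box delivers slightly more performance per dollar.
print(f"{upgraded / base:.3f}")  # prints 1.027
```

On Jeff's numbers the 12-drive configuration comes out a couple of percent ahead in performance per dollar, which matches his "barely worth it" conclusion.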
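To make the thread's rule of thumb concrete, here is a minimal sketch; the function is hypothetical and just encodes "count physical cores, not hyper-threaded ones":

```python
def disks_for_ratio(physical_cores, disks_per_core=1.0):
    """Spindle count suggested by the rule of thumb.

    Counts physical cores only: hyper-threading doubles the logical core
    count used for task-slot sizing, but not the core count used for the
    disk ratio.
    """
    return round(physical_cores * disks_per_core)

# Patai's dual 4-core box: 16 logical CPUs with Hyper-Threading enabled,
# but the ratio is figured on the 8 physical cores.
print(disks_for_ratio(8))       # 1:1 ratio  -> 8 disks
print(disks_for_ratio(8, 1.5))  # 1.5:1 ratio (Jeff's test) -> 12 disks
```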