Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of prabhu.hadoop@gmail.com
 designates 209.85.223.176 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <BLU0-SMTP25244738DF94E04B36BD788FAF0@phx.gbl>
References: 
 <CAL3=Dw2yokDnza43H6Dd6v3c1wOaZdoVZpSZ1+YgsVLatDjYvA@mail.gmail.com>
	<CC6D39E1.CF9B%clehene@adobe.com>
	<CAF-umFNiybKjBy4SXiKL6WELd6Av126eDCo9ARsJYbRqUHaqoA@mail.gmail.com>
	<BLU0-SMTP25244738DF94E04B36BD788FAF0@phx.gbl>
Date: Mon, 10 Sep 2012 13:10:04 +0530
Message-ID: 
 <CAL3=Dw2P9DFXLdUjGVSHvDs4ZpDyiwBXdO0AoSJ1DFR7uShVLw@mail.gmail.com>
Subject: Re: One petabyte of data loading into HDFS with in 10 min.
From: prabhu K <prabhu.hadoop@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=e89a8f3bac1de8707c04c9541006

--e89a8f3bac1de8707c04c9541006
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi Users,

Thanks for the response.


We loaded 100 GB of data into HDFS; it took 1 hour with the configuration
below.

Each node (1 master machine, 2 slave machines):

1.    500 GB hard disk.

2.    4 GB RAM

3.    3 quad-core CPUs.

4.    Speed 1333 MHz


Now we are planning to load 1 petabyte of data (a single file) into Hadoop
HDFS and a Hive table within 10-20 minutes. For this we need clarification
on the points below.

1. What system configuration is required for all 3 machines?

2. Hard disk size.

3. RAM size.

4. Mother board

5. Network cable

6. How many Gbps of InfiniBand are required?

Do we also need a cloud computing environment for the same setup?

Please suggest and help me on this.
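
To put the gap in numbers, here is a rough sketch comparing the rate we
observed (100 GB in 1 hour) with the rate the 10-20 minute target would
need. Decimal units (1 GB = 10**9 bytes) and the names in the snippet are
assumptions for illustration only, and no overhead is modelled.

```python
# Back-of-envelope comparison of the observed load rate with the target.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes).
GB = 10**9
PB = 10**15

# Observed: 100 GB loaded in 1 hour, expressed in bytes per second.
observed_rate = 100 * GB / 3600


def required_rate(size_bytes, minutes):
    """Sustained bytes/s needed to move size_bytes in the given minutes."""
    return size_bytes / (minutes * 60)


for minutes in (10, 20):
    target = required_rate(1 * PB, minutes)
    print(f"{minutes} min: {target / 1e12:.2f} TB/s, "
          f"{target / observed_rate:,.0f}x the observed rate")

# How long 1 PB would take if nothing changed.
days_at_observed_rate = 1 * PB / observed_rate / 86400
print(f"1 PB at the observed rate: {days_at_observed_rate:.0f} days")
```

Under these assumptions the 10-minute target needs roughly 60,000 times the
throughput of the current 3-machine setup.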

 Thanks,

Prabhu.
On Fri, Sep 7, 2012 at 7:30 PM, Michael Segel <michael_segel@hotmail.com> wrote:

> Sorry, but you didn't account for the network saturation.
>
> And why 1GbE and not 10GbE? Also which version of Hadoop?
>
> Here MapR works well with bonding two 10GbE ports and with the right
> switch, you could do ok.
> Also 2 ToR switches... per rack. etc...
>
> How many machines? 150? 300? more?
>
> Then you don't talk about how much memory, CPUs, what type of storage...
>
> Lots of factors.
>
> I'm sorry to interrupt this mental masturbation about how to load 1PB in
> 10min.
> There are a lot more questions that should have been asked but weren't.
>
> Hey, but look. It's a Friday, so I suggest some pizza, beer and then take it
> to a white board.
>
> But what do I know? In a different thread, I'm talking about how to tame
> HR and Accounting so they let me play with my team Ninja!
> :-P
>
> On Sep 5, 2012, at 9:56 AM, zGreenfelder <zgreenfelder@gmail.com> wrote:
>
> > On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <clehene@adobe.com> wrote:
> >> Here's an extremely naïve ballpark estimation: at theoretical hardware
> >> speed, for 3PB representing 1PB with 3x replication
> >>
> >> Over a single 1Gbps connection (and I'm not sure, you can actually reach
> >> 1Gbps)
> >> (3 petabytes) / (1 Gbps) = 291.271111 days
> >>
> >> So you'd need at least 40,000 1Gbps network cards to get that in 10 minutes
> >> :) - (3PB/1Gbps)/40000
> >>
> >> The actual number of nodes would depend a lot on the actual network
> >> architecture, the type of storage you use (SSD,  HDD), etc.
> >>
> >> Cosmin
> >
> > ah, I went the other direction with the math, and assumed no
> > replication (completely unsafe and never reasonable for a real,
> > production environment, but since we're all theory and just looking
> > for starting point numbers)
> >
> >
> > 1PB in 10 min ==
> > 1,000,000 GB in 10 min ==
> > 8,000,000 Gb in 600 seconds ==
> >
> > 80,000/6 ~= 14k machines running at gigabit or about 1.5k machines if you
> > get 10Gb connected machines.
> >
> > all assuming there's no network or cluster sync overhead
> > (of course there would be)
> >
> >
> > that seems like some pretty deep pockets to get to < 10 minute load
> > time for that much data.
> >
> > I could also be off, I just threw some stuff together somewhat
> > quickly between conf calls.
> >
> > --
> > Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
> >
>
>
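
The two back-of-envelope estimates quoted above can be reproduced with a
short Python sketch. The helper name `links_needed` and the unit choices
are assumptions made to match the quoted figures: binary units for the
3x-replicated estimate (which is what yields 291.27 days) and decimal units
for the unreplicated one. It counts ideal, fully saturated links only, so
real numbers would be higher once network and cluster overhead is included.

```python
def links_needed(data_bytes, seconds, link_bits_per_s, replication=1):
    """Ideal number of saturated links needed to move the data in time.

    Uses integer ceiling division so the result is exact for large values.
    """
    total_bits = data_bytes * replication * 8
    capacity_per_link = link_bits_per_s * seconds
    return -(-total_bits // capacity_per_link)


TEN_MINUTES = 600  # seconds

# Estimate with 3x replication, binary units (1 PB = 2**50 B, 1 Gbps = 2**30 b/s).
days_single_link = (2**50 * 3 * 8) / 2**30 / 86400
replicated_1g = links_needed(2**50, TEN_MINUTES, 2**30, replication=3)

# Estimate without replication, decimal units (1 PB = 10**15 B, 1 Gbps = 10**9 b/s).
plain_1g = links_needed(10**15, TEN_MINUTES, 10**9)
plain_10g = links_needed(10**15, TEN_MINUTES, 10**10)

print(f"single 1 Gbps link, 3x replication: {days_single_link:.2f} days")
print(f"1 Gbps links, 3x replication: {replicated_1g}")
print(f"1 Gbps links, no replication: {plain_1g}")
print(f"10 Gbps links, no replication: {plain_10g}")
```

Both quoted estimates land in the same range as this sketch: tens of
thousands of gigabit links, or on the order of a thousand-plus 10 Gbps
links, before any overhead.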

--e89a8f3bac1de8707c04c9541006--