Subject: Re: One petabyte of data loading into HDFS within 10 min.
From: Steve Loughran
To: user@hadoop.apache.org
Date: Mon, 10 Sep 2012 10:40:55 +0100

On 10 September 2012 08:40, prabhu K wrote:

> Hi Users,
>
> Thanks for the response.
>
> We have loaded 100 GB of data into HDFS; it took 1 hour with the
> configuration below.
>
> Each node (1 master machine, 2 slave machines):
>
> 1. 500 GB hard disk.
> 2. 4 GB RAM.
> 3. 3 quad-core CPUs.
> 4. Speed 1333 MHz.
>
> Now we are planning to load 1 petabyte of data (a single file) into
> Hadoop HDFS and a Hive table within 10-20 minutes. For this we need
> clarification on the points below.
>
> 1. What system configuration is required for all 3 machines?
>
> 2. Hard disk size.

At least a petabyte, maybe three if you were planning to do some
pre-storage processing, such as filtering or compressing the data,
before the upload.

> 3. RAM size.
>
> 4. Motherboard.
>
> 5. Network cable.
>
> 6. How much Gbps InfiniBand is required?

Yes.
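Back-of-the-envelope, and only a sketch: the inputs below are the figures
from this thread, plus two assumptions of mine (the stock HDFS replication
factor of 3, and 40 Gbit/s as an illustrative link size).

    # Bandwidth needed for "1 PB in 10 minutes".
    # Decimal units: 1 PB = 10**15 bytes. Protocol overhead ignored.

    data_bytes = 10 ** 15          # 1 PB, the figure quoted above
    window_s = 10 * 60             # 10 minutes

    bytes_per_s = data_bytes / window_s        # ~1.7e12 B/s sustained ingest
    gbit_per_s = bytes_per_s * 8 / 1e9         # ~13,000 Gbit/s on the wire

    # Assumption: default HDFS replication of 3, so disks and the internal
    # network see roughly three times the raw stream.
    replicated_gbit_per_s = gbit_per_s * 3

    # Assumption: 40 Gbit/s links, just to give the raw stream a unit count.
    links_40g = gbit_per_s / 40

    print("%.1f TB/s, %.0f Gbit/s raw, %.0f Gbit/s with 3x replication, "
          "~%.0f x 40 Gbit/s links"
          % (bytes_per_s / 1e12, gbit_per_s, replicated_gbit_per_s, links_40g))

That prints roughly 1.7 TB/s and 13,000 Gbit/s, before replication. That is
the scale of the networking question you are actually asking.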
> For the same setup do we need a cloud computing environment too?
>
> Please suggest and help me on this.
>
> Thanks,

Prabhu, I don't think you've been reading the replies fully.

The data rate coming off the filtered CERN LHC experiments is 1.6 PB/month.
Your "10 minute" upload is trying to handle two weeks' worth of CERN data
in a fraction of that time.

Nobody can seriously point to your questions and say "this is the
motherboard you need", as your project seems to have some unrealistic
goals. If you do want to do a 1 PB upload in 10 minutes -or even, say,
30-60 minutes- the first actions in your project should be:

1. Come up with some realistic deliverables rather than a vague
   "1 PB/10 minute" requirement.
2. Include a realistic timetable as part of those deliverables.
3. Look at the data source(s) and work out how fast they can actually
   generate data off their hard disks, out of their database, or whatever.
   That's your maximum bandwidth, irrespective of what you do with the
   data afterwards.
4. Hire someone who knows about these problems and how to solve them -or
   who at least is respected enough that when they say "you need realistic
   goals" they'd be believed.

Someone could set up a network to transfer 1 PB of data into a Hadoop
cluster in 10 minutes, but it would be a bleeding-edge exercise you'd end
up writing papers about in VLDB or similar conferences.

The cost of doing so would be utterly excessive unless you were planning
to load (and then, hopefully, discard) another PB in the next 10 minutes
-and again, repeatedly. Otherwise you would be paying massive amounts for
network bandwidth that would only ever be used for ten minutes.

Asking for help on the -user list isn't going to solve your problems, as
the "1 PB in 10 minutes" goal is the problem. Do you really need all that
data? In 10 minutes? If so, then you're going to have to find someone who
really, really knows about networking, disk I/O bandwidth, cluster
commissioning, etc. I'm not volunteering. I may have some colleagues you
could talk to, but that -as with other people on this list- would be in
the category of action 5, "pay for consultancy".

Sorry.
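PS: the same arithmetic against the 100 GB-in-one-hour figure you quoted,
again only a sketch, and the only inputs are the two numbers from this
thread.

    # Quoted baseline vs. the stated target.

    baseline_bytes_per_s = 100e9 / 3600.0   # 100 GB loaded in 1 hour  => ~28 MB/s
    target_bytes_per_s = 1e15 / 600.0       # 1 PB in 10 minutes       => ~1.7 TB/s

    speedup_needed = target_bytes_per_s / baseline_bytes_per_s   # ~60,000x

    hours_at_baseline = 1e15 / 100e9        # ~10,000 hours, i.e. over a year

    print("speed-up needed ~%.0fx; at today's rate the load takes ~%.0f hours"
          % (speedup_needed, hours_at_baseline))

A sixty-thousand-fold speed-up is the gap between the hardware questions
above and the goal.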
