From: Siddharth Tiwari <siddharth.tiwari@live.com>
To: user@hadoop.apache.org
Subject: RE: One petabyte of data loading into HDFS within 10 min.
Date: Mon, 10 Sep 2012 19:22:57 +0000
Well, can't you load only the incremental data? The goal seems quite unrealistic. The big guns have already spoken :P


*------------------------*
Cheers !!!
Siddharth Tiwari
Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God."
"Maybe other people will try to limit me but I don't limit myself"



From: Alex.Gauthier@Teradata.com
To: user@hadoop.apache.org; mike.segel@thinkbiganalytics.com
Subject: RE: One petabyte of data loading into HDFS within 10 min.
Date: Mon, 10 Sep 2012 16:17:20 +0000

Well said, Mike. Lots of “funny questions” around here lately…


From: Michael Segel [mailto:michael_segel@hotmail.com]
Sent: Monday, September 10, 2012 4:50 AM
To: user@hadoop.apache.org
Cc: Michael Segel
Subject: Re: One petabyte of data loading into HDFS within 10 min.


On Sep 10, 2012, at 2:40 AM, prabhu K <prabhu.hadoop@gmail.com> wrote:



Hi Users,

Thanks for the response.

We have loaded 100GB of data into HDFS in 1 hour with the configuration below.

Each node (1 master machine, 2 slave machines):

1. 500 GB hard disk
2. 4 GB RAM
3. 3 quad-core CPUs
4. 1333 MHz clock speed

Now, we are planning to load 1 petabyte of data (a single file) into HDFS and a Hive table within 10-20 minutes. For this we need clarification on the points below.

Ok...

Some say that I am sometimes too harsh in my criticisms, so take what I say with a grain of salt...

You loaded 100GB in an hour using woefully underperforming hardware and are now saying you want to load 1PB in 10 minutes.

I would strongly suggest that you first learn more about Hadoop. No, really. Looking at your first machine, it's obvious that you don't really grok Hadoop and what it requires to achieve optimum performance. You couldn't even extrapolate any meaningful data from your current environment.

Secondly, I think you need to actually think about the problem. Did you mean PB or TB? Your math seems to be off by a couple of orders of magnitude.
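
To put a number on that gap, here is a rough back-of-the-envelope sketch (decimal units, 1 PB = 1,000,000 GB, assumed purely for illustration):

    # Gap between the measured load rate (100 GB in 1 hour) and the
    # stated goal (1 PB in 10 minutes), in decimal units.
    measured_rate = 100 / 3600            # GB/s actually achieved (~0.028)
    required_rate = 1_000_000 / 600       # GB/s needed (~1,667)
    print(f"measured: {measured_rate:.3f} GB/s")
    print(f"required: {required_rate:,.0f} GB/s")
    print(f"speedup needed: {required_rate / measured_rate:,.0f}x")  # 60,000x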

A single file measured in PBs? That is currently impossible using today's (2012) technology. In fact, a single file measured in PBs won't exist within the next 5 years, and most likely not within the next decade. [Moore's law is all about CPU power, not disk density.]

Also take a look at networking.

ToR switch designs differ, but with current technology the fabric tends to max out at 40Gb/s. What's the widest fabric on a backplane?

That's your first bottleneck: even if you had 1PB of data, you couldn't feed it to the cluster fast enough.
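
For scale, a rough sketch (decimal units assumed) of pushing 1PB through a single 40Gb/s fabric:

    # Time to move 1 PB through one 40 Gb/s fabric, ignoring
    # replication and all protocol overhead.
    pb_bits = 1e15 * 8          # 1 PB in bits
    fabric_bps = 40e9           # 40 Gb/s
    hours = pb_bits / fabric_bps / 3600
    print(f"{hours:.0f} hours")  # ~56 hours, nowhere near 10 minutes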

Forget disk. Look at PCIe-based memory. (Money is no object, right?)

You still couldn't populate it fast enough.

I guess Steve hit the nail on the head when he talked about this being a homework assignment.

High school, maybe?




1. What system configuration is required for all 3 machines?
2. Hard disk size.
3. RAM size.
4. Motherboard.
5. Network cable.
6. How much Gbps InfiniBand is required?

For the same setup, do we need a cloud computing environment too?

Please advise and help me with this.

Thanks,

Prabhu.

On Fri, Sep 7, 2012 at 7:30 PM, Michael Segel <michael_segel@hotmail.com> wrote:

Sorry, but you didn't account for the network saturation.

And why 1GbE and not 10GbE? Also, which version of Hadoop?

Here, MapR works well bonding two 10GbE ports, and with the right switch you could do OK.
Also 2 ToR switches per rack, etc...

How many machines? 150? 300? more?

Then you don't talk about how much memory, CPUs, what type of storage...

Lots of factors.
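
As one hypothetical data point (figures assumed for illustration, not measured anywhere in this thread): if every node could ingest at a bonded 2 x 10GbE line rate, the node count for 1PB in 10 minutes works out as:

    # Nodes needed for 1 PB in 10 minutes if each node ingests at a
    # bonded 2 x 10 GbE rate (~20 Gb/s), with no replication or overhead.
    aggregate_bps = 1e15 * 8 / 600        # ~13.3 Tb/s required
    per_node_bps = 2 * 10e9               # bonded 2 x 10 GbE
    print(f"{aggregate_bps / per_node_bps:.0f} nodes")   # ~667 nodes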

I'm sorry to interrupt this mental masturbation about how to load 1PB in 10 min.
There are a lot more questions that should have been asked but weren't.

Hey, but look. It's a Friday, so I suggest some pizza and beer, and then take it to a whiteboard.

But what do I know? In a different thread, I'm talking about how to tame HR and Accounting so they let me play with my team Ninja!
:-P


On Sep 5, 2012, at 9:56 AM, zGreenfelder <zgreenfelder@gmail.com> wrote:

> On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <clehene@adobe.com> wrote:
>> Here's an extremely naïve ballpark estimation: at theoretical hardware
>> speed, for 3PB representing 1PB with 3x replication
>>
>> Over a single 1Gbps connection (and I'm not sure you can actually reach
>> 1Gbps)
>> (3 petabytes) / (1 Gbps) = 291.271111 days
>>
>> So you'd need at least 40,000 1Gbps network cards to get that in 10 minutes
>> :) - (3PB/1Gbps)/40000
>>
>> The actual number of nodes would depend a lot on the actual network
>> architecture, the type of storage you use (SSD, HDD), etc.
>>
>> Cosmin
>
> ah, I went the other direction with the math, and assumed no
> replication (completely unsafe and never reasonable for a real,
> production environment, but since we're all theory and just looking
> for starting point numbers)
>
>
> 1PB in 10 min ==
> 1,000,000 GB in 10 min ==
> 8,000,000 Gb in 600 seconds ==
>
> 80,000/6 ~= 14k machines running at gigabit, or about 1.5k machines if you
> get 10Gb connected machines.
>
> all assuming there's no network or cluster sync overhead
> (of course there would be)
>
>
> that seems like some pretty deep pockets to get to < 10 minute load
> time for that much data.
>
> I could also be off, I just threw some stuff together somewhat
> quickly, between conf calls.
>
> --
> Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
>
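
Both ballparks above hold up. A quick sketch reproducing them in decimal units (1 PB = 1e15 bytes, assumed; the 291-day figure quoted above matches binary-prefix units, 2^50 bytes and 2^30 bit/s, exactly):

    # Reproduce Cosmin's and zGreenfelder's estimates in decimal units.
    PB_BITS = 1e15 * 8                    # one petabyte, in bits
    TEN_MIN = 600                         # seconds

    # Cosmin: 3 PB (1 PB with 3x replication) over a single 1 Gbps link.
    days = 3 * PB_BITS / 1e9 / 86400
    print(f"3 PB over 1 Gbps: {days:.1f} days")          # ~277.8 decimal
                                                         # (291.27 with binary units)
    cards = 3 * PB_BITS / 1e9 / TEN_MIN
    print(f"1 Gbps cards for 10 minutes: {cards:,.0f}")  # 40,000

    # zGreenfelder: 1 PB, no replication.
    machines = PB_BITS / 1e9 / TEN_MIN
    print(f"1 GbE machines: {machines:,.0f}")            # ~13,333 (~14k)
    print(f"10 GbE machines: {machines / 10:,.0f}")      # ~1,333 (~1.5k)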
