Subject: Re: One petabyte of data loading into HDFS within 10 min.
From: Michael Segel <michael_segel@hotmail.com>
Date: Mon, 10 Sep 2012 06:50:17 -0500
To: user@hadoop.apache.org

On Sep 10, 2012, at 2:40 AM, prabhu K wrote:

> Hi Users,
>
> Thanks for the response.
>
> We have loaded 100 GB of data into HDFS; it took 1 hr with the configuration below.
>
> Each node (1 machine is the master, 2 machines are slaves):
>
> 1. 500 GB hard disk.
> 2. 4 GB RAM.
> 3. 3 quad-core CPUs.
> 4. Speed 1333 MHz.
>
> Now we are planning to load 1 petabyte of data (a single file) into Hadoop HDFS and a Hive table within 10-20 minutes. For this we need clarification on the points below.

Ok...

Some say that I am sometimes too harsh in my criticisms, so take what I say with a grain of salt...

You loaded 100 GB in an hour using woefully underperforming hardware and are now saying you want to load 1 PB in 10 mins.

I would strongly suggest that you first learn more about Hadoop. No, really. Looking at your first machine, it's obvious that you don't really grok Hadoop and what it requires to achieve optimum performance. You couldn't even extrapolate any meaningful data from your current environment.

Secondly, I think you need to actually think about the problem. Did you mean PB or TB? Because your math seems to be off by a couple of orders of magnitude.

A single file measured in PBs? That is currently impossible using today's (2012) technology. In fact, a single file measured in PBs won't exist within the next 5 years, and most likely not within the next decade. [Moore's law is all about CPU power, not disk density.]

Also take a look at networking.
ToR switch designs differ, but with current technology the fabric tends to max out at around 40 Gb/s.
What's the widest fabric on a backplane?
That's your first bottleneck, because even if you had 1 PB of data, you couldn't feed it to the cluster fast enough.

Forget disk; look at PCIe-based memory. (Money is no object, right?)
You still couldn't populate it fast enough.

I guess Steve hit the nail on the head when he talked about this being a homework assignment.

High school, maybe?

> 1. What system configuration setup is required for all 3 machines?
>
> 2. Hard disk size.
>
> 3. RAM size.
>
> 4. Motherboard.
>
> 5. Network cable.
>
> 6. How much Gbps InfiniBand is required?
>
> For the same setup, do we need a cloud computing environment too?
>
> Please suggest and help me on this.
>
> Thanks,
>
> Prabhu.
>
> On Fri, Sep 7, 2012 at 7:30 PM, Michael Segel <michael_segel@hotmail.com> wrote:
> Sorry, but you didn't account for the network saturation.
>
> And why 1 GbE and not 10 GbE? Also, which version of Hadoop?
>
> Here MapR works well with bonding two 10 GbE ports, and with the right switch you could do ok.
> Also 2 ToR switches... per rack, etc...
>
> How many machines? 150? 300? More?
>
> Then you don't talk about how much memory, CPUs, what type of storage...
>
> Lots of factors.
>
> I'm sorry to interrupt this mental masturbation about how to load 1 PB in 10 min.
> There are a lot more questions that should be asked that weren't.
>
> Hey, but look. It's a Friday, so I suggest some pizza and beer, and then take it to a whiteboard.
>
> But what do I know? In a different thread, I'm talking about how to tame HR and Accounting so they let me play with my team Ninja!
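[A back-of-envelope sketch of the two bottlenecks argued above; these numbers are my own illustration, not from the thread, and use decimal units (1 PB = 10^15 bytes) with no protocol overhead:]

```python
# Two sanity checks on the "1 PB in 10 minutes" goal (illustrative only).

PB = 1e15   # bytes, decimal units
GB = 1e9    # bytes

# 1) Extrapolate the reported load rate: 100 GB took 1 hour.
rate_bytes_per_s = 100 * GB / 3600
time_for_1pb_days = (PB / rate_bytes_per_s) / 86400
print(f"At 100 GB/hr, loading 1 PB takes ~{time_for_1pb_days:.0f} days")  # ~417 days

# 2) Even if the data were staged and ready, a single 40 Gb/s switch
#    fabric cannot ingest 1 PB in anything close to 10 minutes.
fabric_bits_per_s = 40e9
time_through_fabric_hours = (PB * 8) / fabric_bits_per_s / 3600
print(f"1 PB through a 40 Gb/s fabric: ~{time_through_fabric_hours:.1f} hours")  # ~55.6 hours
```

Either number alone is roughly three to four orders of magnitude away from the 10-minute target, which is the point being made here.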
> :-P
>
> On Sep 5, 2012, at 9:56 AM, zGreenfelder <zgreenfelder@gmail.com> wrote:
>
> > On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <clehene@adobe.com> wrote:
> >> Here's an extremely naïve ballpark estimation: at theoretical hardware
> >> speed, for 3 PB representing 1 PB with 3x replication.
> >>
> >> Over a single 1 Gbps connection (and I'm not sure you can actually reach
> >> 1 Gbps):
> >> (3 petabytes) / (1 Gbps) = 291.271111 days
> >>
> >> So you'd need at least 40,000 1 Gbps network cards to get that in 10 minutes
> >> :) - (3 PB / 1 Gbps) / 40,000
> >>
> >> The actual number of nodes would depend a lot on the actual network
> >> architecture, the type of storage you use (SSD, HDD), etc.
> >>
> >> Cosmin
> >
> > Ah, I went the other direction with the math, and assumed no
> > replication (completely unsafe and never reasonable for a real
> > production environment, but since we're all theory and just looking
> > for starting-point numbers):
> >
> > 1 PB in 10 min ==
> > 1,000,000 GB in 10 min ==
> > 8,000,000 Gb in 600 seconds
> >
> > 8,000,000 / 600 ~= 14k machines running at gigabit, or about 1.5k machines if you
> > get 10 Gb connected machines.
> >
> > All assuming there's no network or cluster sync overhead
> > (of course there would be).
> >
> > That seems like some pretty deep pockets to get to a < 10 minute load
> > time for that much data.
> >
> > I could also be off; I just threw some stuff together somewhat
> > quickly between conf calls.
> >
> > --
> > Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
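[The two quoted estimates agree once the units line up; a sketch of mine, not from the thread, again using decimal units and ignoring replication-pipeline and protocol overhead:]

```python
# Reproducing the thread's ballpark NIC counts for "1 PB in 10 minutes".

PB_BITS = 1e15 * 8   # 1 PB in bits, decimal units
WINDOW_S = 10 * 60   # the 10-minute target

# Aggregate ingest rate the cluster edge must sustain.
required_gbps = PB_BITS / WINDOW_S / 1e9
print(f"Aggregate ingest rate needed: ~{required_gbps:,.0f} Gb/s")  # ~13,333 Gb/s

# Without replication (zGreenfelder's direction):
nics_1g = required_gbps / 1     # ~13,333 gigabit NICs -> "about 14k machines"
nics_10g = required_gbps / 10   # ~1,333 10 GbE NICs  -> "about 1.5k machines"

# With 3x replication you move 3 PB over the wire (Cosmin's direction):
nics_1g_repl = 3 * required_gbps
print(f"1 GbE NICs: ~{nics_1g:,.0f}; 10 GbE: ~{nics_10g:,.0f}; "
      f"with 3x replication: ~{nics_1g_repl:,.0f} 1 Gbps cards")
```

That reconciles the "14k machines" and "40,000 network cards" figures: they are the same arithmetic, with and without the 3x replication factor.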