hadoop-user mailing list archives

From "D'Souza, Clive V" <clive.v.d'so...@intel.com>
Subject RE: One petabyte of data loading into HDFS with in 10 min.
Date Wed, 05 Sep 2012 14:58:57 GMT
Have you looked at using Infiniband fabric? You can get 4X higher throughput than 10GbE.



Regards,
-C
 

-----Original Message-----
From: zGreenfelder [mailto:zgreenfelder@gmail.com] 
Sent: Wednesday, September 05, 2012 7:57 AM
To: user@hadoop.apache.org
Subject: Re: One petabyte of data loading into HDFS with in 10 min.

On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <clehene@adobe.com> wrote:
> Here's an extremely naïve ballpark estimation: at theoretical hardware 
> speed, for 3PB representing 1PB with 3x replication
>
> Over a single 1Gbps connection (and I'm not sure, you can actually reach
> 1Gbps)
> (3 petabytes) / (1 Gbps) = 291.271111 days
>
> So you'd need at least 40,000 1Gbps network cards to get that in 10 minutes
> :) - (3PB/1Gbps)/40000
>
> The actual number of nodes would depend a lot on the actual network 
> architecture, the type of storage you use (SSD,  HDD), etc.
>
> Cosmin
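For anyone wanting to check the quoted figure, here is a rough sketch that reproduces it. It assumes binary prefixes (1 PB = 2**50 bytes, 1 Gbps = 2**30 bits/s), which is the only reading I found that gives exactly 291.271111 days; the constant and function names are made up for illustration.

```python
# Assumption: binary prefixes, which is what reproduces the quoted 291.271111 days.
PB_BYTES = 2**50    # bytes in one petabyte (binary)
GBPS_BITS = 2**30   # bits per second on one 1Gbps link (binary)

def transfer_seconds(petabytes, links=1):
    """Seconds to move `petabytes` over `links` parallel links at line rate."""
    bits = petabytes * PB_BYTES * 8
    return bits / (links * GBPS_BITS)

# 3 PB = 1 PB of data with 3x replication, as in the quoted estimate.
days_single_link = transfer_seconds(3) / 86400
links_for_10_min = transfer_seconds(3) / 600

print(f"{days_single_link:.6f} days")    # 291.271111 days
print(f"{links_for_10_min:.0f} links")   # 41943 links
```

Under those assumptions the 10 minute target works out to roughly 42,000 links, a little above the quoted 40,000, and that is before any protocol or disk overhead.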

ah, I went the other direction with the math, and assumed no replication (completely unsafe
and never reasonable for a real, production environment, but we're all theory here and just
looking for starting point numbers)


1PB in 10 min ==
1,000,000 GB in 10 min ==
8,000,000 Gb in 600 seconds ==
80,000/6 Gb/s ~= 13.3k Gb/s

so ~= 14k machines running at gigabit, or about 1.4k machines if you get 10Gb connected
machines.

all assuming there's no network or cluster sync overhead (of course there would be)
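If anyone wants to play with the numbers, the same back-of-envelope math can be written as a small function. This is only a sketch: it assumes decimal units (1 PB = 10**15 bytes, 1 Gb/s = 10**9 bits/s), one fully saturated link per machine, and zero overhead; `machines_needed` and its parameters are names I made up.

```python
import math

def machines_needed(data_bytes, seconds, link_gbps, replication=1):
    """Machines required if each one saturates a single link of `link_gbps` Gb/s.

    Assumes decimal units and no network, disk, or cluster sync overhead.
    """
    total_bits = data_bytes * 8 * replication
    required_bps = total_bits / seconds
    return math.ceil(required_bps / (link_gbps * 1e9))

ONE_PB = 10**15  # decimal petabyte, in bytes

print(machines_needed(ONE_PB, 600, 1))                 # 13334 (gigabit)
print(machines_needed(ONE_PB, 600, 10))                # 1334  (10Gb)
print(machines_needed(ONE_PB, 600, 1, replication=3))  # 40000 (3x replication)
```

Bumping `replication` to 3 lands on 40,000 gigabit links, so the two estimates in this thread agree once the same assumptions go in.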


that seems like some pretty deep pockets to get to < 10 minute load time for that much
data.

I could also be off, I just threw some stuff together somewhat quickly between conf calls.

--
Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
