hadoop-hdfs-user mailing list archives

From Gulfie <gul...@haruko.grotto-group.com>
Subject Re: One petabyte of data loading into HDFS with in 10 min.
Date Thu, 06 Sep 2012 20:52:53 GMT

Back up for a second.  Why would you want to do this and where does the data come from?

Is this a new PB of data every time, or is it a PB total with some new and some old?
Only migrating the deltas could help. 

Can the data migration/load have its latency hidden?  Is the PB of data ready all at once?
Is the first 100TB ready to be loaded long before the last 100TB is written/gathered/generated?

Is it possible to generate/gather the data into HDFS originally so there is no initial load
time penalty?

1PB / 10 minutes = roughly 13 Terabits / second throughput ( 3x that, about 40 Tb/sec, for naive data redundancy ).
That is a lot.
Not crazy a lot, but a lot.  Today's large core switches/routers can each do single-digit Tb/sec,
so you'd need a fleet of them or use OpenFlow.
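
As a sanity check on the arithmetic, here is a minimal sketch.  It assumes decimal units
(1 PB = 10^15 bytes), a 600 second window, and 3x replication (the HDFS default); the function
name is made up for illustration.  Under those assumptions it comes out to roughly 13 Tb/sec
raw and 40 Tb/sec with replication.

```python
# Back-of-the-envelope ingest rate. Decimal units are an assumption.
PETABYTE_BYTES = 10**15    # 1 PB, decimal
WINDOW_SECONDS = 10 * 60   # 10 minutes
REPLICATION = 3            # naive 3x redundancy (HDFS default replication factor)


def required_tbps(data_bytes, seconds, replication=1):
    """Aggregate throughput in terabits per second (hypothetical helper)."""
    return data_bytes * 8 * replication / seconds / 1e12


raw = required_tbps(PETABYTE_BYTES, WINDOW_SECONDS)
replicated = required_tbps(PETABYTE_BYTES, WINDOW_SECONDS, REPLICATION)
print(f"raw: {raw:.1f} Tb/s, with {REPLICATION}x replication: {replicated:.1f} Tb/s")
```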

Redundancy will require going across a node-to-node network of some sort, be it SAN, Ethernet
or whatever.  By building a special-purpose back-end replication network/nodes you may be able
to decrease network costs.

If you really want to push this much data around that quickly, the only type of network that
makes sense is one that avoids oversubscription.  Look into fat-tree networks as a start.

You'd be looking at tens of thousands of nodes running at gigabit, thousands of nodes running
at 10gig, or hundreds of nodes running InfiniBand (40gbit).
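
The node counts can be estimated the same way.  This is a rough sketch that assumes every node
sustains full line rate on its link, which is optimistic since it ignores disk and protocol
overhead; the helper name and the link labels are illustrative only.

```python
# Rough node-count estimate for loading 1 PB in 10 minutes.
# Assumes each node ingests at full link speed (optimistic).
DATA_BITS = 8 * 10**15     # 1 PB (decimal) in bits
WINDOW_SECONDS = 10 * 60   # 10 minutes

LINK_SPEEDS_BPS = {
    "1 Gbit Ethernet": 1 * 10**9,
    "10 Gbit Ethernet": 10 * 10**9,
    "40 Gbit InfiniBand": 40 * 10**9,
}


def nodes_needed(link_bps, replication=1):
    """Minimum nodes so that aggregate link capacity covers the load (hypothetical helper)."""
    total_bits = DATA_BITS * replication
    per_node_bits = link_bps * WINDOW_SECONDS
    return -(-total_bits // per_node_bits)  # integer ceiling division


for name, bps in LINK_SPEEDS_BPS.items():
    print(f"{name}: {nodes_needed(bps)} nodes raw, {nodes_needed(bps, 3)} with 3x replication")
```

Real clusters would need headroom on top of these figures, since no node keeps its link
saturated for the whole window.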

The biggest question is can you avoid having to do this much data migration?  Networks aren't
getting faster as fast as CPUs are.  A long-term architecture based on growing datasets and
data migration is looking for trouble.


On Wed, Sep 05, 2012 at 05:51:50PM +0530, prabhu K wrote:
> Hi Users,
> Please clarify the below questions.
> 1. With in 10 minutes one petabyte of data load into HDFS/HIVE , how many
> slave (Data Nodes) machines required.
> 2. With in 10 minutes one petabyte of data load into HDFS/HIVE, what is the
> configuration setup for cloud computing.
> Please suggest and help me on this.
> Thanks&Regards,
> Prabhu.
