Hello Shailesh,

      Give distcp a shot. It runs an MR job to copy data from the source to the destination, so the data can be copied in parallel.
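A minimal sketch of a DistCp invocation; the NameNode host names and paths below are placeholders, and -m (the number of map tasks, and hence the degree of parallelism) is an illustrative value:

```shell
# DistCp runs a MapReduce job in which each map task copies a subset of
# the source files, so the copy proceeds in parallel across the cluster.
# Host names and paths are placeholders; tune -m to your cluster size.
hadoop distcp -m 64 \
    hdfs://source-nn:8020/data/input \
    hdfs://dest-nn:8020/data/input
```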

    Mohammad Tariq

On Wed, Sep 5, 2012 at 7:44 PM, Shailesh Dargude <Shailesh_Dargude@symantec.com> wrote:

Sorry, Prabhu, for hijacking this discussion a bit. I wonder what the best practice is for loading data into HDFS in general. Considering the size of the data (often in GBs or TBs), how are storage and time constraints handled?


If anybody can share their experiences or best practices, it would be great!




From: Chen He [mailto:airbots@gmail.com]
Sent: Wednesday, September 05, 2012 7:34 PM
To: user@hadoop.apache.org
Subject: Re: One petabyte of data loading into HDFS with in 10 min.


If it is not a single file, you can upload the files to HDFS using multiple threads.
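A rough sketch of that multi-threaded approach in Python. The `hadoop fs -put` command is the standard HDFS CLI, but the target directory, file list, and worker count here are made-up examples; the uploader is passed in as a callable so the same pattern works with any copy command:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def hdfs_put(local_path, hdfs_dir="/data/incoming"):
    """Upload one local file to HDFS via the CLI (assumes hadoop is on PATH)."""
    subprocess.run(["hadoop", "fs", "-put", local_path, hdfs_dir], check=True)
    return local_path

def upload_all(paths, upload_fn, workers=8):
    """Run upload_fn over all paths with a thread pool; returns results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload_fn, paths))

# Hypothetical usage:
# upload_all(["part-0001", "part-0002", "part-0003"], hdfs_put, workers=4)
```

Threads work well here because each upload is I/O-bound, so Python's GIL is not a bottleneck.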

On Wed, Sep 5, 2012 at 7:21 AM, prabhu K <prabhu.hadoop@gmail.com> wrote:

Hi Users,


Please clarify the below questions.


1. To load one petabyte of data into HDFS/Hive within 10 minutes, how many slave machines (DataNodes) are required?


2. To load one petabyte of data into HDFS/Hive within 10 minutes, what configuration setup is required for cloud computing?


Please suggest and help me with this.
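For a sense of scale on question 1, a back-of-envelope calculation. The 1 GB/s sustained write rate per DataNode and the default 3x HDFS replication are assumptions for illustration, not measured figures:

```python
# Aggregate write bandwidth needed to land 1 PB in a 10-minute window.
PB = 10**15                      # 1 petabyte in bytes (decimal)
window_s = 10 * 60               # 10 minutes in seconds
required_bw = PB / window_s      # bytes/sec the cluster must absorb

per_node_bw = 10**9              # assumed 1 GB/s sustained write per DataNode
replication = 3                  # default HDFS replication triples the writes
nodes = required_bw * replication / per_node_bw

print(f"aggregate ingest: {required_bw / 10**9:.0f} GB/s")
print(f"DataNodes needed (1 GB/s each, 3x replication): {nodes:.0f}")
```

Under these assumptions the cluster must absorb roughly 1,667 GB/s and would need on the order of 5,000 DataNodes, before even considering the network needed to feed them.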