hadoop-hdfs-user mailing list archives

From Thulasi Ram Naidu Peddineni <thulasiram...@gmail.com>
Subject Fetching data from S3 to EC2 cluster
Date Sat, 01 Oct 2011 12:38:38 GMT
Hi All,
     I have around 2.5 GB of data stored in S3. To run EMR
jobs on this data, I am downloading the data from S3 to HDFS using:

hadoop distcp s3://<LOCATION> /tmp/

I am using 9 c1.xlarge instances (8 virtual cores with 2.5 EC2 Compute
Units each), which means that I have 72 cores available.
Hadoop takes nearly 7 minutes to execute the above command, and the
actual MapReduce job for distcp starts only after about 5 minutes.

I tried to increase the number of map tasks using the "-m" option, but
it is still taking 7 minutes. Can someone suggest the best way to
download data from S3 to HDFS that makes use of all the available
machines?
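
For concreteness, here is a rough sketch of the kind of "-m" invocation
meant above. It assumes one map task per virtual core (9 x 8 = 72); the
exact value tried is not stated, so 72 is only an illustration. The
script builds and prints the command rather than running it, and
<LOCATION> remains a placeholder for the real bucket path.

```shell
#!/bin/sh
# Cluster shape taken from the message: 9 c1.xlarge nodes, 8 virtual cores each.
NODES=9
CORES_PER_NODE=8

# Assumption: request one map per core. distcp's -m sets the maximum
# number of simultaneous copies, so this is an upper bound, not a guarantee.
MAPS=$((NODES * CORES_PER_NODE))

# Build the command as a string; <LOCATION> must be replaced before running it.
CMD="hadoop distcp -m ${MAPS} s3://<LOCATION> /tmp/"
echo "$CMD"
```

Since "-m" is only an upper bound on simultaneous copies, the number of
maps that actually run may be lower.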

Thulasi Ram P
