hadoop-mapreduce-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject HDFS question
Date Tue, 28 Jan 2014 16:42:59 GMT
Hello,

I am new to Hadoop and HDFS, so maybe I am not understanding things properly,
but I have run into the following issue:

I have set up a name node and a number of data nodes for HDFS. Each node
contributes 1.6TB of space, so the total capacity shown on the HDFS web front
end is about 25TB. I have set the replication factor to 3.
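For reference, replication is controlled by dfs.replication in hdfs-site.xml;
assuming my /etc/hadoop config directory, this is roughly how I check that it
is really set to 3:

 grep -A 1 'dfs.replication' /etc/hadoop/hdfs-site.xml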

I am downloading large files from Amazon's S3 onto the cluster using the
distcp command, run from a single data node - like this:

 hadoop --config /etc/hadoop distcp
s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json
hdfs://10.10.0.198:54310/test/2013-12-03.json

where 10.10.0.198 is the Hadoop name node.
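To see where the blocks of the copied file end up, I have been running fsck
against it (assuming fsck is the right tool to look at block placement):

 hadoop --config /etc/hadoop fsck /test/2013-12-03.json -files -blocks -locations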

What I am seeing is that the machine I run these commands on (one of the
data nodes) ends up holding all of the data - the blocks do not seem to be
"spreading" around the HDFS cluster.

Is this expected? Did I completely misunderstand the point of a parallel
DISTRIBUTED file system? :)

Thanks!
Ognen
