hadoop-common-user mailing list archives

From Kevin <kevin.macksa...@gmail.com>
Subject Copying many files to HDFS
Date Fri, 13 Feb 2015 13:28:47 GMT

I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
or so files, totaling roughly 1 TB, into HDFS. The cluster will be
isolated on its own private LAN, with a single client machine that is
connected to both the Hadoop cluster and the public network. The data
that needs to be copied into HDFS is mounted via NFS on the client
machine.

I can run `hadoop fs -put` concurrently on the client machine to try to
increase the throughput.
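
To make the concurrent approach concrete, here is a minimal sketch of what
I have in mind, not a tested tool. It starts one `hadoop fs -put` process
per file from a small thread pool. The function names, the worker count,
and the `run` parameter are my own placeholders for illustration; the only
real command used is `hadoop fs -put <local file> <HDFS directory>`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_put_command(local_file, hdfs_dir):
    """Return the argv list for uploading one file with `hadoop fs -put`."""
    return ["hadoop", "fs", "-put", str(local_file), hdfs_dir]


def put_files(local_files, hdfs_dir, workers=8, run=subprocess.call):
    """Upload files concurrently and return the files whose upload failed.

    `run` takes an argv list and returns an exit code. It defaults to
    subprocess.call and can be swapped out for a dry run (an assumption
    of this sketch, not something Hadoop provides).
    """
    files = list(local_files)
    commands = [build_put_command(f, hdfs_dir) for f in files]
    # Each worker thread blocks on one `hadoop fs -put` child process,
    # so at most `workers` uploads are in flight at any time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        exit_codes = list(pool.map(run, commands))
    # pool.map preserves input order, so exit codes line up with files.
    return [f for f, code in zip(files, exit_codes) if code != 0]
```

Calling `put_files(paths, "/data/incoming")` with the list of files under
the NFS mount (both the list and the target directory are hypothetical
here) returns the files that need a retry. All of the data still flows
through the one client machine, so I would expect the gain to level off
once its network link or the NFS mount is saturated.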

If each node in the Hadoop cluster could access these files, then I could
write a MapReduce job to copy a number of files from the network into
HDFS. I could not find anything in the documentation saying that `distcp`
works with locally hosted files (its code in the tools package doesn't
show any sign of it either) - but I wouldn't expect it to.

In general, are there any other ways of copying a very large number of
client-local files to HDFS? I searched the mail archives for a similar
question and didn't come across one. I'm sorry if this is a duplicate.

Thanks for your time,
