hadoop-hdfs-user mailing list archives

From Alexander Alten-Lorenz <wget.n...@gmail.com>
Subject Re: Copying many files to HDFS
Date Fri, 13 Feb 2015 14:03:11 GMT
Kevin,

Slurper can help here:
https://github.com/alexholmes/hdfs-file-slurper
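
If you want to stay with the plain CLI approach you mentioned, something along these
lines (a rough, untested sketch; the paths and the parallelism factor of 8 are just
placeholders) will drive several `hadoop fs -put` processes in parallel from the client
machine:

  # Sketch only: copy every file under the NFS mount into HDFS,
  # running up to 8 puts at a time. Adjust the paths and -P as needed.
  find /mnt/nfs/data -type f -print0 \
    | xargs -0 -P 8 -I {} hadoop fs -put {} /user/kevin/data/

Each put starts its own client JVM, so for a thousand files the startup overhead adds
up, but it is usually tolerable at that scale.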

BR,
 Alexander 


> On 13 Feb 2015, at 14:28, Kevin <kevin.macksamie@gmail.com> wrote:
> 
> Hi,
> 
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or so files
> into HDFS, totalling roughly 1 TB. The cluster will be isolated on its own private LAN,
> with a single client machine connected to both the Hadoop cluster and the public
> network. The data that needs to be copied into HDFS is mounted via NFS on the client machine.
> 
> I can run `hadoop fs -put` concurrently on the client machine to try to increase the
> throughput.
> 
> If these files could be accessed by each node in the Hadoop cluster, then I could
> write a MapReduce job to copy a number of files from the network into HDFS. I could not find
> anything in the documentation saying that `distcp` works with locally hosted files (its code
> in the tools package doesn't show any sign of it either) - but I wouldn't expect it to.
> 
> In general, are there any other ways of copying a very large number of client-local files
> to HDFS? I searched the mail archives for a similar question and didn't come across one.
> I'm sorry if this is a duplicate question.
> 
> Thanks for your time,
> Kevin

