hadoop-mapreduce-user mailing list archives

From johny casanova <pcgamer2...@outlook.com>
Subject RE: Copying many files to HDFS
Date Fri, 13 Feb 2015 15:38:40 GMT

 Hi Kevin,

 

You can try Apache NiFi (https://nifi.incubator.apache.org/). It is a new application that is still in incubation, but it is an awesome tool for what you are looking for. It has processors that put data into and get data from HDFS continuously, without you having to use the put command. Check it out and let me know if you need help. I also use it to put data into HDFS, at high volumes like you mentioned.
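
The flow is basically just two processors wired together, something like this (the directories here are only examples, adjust them for your setup):

    GetFile
        Input Directory: /mnt/nfs/data          (your NFS mount)
        Keep Source File: true
          |
          v
    PutHDFS
        Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml
        Directory: /ingest
        Conflict Resolution Strategy: replace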



Date: Fri, 13 Feb 2015 09:25:35 -0500
Subject: Re: Copying many files to HDFS
From: kevin.macksamie@gmail.com
To: user@hadoop.apache.org


Ahmed,
Flume is a great tool but it doesn't cover my use case. I need to copy the files in their
entirety and keep their file names.




Alexander,
Thanks for sharing Slurper. From the code, it looks like a reasonable multi-threaded application for copying files. I'll keep looking at it.


On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <wget.null@gmail.com> wrote:



Kevin,


Slurper can help here:
https://github.com/alexholmes/hdfs-file-slurper


BR,
 Alexander 







On 13 Feb 2015, at 14:28, Kevin <kevin.macksamie@gmail.com> wrote:


Hi,


I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or so files into HDFS, totaling roughly 1 TB. The cluster will be isolated on its own private LAN, with a single client machine connected to both the Hadoop cluster and the public network. The data that needs to be copied into HDFS is mounted on the client machine via NFS.


I can run `hadoop fs -put` concurrently on the client machine to try to increase throughput.
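
To be concrete, what I mean is a multi-threaded copy along these lines (just a sketch; the source mount, target directory, and thread count are placeholders):

    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml from the classpath
            File srcDir = new File("/mnt/nfs/data");  // placeholder: the NFS mount
            Path dstDir = new Path("/ingest");        // placeholder: HDFS target directory
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (File f : srcDir.listFiles()) {
                pool.submit(() -> {
                    try {
                        FileSystem hdfs = FileSystem.get(conf); // cached, shared client
                        // don't delete the source, do overwrite, keep the file name
                        hdfs.copyFromLocalFile(false, true,
                                new Path(f.getAbsolutePath()),
                                new Path(dstDir, f.getName()));
                    } catch (IOException e) {
                        System.err.println("failed: " + f + ": " + e);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }

But that still funnels everything through the one client machine, which is the bottleneck I'm worried about.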


If these files could be accessed by each node in the Hadoop cluster, then I could write a MapReduce job to copy a number of files from the network into HDFS. I could not find anything in the documentation saying that `distcp` works with locally hosted files (its code in the tools package doesn't show any sign of it either), but I wouldn't expect it to.
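
The mapper I have in mind would be trivial (untested sketch; it assumes the NFS export is mounted at the same path on every node, and /ingest is a made-up target directory). The job would be map-only, with an input file in HDFS listing one source path per line, split across mappers with something like NLineInputFormat:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CopyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private FileSystem hdfs;

        @Override
        protected void setup(Context context) throws IOException {
            hdfs = FileSystem.get(context.getConfiguration());
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Path src = new Path("file://" + value.toString()); // file on the shared mount
            Path dst = new Path("/ingest", src.getName());     // keeps the original name
            hdfs.copyFromLocalFile(false, true, src, dst);
            context.write(value, NullWritable.get());
        }
    }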


In general, are there any other ways of copying a very large number of client-local files to HDFS? I searched the mail archives for a similar question and didn't come across one. I'm sorry if this is a duplicate question.



Thanks for your time,
Kevin
