hadoop-user mailing list archives

From Kevin <kevin.macksa...@gmail.com>
Subject Re: Copying many files to HDFS
Date Fri, 13 Feb 2015 14:25:35 GMT
Flume is a great tool but it doesn't cover my use case. I need to copy the
files in their entirety and keep their file names.

Thanks for sharing Slurper. From the code, it looks like a reasonable
multi-threaded application for copying files. I'll keep looking at it.
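For reference, the concurrent `hadoop fs -put` approach from my original message below can be sketched roughly like this. It is only a sketch: the paths are placeholders, and it assumes GNU xargs and the hadoop CLI on the client machine.

```shell
#!/bin/sh
# parallel_put SRC DST: copy every regular file under SRC into the HDFS
# directory DST, fanning out across 8 concurrent `hadoop fs -put` processes.
# PUT_CMD can be overridden (e.g. PUT_CMD=echo) for a dry run; by default it
# assumes the hadoop CLI is on PATH. File names are preserved because -put
# keeps the source basename when the destination is a directory.
parallel_put() {
  src=$1
  dst=$2
  # -print0/-0 keeps file names with spaces intact; -P 8 runs 8 workers.
  # ${PUT_CMD:-hadoop fs -put} is deliberately unquoted so the default
  # splits into the three words "hadoop", "fs", "-put".
  find "$src" -type f -print0 |
    xargs -0 -P 8 -I{} ${PUT_CMD:-hadoop fs -put} {} "$dst"
}

# Example invocation (hypothetical paths):
#   parallel_put /mnt/nfs/data /data
```

Eight workers is an arbitrary starting point; the useful degree of parallelism will be bounded by the NFS read throughput and the client's network link.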

On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <wget.null@gmail.com> wrote:

> Kevin,
> Slurper can help here:
> https://github.com/alexholmes/hdfs-file-slurper
> BR,
>  Alexander
> On 13 Feb 2015, at 14:28, Kevin <kevin.macksamie@gmail.com> wrote:
> Hi,
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, totaling roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to both the Hadoop cluster and the public network. The data
> that needs to be copied into HDFS is mounted on the client machine over
> NFS.
> I can run `hadoop fs -put` concurrently on the client machine to try to
> increase the throughput.
> If these files were accessible from each node in the Hadoop cluster,
> then I could write a MapReduce job to copy a number of files from the
> network into HDFS. I could not find anything in the documentation saying
> that `distcp` works with locally hosted files (its code in the tools
> package doesn't show any sign of it either) - but I wouldn't expect it to.
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I searched the mail archives for a similar
> question and didn't come across one. I'm sorry if this is a duplicate
> question.
> Thanks for your time,
> Kevin
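A note on the distcp question above: distcp does accept file:// source URIs, but its copy maps run on the cluster nodes, so the local path would have to be mounted at the same location on every node. That rules it out when the NFS share is visible only to the client machine, as in this setup. A rough sketch with hypothetical endpoints, for completeness:

```shell
#!/bin/sh
# distcp_from_local SRC DST: sketch of pulling a local (or NFS-mounted) path
# into HDFS with distcp. Assumption: SRC is mounted identically on every
# cluster node, which the file:// scheme requires since map tasks run
# cluster-side. DISTCP_CMD can be overridden (e.g. DISTCP_CMD=echo) for a
# dry run; by default it assumes the hadoop CLI is on PATH.
distcp_from_local() {
  src=$1
  dst=$2
  ${DISTCP_CMD:-hadoop distcp} "file://$src" "$dst"
}

# Example invocation (hypothetical namenode address and paths):
#   distcp_from_local /mnt/nfs/data hdfs://namenode:8020/data
```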
