hadoop-hdfs-user mailing list archives

From Kevin <kevin.macksa...@gmail.com>
Subject Re: Copying many files to HDFS
Date Mon, 16 Feb 2015 14:10:40 GMT
Johny, NiFi looks interesting but I can't really grasp how it will help me.
If you could provide some example code or a more detailed explanation of how
you set up a topology, that would be great.
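For anyone reading this thread in the archive: the flow johny describes below presumably maps onto two processors from NiFi's standard bundle, GetFile reading from the NFS mount and PutHDFS writing into the cluster. This is only a sketch; the directory paths and property values below are hypothetical placeholders, not tested configuration:

```
GetFile
  Input Directory:  /mnt/nfs/data      (client-local NFS mount; hypothetical path)
  Keep Source File: true               (leave the originals in place)

PutHDFS
  Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
  Directory: /ingest                   (target HDFS directory; hypothetical path)
```

GetFile should carry the original file name through on each flowfile's `filename` attribute, which PutHDFS uses when writing, so the names-preserved requirement mentioned later in the thread would be met.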

On Fri, Feb 13, 2015 at 10:38 AM, johny casanova <pcgamer2426@outlook.com>
wrote:
>  Hi Kevin,
> You can try Apache NiFi (https://nifi.incubator.apache.org/). It is a new
> application that is still in incubation but an awesome tool for what
> you are looking for. It has processors that put data into and get data out
> of HDFS continuously, without having to use the put command. Check it out
> and let me know if you need help. I also use it to put high volumes into
> HDFS like you mentioned.
>  ------------------------------
> Date: Fri, 13 Feb 2015 09:25:35 -0500
> Subject: Re: Copying many files to HDFS
> From: kevin.macksamie@gmail.com
> To: user@hadoop.apache.org
> Ahmed,
> Flume is a great tool but it doesn't cover my use case. I need to copy the
> files in their entirety and keep their file names.
> Alexander,
> Thanks for sharing Slurper. From the code it looks like a reasonable
> multi-threaded application to copy files. I'll keep looking at it.
> On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz
> <wget.null@gmail.com> wrote:
>  Kevin,
> Slurper can help here:
> https://github.com/alexholmes/hdfs-file-slurper
> BR,
>  Alexander
>  On 13 Feb 2015, at 14:28, Kevin <kevin.macksamie@gmail.com> wrote:
>  Hi,
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand
> or so files into HDFS, which totals roughly 1 TB. The cluster will be
> isolated on its own private LAN with a single client machine that is
> connected to the Hadoop cluster as well as the public network. The data
> that needs to be copied into HDFS is mounted as an NFS on the client
> machine.
> I can run `hadoop fs -put` concurrently on the client machine to try and
> increase the throughput.
> If these files were able to be accessed by each node in the Hadoop
> cluster, then I could write a MapReduce job to copy a number of files from
> the network into HDFS. I could not find anything in the documentation
> saying that `distcp` works with locally hosted files (its code in the tools
> package doesn't show any sign of it either) - but I wouldn't expect it to.
> In general, are there any other ways of copying a very large number of
> client-local files to HDFS? I searched the mail archives for a similar
> question and didn't come across one. I'm sorry if this is a duplicate
> question.
> Thanks for your time,
> Kevin
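Kevin's concurrent `hadoop fs -put` idea above can be sketched with plain `xargs`. The paths are hypothetical, and `echo` makes this a dry run that prints the generated commands instead of executing them (drop the `echo` on a machine with a Hadoop client installed):

```shell
# Fan `hadoop fs -put` out across 4 parallel workers with xargs -P.
# Create a stand-in source directory with two dummy files for the dry run.
SRC=$(mktemp -d)
touch "$SRC/part-0001.dat" "$SRC/part-0002.dat"

# One `hadoop fs -put` invocation per file, up to 4 running at once.
ls "$SRC" | xargs -n 1 -P 4 -I{} echo hadoop fs -put "$SRC/{}" /ingest/
```

Whether more parallel streams actually help depends on where the bottleneck is: with the NFS mount on a single client machine, the client's network link will likely saturate long before the cluster does.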
