hadoop-mapreduce-user mailing list archives

From Ahmed Ossama <ah...@aossama.com>
Subject Re: Copying many files to HDFS
Date Fri, 13 Feb 2015 13:36:38 GMT
Hi Kevin,

Have a look at Apache Flume. It is built for collecting and moving large amounts of data into stores such as HDFS.


On 02/13/2015 03:28 PM, Kevin wrote:
> Hi,
> I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a 
> thousand or so files into HDFS, which totals roughly 1 TB. The cluster 
> will be isolated on its own private LAN with a single client machine 
> that is connected to the Hadoop cluster as well as the public network. 
> The data that needs to be copied into HDFS is on an NFS share mounted 
> on the client machine.
> I can run `hadoop fs -put` concurrently on the client machine to try 
> and increase the throughput.
> If each node in the Hadoop cluster could access these files, then I 
> could write a MapReduce job to copy a number of files from the network 
> into HDFS. I could not find anything in the documentation saying that 
> `distcp` works with locally hosted files (its code in the tools 
> package doesn't show any sign of it either) - but I wouldn't expect 
> it to.
> In general, are there any other ways of copying a very large number of 
> client-local files to HDFS? I searched the mail archives for a similar 
> question and didn't come across one. I'm sorry if this is a 
> duplicate question.
> Thanks for your time,
> Kevin
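
For the concurrent `hadoop fs -put` idea mentioned in the quoted message, here is a minimal sketch of what that could look like from the client machine. It assumes Python is available there and that `hadoop` is on the PATH. The function names, the `workers` parameter and the injectable `run` argument are made up for illustration and are not part of any Hadoop tooling.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_put_command(local_file, hdfs_dir):
    # `hadoop fs -put <localsrc> <dst>` copies one local file into HDFS.
    return ["hadoop", "fs", "-put", str(local_file), hdfs_dir]


def put_all(local_dir, hdfs_dir, workers=4, run=subprocess.call):
    """Upload every regular file in local_dir, `workers` at a time.

    Returns the list of files whose upload exited with a non-zero code.
    `run` is injectable so the logic can be tried without a cluster.
    """
    files = sorted(p for p in Path(local_dir).iterdir() if p.is_file())
    commands = [build_put_command(f, hdfs_dir) for f in files]
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        exit_codes = list(pool.map(run, commands))
    return [f for f, code in zip(files, exit_codes) if code != 0]
```

A call such as `put_all("/mnt/nfs/data", "/user/kevin/data", workers=8)` (both paths are hypothetical) would run up to eight uploads at once and return the files that need a retry. Whether this raises throughput depends on where the bottleneck is. If the NFS link or the client's network interface is already saturated, more workers will not help.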

Ahmed Ossama
