hadoop-common-dev mailing list archives

From Rajiv Chittajallu <raj...@yahoo-inc.com>
Subject Re: Moving TB of data from NFS to HDFS
Date Wed, 25 Jan 2012 12:28:42 GMT
You will most likely hit NFS server limits well before you see any noticeable issues
with HDFS.

Writes to a single file are sequential, so the total throughput of your transfer depends on
the number of files being copied in parallel and the rate at which those files can be read
from NFS. If the total data set is split across a reasonable number of files, say 2G, the
upload rate can be matched to the NFS server limits.

On a small cluster, mounting the filesystem via NFS and using distcp with the input path given
as file:///<path> would work.
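
As a rough sketch of that distcp invocation (not something tested on a real cluster): the mount point /mnt/nfs/data, the destination hdfs://namenode:8020/data and the helper names are all made up for illustration. It assumes the NFS export is mounted at the same path on every node that runs a map task, and uses distcp's -m option to cap the number of simultaneous copies so the NFS server is not overloaded.

```python
import subprocess


def build_distcp_command(nfs_mount_dir, hdfs_dest, max_maps=10):
    """Return the argv for a distcp run that reads from a local (NFS) mount.

    nfs_mount_dir must be an absolute path, so that "file://" + path
    yields a file:///... URI.
    """
    source_uri = "file://" + nfs_mount_dir
    # -m limits how many copies (map tasks) run at the same time.
    return ["hadoop", "distcp", "-m", str(max_maps), source_uri, hdfs_dest]


def run_distcp(nfs_mount_dir, hdfs_dest, max_maps=10):
    """Run distcp and raise CalledProcessError if it exits non-zero."""
    cmd = build_distcp_command(nfs_mount_dir, hdfs_dest, max_maps)
    subprocess.run(cmd, check=True)
```

Calling run_distcp("/mnt/nfs/data", "hdfs://namenode:8020/data", max_maps=8) would then start the copy; max_maps is the knob to tune against what the NFS server can sustain.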

Another option is making your files available via HTTP and running a simple streaming job to
parallelize the data pull.
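
A minimal sketch of what such a streaming mapper might look like, assuming the job input is a text file with one URL per line and the job runs with no reducers. The host name in the docstring, the HDFS directory and the function names are hypothetical. Each file is piped straight into "hadoop fs -put -", which reads the file contents from stdin, so nothing is staged on local disk.

```python
import posixpath
import shutil
import subprocess
import sys
import urllib.request
from urllib.parse import urlparse


def hdfs_target(url, hdfs_dir):
    """Map http://fileserver.example.com/export/part-0001 to <hdfs_dir>/part-0001."""
    name = posixpath.basename(urlparse(url).path)
    return posixpath.join(hdfs_dir, name)


def pull_one(url, hdfs_dir):
    """Stream one URL into HDFS and return the HDFS path that was written."""
    target = hdfs_target(url, hdfs_dir)
    put = subprocess.Popen(["hadoop", "fs", "-put", "-", target],
                           stdin=subprocess.PIPE)
    with urllib.request.urlopen(url) as response:
        shutil.copyfileobj(response, put.stdin)
    put.stdin.close()
    if put.wait() != 0:
        raise RuntimeError("put failed for " + url)
    return target


def run_mapper(hdfs_dir, lines=sys.stdin):
    """Mapper loop: one URL per input line, emits "url<TAB>hdfs path"."""
    for line in lines:
        # Some input formats prepend a key and a tab; keep only the last field.
        url = line.strip().split("\t")[-1]
        if url:
            print(url + "\t" + pull_one(url, hdfs_dir))
```

The mapper script would end with a call such as run_mapper("/data/incoming"). The number of map tasks then decides how many pulls hit the HTTP server at once.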

It basically comes down to how you want to initiate the parallel copies.
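
For the simplest variant, running several "hadoop fs -put" commands side by side from a machine that has the NFS mount, something like the sketch below could work. The worker count, paths and function names are assumptions for illustration; the pool is kept small on purpose so that neither the NFS server nor the namenode is swamped. The run parameter only exists so the logic can be exercised without a cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def put_file(local_path, hdfs_dir, run=subprocess.run):
    """Copy one file into hdfs_dir; return (local_path, exit code)."""
    cmd = ["hadoop", "fs", "-put", local_path, hdfs_dir]
    return local_path, run(cmd).returncode


def parallel_put(local_paths, hdfs_dir, workers=4, run=subprocess.run):
    """Upload files with at most `workers` concurrent puts.

    Returns the list of local paths whose put exited non-zero,
    in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: put_file(p, hdfs_dir, run), local_paths)
        return [path for path, code in results if code != 0]
```

The returned list can be fed back into parallel_put to retry only the files that failed.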

-rajive

On Jan 25, 2012, at 1:19, Ajit Ratnaparkhi <ajit.ratnaparkhi@gmail.com> wrote:

> Hi raj,
> 
> If you have all the data on an NFS-mounted disk, meaning on a single machine,
> then your upload will be limited by network bandwidth. You can try running
> dfs -put in multiple parallel threads for distinct data sets; you might be
> able to utilise the network bandwidth to its maximum (take care not to have
> too many threads, otherwise the namenode handlers will be busy all the time,
> making dfs unresponsive). I don't see any other way to make it faster. Making
> data upload faster requires the data source to be present at distributed
> locations, which is not true in this case.
> 
> -Ajit
> 
> 
> On Wed, Jan 25, 2012 at 10:46 AM, Praveen Sripati
> <praveensripati@gmail.com> wrote:
> 
>>> If it is divided up into several files and you can mount your NFS
>>> directory on each of the datanodes.
>> 
>> Just curious, how will this help?
>> 
>> Praveen
>> 
>> On Wed, Jan 25, 2012 at 12:39 AM, Robert Evans <evans@yahoo-inc.com>
>> wrote:
>> 
>>> If it is divided up into several files and you can mount your NFS
>>> directory on each of the datanodes, you could possibly use distcp to do it.
>>> I have never tried using distcp for this, but it should work.  Or you can
>>> write your own streaming Map/Reduce script that does more or less the same
>>> thing as distcp and will take as input the list of files to copy, and will
>>> do a hadoop fs -put for each file having it come from NFS.
>>> 
>>> --Bobby Evans
>>> 
>>> On 1/24/12 12:49 AM, "rajmca2002" <rajmca2002@gmail.com> wrote:
>>> 
>>> 
>>> 
>>> Hi,
>>> 
>>> I have TBs of data on NFS that I need to move to HDFS. I have used the
>>> hadoop put command to do this, but it took hours to place the file in
>>> HDFS. Is there a good approach for moving large files to HDFS?
>>> 
>>> Please reply asap.
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Moving-TB-of-data-from-NFS-to-HDFS-tp33193061p33193061.html
>>> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>>> 
>>> 
>>> 
>> 
