hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: bulk data transfer to HDFS remotely (e.g. via wan)
Date Tue, 02 Mar 2010 21:00:06 GMT
Hey Michael,

We've developed a GridFTP server plugin that writes directly into Hadoop, so there's no intermediate
data staging required.  You can just use your favorite GridFTP client on the source machine
and transfer it directly into Hadoop.  Globus GridFTP can do checksums as it goes, but I haven't
tried it - it might not work with our plugin.  The GridFTP server does not need to co-exist
with any Hadoop processes - it just needs a network connection to the WAN and a network connection
to the LAN.
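For reference, a transfer from your source box with the standard Globus client would look roughly
like this (host, port, and paths here are placeholders; -p sets the number of parallel TCP streams
and -vb prints the transfer rate):

  globus-url-copy -vb -p 8 file:///data/bigfile.dat gsiftp://gridftp.example.org:2811/user/michael/bigfile.dat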

The GridFTP server is automatically installed with our yum packaging, along with our organization's
CA certs.  If this is a one-off transfer - or you don't already have the CA certificate/grid
infrastructure available in your organization - you might be better served by another solution.

The setup works well for us because (a) the other 40 sites use GridFTP as a common protocol,
(b) we have a long history with using GridFTP, and (c) we need to transfer many TB on a daily
basis.

Brian

On Mar 2, 2010, at 12:10 PM, jiang licht wrote:

> Hi Brian,
> 
> Thanks a lot for sharing your experience. Here I have some questions to bother you for
> more help :)
> 
> So, basically, that means data transfer in your case is a 2-step job: 1st, use gridftp to
> make a local copy of the data on the target, and 2nd, load the data into the target cluster with
> something like "hadoop fs -put". If this is correct, I am wondering if this will consume too much
> disk space on your target box (since the data is stored in a local file system prior to being
> distributed to the hadoop cluster). Also, do you do an integrity check for each file transferred
> (one straightforward method might be to do a 'cksum' or similar comparison, but is that doable
> in terms of efficiency)?
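> For instance, I guess the naive version of that check would be something like the following
> (paths made up), though I am not sure how well it would scale to many large files:
> 
>     cksum /local/data/file.dat                     # checksum of the source copy
>     hadoop fs -cat /user/data/file.dat | cksum     # checksum of the copy in HDFS, compare the two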
> 
> I am not familiar with gridftp except that I know it is a better choice than scp,
> sftp, etc. in that it can tune tcp settings and create parallel transfers. So, I want to know
> if it keeps a log of which files have been successfully transferred and which have not, and
> whether gridftp does a file integrity check. Right now, I only have one box for data storage (not in
> the hadoop cluster) and want to transfer that data to hadoop. Can I just install gridftp on this
> box and the name node box to enable gridftp transfers from the 1st to the 2nd?
> 
> Thanks,
> --
> 
> Michael
> 
> --- On Tue, 3/2/10, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> 
> From: Brian Bockelman <bbockelm@cse.unl.edu>
> Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
> To: common-user@hadoop.apache.org
> Date: Tuesday, March 2, 2010, 8:38 AM
> 
> Hey Michael,
> 
> distcp does a MapReduce job to transfer data between two clusters - but it might not
> be acceptable security-wise for your setup.
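> For a plain cluster-to-cluster copy, the invocation is something like this (namenode hosts and
> paths are placeholders):
> 
>     hadoop distcp hdfs://src-namenode:8020/data/logs hdfs://dst-namenode:8020/data/logs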
> 
> Locally, we use gridftp between two clusters (not necessarily Hadoop!) and a protocol
> called SRM to load-balance between gridftp servers.  GridFTP was selected because it is common
> in our field, and we already have the certificate infrastructure well set up.
> 
> GridFTP is fast too - many Gbps is not too hard.
> 
> YMMV
> 
> Brian
> 
> On Mar 2, 2010, at 1:30 AM, jiang licht wrote:
> 
>> I am considering a basic task of loading data into a hadoop cluster in this scenario: the
>> hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.
>>   
>> An example of this is moving data from amazon s3 to ec2, which is supported in the
>> latest hadoop by specifying s3(n)://authority/path in distcp.
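>> For instance, I believe something along these lines (bucket and paths made up, aws credentials
>> assumed to be set in the hadoop configuration) pulls data from s3 into HDFS as a MapReduce job:
>>   
>>     hadoop distcp s3n://my-bucket/input hdfs://namenode:8020/user/michael/input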
>>   
>> But generally speaking, what is the best way to load data into a hadoop cluster from
>> a remote box? Clearly, in this scenario, it is unreasonable to copy the data to the local name node
>> and then issue some command like "hadoop fs -copyFromLocal" to put it in the cluster (besides
>> this, the choice of data transfer tool is also a factor: scp or sftp, gridftp, ..., compression
>> and encryption, ...).
>>   
>> I am not aware of generic support for fetching data from a remote box (like that
>> for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data
>> to hadoop):
>>   
>> cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
>>   
>> There are pros (it is simple and will do the job without storing a local copy of each data
>> file and then running a command like 'hadoop fs -copyFromLocal') and cons (obviously it will need
>> many such pipelines running in parallel to speed up the job, at the cost of creating processes
>> on the remote machines to read data and maintaining ssh connections, so if the data files are small,
>> it is better to archive them into a tar file before calling 'cat'). As an alternative to using 'cat',
>> a program could be written to keep reading data files and feeding them into the pipe in parallel.
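>> For example, I imagine something like the following (file names and host made up) could run a few
>> pipelines at once, plus a tar variant for directories of small files:
>>   
>>     ls /data/*.dat | xargs -P4 -I{} sh -c 'cat {} | ssh hadoopbox "hadoop fs -put - /dst/$(basename {})"'   # 4 transfers at a time
>>     tar cf - /data/smallfiles | ssh hadoopbox 'hadoop fs -put - /dst/smallfiles.tar'   # bundle the small files into one stream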
>>   
>> Any comments about this or thoughts about a better solution?
>>   
>> Thanks,
>> --
>> Michael
>> 
>> 
> 
> 
> 
> 

