hadoop-common-user mailing list archives

From jiang licht <licht_ji...@yahoo.com>
Subject Re: bulk data transfer to HDFS remotely (e.g. via wan)
Date Tue, 02 Mar 2010 18:10:20 GMT
Hi Brian,

Thanks a lot for sharing your experience. I have a few more questions to bother you with, if
you don't mind :)

So, basically, data transfer in your case is a two-step job: first, use GridFTP to make a
local copy of the data on the target box; second, load the data into the target cluster with
something like "hadoop fs -put". If that is correct, I am wondering whether this consumes too
much disk space on your target box (since the data sits in a local file system before being
distributed to the Hadoop cluster). Also, do you do an integrity check on each file transferred
(one straightforward method would be a 'cksum' or similar comparison, but is that feasible in
terms of efficiency)?
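(The check I have in mind would be something like comparing a local cksum against a cksum of the
same file read back out of HDFS; the paths below are just placeholders:)

# checksum of the local copy
cksum /data/local/file.dat

# checksum of the same file read back from HDFS; the two outputs should match
hadoop fs -cat /user/hadoop/file.dat | cksum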

I am not familiar with GridFTP except that I know it is a better choice than scp, sftp, etc.
in that it can tune TCP settings and run parallel transfers. So I want to know: does it keep a
log of which files have been transferred successfully and which have not, and does GridFTP do a
file integrity check? Right now I have only one box for data storage (not in the Hadoop cluster)
and want to transfer that data to Hadoop. Can I just install GridFTP on this box and on the name
node box to enable GridFTP transfers from the first to the second?
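(If so, I guess a transfer from the storage box to the name node box would look roughly like the
following; the host name, paths, and tuning values are placeholders, and I have not verified the
options beyond what the globus-url-copy documentation describes:)

# push file.dat to a GridFTP server on the name node box, using 4 parallel streams
# and a larger TCP buffer
globus-url-copy -vb -p 4 -tcp-bs 8388608 \
    file:///data/bulk/file.dat \
    gsiftp://namenode.example.com:2811/scratch/file.dat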

Thanks,
--

Michael

--- On Tue, 3/2/10, Brian Bockelman <bbockelm@cse.unl.edu> wrote:

From: Brian Bockelman <bbockelm@cse.unl.edu>
Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
To: common-user@hadoop.apache.org
Date: Tuesday, March 2, 2010, 8:38 AM

Hey Michael,

distcp does a MapReduce job to transfer data between two clusters - but it might not be acceptable
security-wise for your setup.
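(A typical invocation, run on the destination cluster, reads the source over HFTP and writes into
the local HDFS; the host names, ports, and paths below are only placeholders:)

# copy /data/logs from the source cluster into the destination cluster's HDFS
hadoop distcp hftp://source-namenode:50070/data/logs hdfs://dest-namenode:8020/data/logs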

Locally, we use GridFTP between two clusters (not necessarily Hadoop!) and a protocol called
SRM to load-balance between GridFTP servers.  GridFTP was selected because it is common in
our field, and we already have the certificate infrastructure well set up.

GridFTP is fast too - many Gbps is not too hard.

YMMV

Brian

On Mar 2, 2010, at 1:30 AM, jiang licht wrote:

> I am considering a basic task of loading data into a Hadoop cluster in this scenario: the
> Hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.
>  
> An example of this is moving data from Amazon S3 to EC2, which is supported in the latest
> Hadoop by specifying s3(n)://authority/path in distcp.
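> (Something along these lines, I believe; the bucket, credentials, and paths are placeholders:)
>
> # pull data from S3 (native format) into HDFS using distcp
> hadoop distcp s3n://ACCESS_KEY:SECRET_KEY@mybucket/input hdfs://namenode:8020/user/hadoop/input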
>  
> But generally speaking, what is the best way to load data into a Hadoop cluster from a remote
> box? Clearly, in this scenario, it is unreasonable to copy the data to the local name node and
> then issue a command like "hadoop fs -copyFromLocal" to put the data into the cluster (besides
> this, the choice of data transfer tool is also a factor: scp or sftp, GridFTP, ..., compression
> and encryption, ...).
>  
> I am not aware of generic support for fetching data from a remote box (like that for s3 or
> s3n), so I am thinking about the following solution (run on the remote boxes to push data to
> Hadoop):
>  
> cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
>  
> There are pros (it is simple and does the job without storing a local copy of each data file
> and then running a command like 'hadoop fs -copyFromLocal') and cons (it obviously needs many
> such pipelines running in parallel to speed up the job, at the cost of creating processes on
> the remote machines to read data and maintaining ssh connections; so if the data files are
> small, it is better to archive them into a tar file before calling 'cat'). As an alternative
> to 'cat', a program could be written to keep reading data files and feed them to stdin in
> parallel.
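> (A rough sketch of what I mean; 'hadoopbox' and the paths are placeholders:)
>
> # bundle many small files into a single stream instead of one pipeline per file
> tar cf - smallfiles/ | ssh hadoopbox 'hadoop fs -put - /data/smallfiles.tar'
>
> # or run a handful of such pipelines in parallel for large files
> for f in /data/part-*.dat; do
>   cat "$f" | ssh hadoopbox "hadoop fs -put - /data/$(basename "$f")" &
> done
> wait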
>  
> Any comments about this or thoughts about a better solution?
>  
> Thanks,
> --
> Michael
> 
> 




      