hadoop-common-user mailing list archives

From jiang licht <licht_ji...@yahoo.com>
Subject Re: bulk data transfer to HDFS remotely (e.g. via wan)
Date Tue, 02 Mar 2010 21:51:46 GMT
Thanks, Brian.

There is no certificate/grid infrastructure for us as of now. But I guess I can still use GridFTP, judging from the following note on its FAQ page: GridFTP can be run in a mode using standard SSH security credentials. It can also be run in anonymous mode and with username/password authentication.

I am wondering how GridFTP can be used in a generic scenario: transfer bulk data from a box (not in the hadoop cluster) to a remote hadoop cluster at a regular interval (maybe hourly or every couple of minutes). So, I guess I can install the GridFTP server on the hadoop name node and the GridFTP client on the remote data box. But to bypass the intermediate step of keeping a local copy on the hadoop name node, I need something like the plugin you mentioned. Is that correct?

Since I don't have the plugin you have, I found a helpful article here that might address the problem:

http://osg-test2.unl.edu/documentation/hadoop/gridftp-hdfs

It seems to me that it can write data directly into hadoop (although I don't know exactly how). But I am not sure how to direct the GridFTP client to write data into hadoop, something like "globus-url-copy localurl hdfs://hadoopnamenode/pathinhdfs"? Or perhaps there is some mapping on the GridFTP server side that relays the data into hadoop.
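
I imagine the client invocation would look roughly like the following, assuming the gridftp-hdfs plugin maps the server's path namespace onto HDFS and the SSH-credentials mode is used (the host name, paths and stream count below are just guesses on my part):

  # push a local file into HDFS through a gridftp-hdfs server (hypothetical host/paths)
  # -p 4 = four parallel streams, -vb = report transfer performance
  globus-url-copy -p 4 -vb \
      file:///data/local/datafile \
      sshftp://hadoopnamenode/user/ingest/datafile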

I think this is interesting if it works. Basically, this is a "push" mode.

Even better would be a "pull mode": I still want something built into hadoop (so it runs in map/reduce)
that acts like "hadoop distcp s3://123:456@nutch/ hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998"
or "hadoop distcp -f filelistA hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998"
and filelistA looks like
s3://123:456@nutch/file1
s3://123:456@nutch/fileN

So, just as with local files, we might have something like "hadoop distcp file://remotehost/path
hdfs://namenode/path" or "hadoop distcp -f filelistB hdfs://hostname/path", where filelistB looks like

file://remotehost/path1/file1
file://remotehost/path2/fileN

(file:// normally refers to the local file system, but here it would point to a remote file system; or replace it with something like remote://). Some middleware would then sit on the remote host and the name node to exchange the data, provided they agree on protocols (ports, etc.), perhaps GridFTP in this case.
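
Until something like that exists, a crude pull could be scripted around the same pipe I mentioned in my first message: run on a box that has the hadoop client, read the file list, and stream each remote file over ssh straight into HDFS with no local staging (the host and target directory below are made up, and filelistB here would hold plain remote paths rather than URLs):

  # stream each file listed in filelistB from remotehost directly into HDFS
  while read -r f; do
    ssh remotehost "cat '$f'" | hadoop fs -put - "/user/ingest/$(basename "$f")"
  done < filelistB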

If security is an issue, data can be gpg encrypted before doing a "distcp".
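
For example, something along these lines (the file name and key are placeholders; this assumes the receiving side holds the matching private key):

  gpg --encrypt --recipient ops@example.org datafile   # produces datafile.gpg before the transfer
  gpg --decrypt datafile.gpg > datafile                # run on the receiving side afterwards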

Thanks,
--

Michael

--- On Tue, 3/2/10, Brian Bockelman <bbockelm@cse.unl.edu> wrote:

From: Brian Bockelman <bbockelm@cse.unl.edu>
Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
To: common-user@hadoop.apache.org
Date: Tuesday, March 2, 2010, 3:00 PM

Hey Michael,

We've developed a GridFTP server plugin that writes directly into Hadoop, so there's no intermediate
data staging required.  You can just use your favorite GridFTP client on the source machine
and transfer it directly into Hadoop.  Globus GridFTP can do checksums as it goes, but I
haven't tried it - it might not work with our plugin.  The GridFTP server does not need to
co-exist with any Hadoop processes - it just needs a network connection to the WAN and a network
connection to the LAN.

The GridFTP server is automatically installed with our yum packaging, along with our organization's CA certs.  If this is a one-off transfer - or you don't already have the CA certificate/grid infrastructure available in your organization - you might be better served by another solution.

The setup works well for us because (a) the other 40 sites use GridFTP as a common protocol,
(b) we have a long history with using GridFTP, and (c) we need to transfer many TB on a daily
basis.

Brian

On Mar 2, 2010, at 12:10 PM, jiang licht wrote:

> Hi Brian,
> 
> Thanks a lot for sharing your experience. Here I have some questions to bother you for
more help :)
> 
> So, basically this means that data transfer in your case is a 2-step job: first, use gridftp to make a local copy of the data on the target; second, load the data into the target cluster with something like "hadoop fs -put". If this is correct, I am wondering if this will consume too much disk space on your target box (since the data sits in a local file system before being distributed into the hadoop cluster). Also, do you do an integrity check for each file transferred (one straightforward method might be a 'cksum' or similar comparison, but is that doable in terms of efficiency)?
> 
> I am not familiar with gridftp except that I know it is a better choice than scp, sftp, etc. in that it can tune tcp settings and run parallel transfers. So, I want to know: does it keep a log of which files have been successfully transferred and which have not, and does gridftp do a file integrity check? Right now, I only have one box for data storage (not in the hadoop cluster) and want to transfer that data to hadoop. Can I just install gridftp on this box and on the name node box to enable gridftp transfers from the first to the second?
> 
> Thanks,
> --
> 
> Michael
> 
> --- On Tue, 3/2/10, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> 
> From: Brian Bockelman <bbockelm@cse.unl.edu>
> Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
> To: common-user@hadoop.apache.org
> Date: Tuesday, March 2, 2010, 8:38 AM
> 
> Hey Michael,
> 
> distcp does a MapReduce job to transfer data between two clusters - but it might not
be acceptable security-wise for your setup.
> 
> Locally, we use gridftp between two clusters (not necessarily Hadoop!) and a protocol called SRM to load-balance between gridftp servers.  GridFTP was selected because it is common in our field, and we already have the certificate infrastructure well set up.
> 
> GridFTP is fast too - many Gbps is not too hard.
> 
> YMMV
> 
> Brian
> 
> On Mar 2, 2010, at 1:30 AM, jiang licht wrote:
> 
>> I am considering the basic task of loading data into a hadoop cluster in this scenario: the hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.
>>   
>> An example of this is moving data from amazon s3 to ec2, which is supported in the latest hadoop by specifying s3(n)://authority/path in distcp.
>>   
>> But generally speaking, what is the best way to load data into a hadoop cluster from a remote box? Clearly, in this scenario, it is unreasonable to copy the data to the local name node and then issue a command like "hadoop fs -copyFromLocal" to put the data into the cluster (besides this, the choice of data transfer tool is also a factor: scp or sftp, gridftp, ..., compression and encryption, ...).
>>   
>> I am not aware of generic support for fetching data from a remote box (like that for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data to hadoop):
>>   
>> cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
>>   
>> There are pros (it is simple and does the job without storing a local copy of each data file and then running a command like 'hadoop fs -copyFromLocal') and cons (it obviously needs many such pipelines running in parallel to speed up the job, at the cost of creating processes on the remote machines to read data and maintain ssh connections; so if the data files are small, it is better to archive them into a tar file before calling 'cat'). As an alternative to using 'cat', a program can be written to keep reading data files and feed them to 'hadoop fs -put' via its stdin, in parallel.
>>   
>> Any comments about this or thoughts about a better solution?
>>   
>> Thanks,
>> --
>> Michael
>> 
>> 
> 
> 
> 
> 




      