hadoop-common-user mailing list archives

From jiang licht <licht_ji...@yahoo.com>
Subject Re: bulk data transfer to HDFS remotely (e.g. via wan)
Date Tue, 02 Mar 2010 08:00:43 GMT
Another thought: to speed up the partitioning and distribution of data in the Hadoop cluster, it seems
to me it would be better if the transfer task could run as a map/reduce job. Does that make sense?


--- On Tue, 3/2/10, jiang licht <licht_jiang@yahoo.com> wrote:

From: jiang licht <licht_jiang@yahoo.com>
Subject: bulk data transfer to HDFS remotely (e.g. via wan)
To: common-user@hadoop.apache.org
Date: Tuesday, March 2, 2010, 1:30 AM

I am considering the basic task of loading data into a Hadoop cluster in this scenario: the Hadoop cluster
and the bulk data reside on different boxes, connected via, e.g., LAN or WAN.
One example is moving data from Amazon S3 to EC2, which the latest
Hadoop supports by specifying s3(n)://authority/path in distcp.
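For illustration only, the distcp case might look like the sketch below. The function name s3_to_hdfs and its three arguments are hypothetical, and it assumes S3 credentials are already configured on the cluster.

```shell
#!/bin/sh
# Sketch: copy an S3 path into HDFS with distcp, which runs the copy as a
# map/reduce job. Bucket, path and destination are supplied by the caller.
# Assumption: S3 credentials are already set in the Hadoop configuration.

# s3_to_hdfs BUCKET PATH HDFS_DST  (hypothetical helper)
s3_to_hdfs() {
    hadoop distcp "s3n://$1/$2" "$3"
}

# Example call (names are made up):
#   s3_to_hdfs mybucket logs/2010 /data/logs
```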
But generally speaking, what is the best way to load data into a Hadoop cluster from a remote
box? Clearly, in this scenario, it is unreasonable to first copy the data to the local name node and then
issue a command like "hadoop fs -copyFromLocal" to put it into the cluster (besides this,
the choice of data transfer tool is also a factor: scp, sftp, gridftp, ..., compression and
encryption, ...).
I am not aware of generic support for fetching data from a remote box (like that for
s3 or s3n). I am thinking about the following solution (run on the remote boxes to push data to
the Hadoop cluster):
cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
The pros: it is simple and does the job without first storing a local copy of each data file
and then running a command like 'hadoop fs -copyFromLocal'. The cons: many such pipelines
will obviously need to run in parallel to speed up the job, at the cost of creating processes
on the remote machines to read data and maintain ssh connections; so if the data files are small,
it is better to archive them into a tar file before calling 'cat'. As an alternative to
'cat', a program could be written that keeps reading data files and writes them into the pipe in parallel.
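A minimal sketch of the parallel variant, under some assumptions: HADOOP_HOST, DST_DIR and MAX_JOBS are hypothetical variable names, file names contain no single quotes, and passwordless ssh to the Hadoop box is already set up. It runs at most MAX_JOBS pipelines at a time and includes a helper that streams a directory of small files as a single tar.

```shell
#!/bin/sh
# Sketch only: push local files into HDFS through parallel ssh pipelines.
# HADOOP_HOST, DST_DIR and MAX_JOBS are hypothetical names, not Hadoop settings.
HADOOP_HOST="${HADOOP_HOST:-hadoopbox}"
DST_DIR="${DST_DIR:-dst}"
MAX_JOBS="${MAX_JOBS:-4}"

# Build the remote command that stores stdin as DST_DIR/<basename of $1>.
# Assumption: the file name contains no single quotes.
remote_put_cmd() {
    printf "hadoop fs -put - '%s/%s'" "$DST_DIR" "$(basename "$1")"
}

# Stream one file over ssh into HDFS without keeping a copy on the Hadoop box.
push_one() {
    ssh "$HADOOP_HOST" "$(remote_put_cmd "$1")" < "$1"
}

# Push every file given as an argument, in batches of at most MAX_JOBS.
push_all() {
    running=0
    for f in "$@"; do
        push_one "$f" &
        running=$((running + 1))
        if [ "$running" -ge "$MAX_JOBS" ]; then
            wait            # let the current batch finish
            running=0
        fi
    done
    wait                    # wait for the final, possibly partial, batch
}

# Stream a directory of small files as one tar archive over one connection.
# Assumption: $1 is given without a trailing slash.
push_dir_as_tar() {
    tar -cf - -C "$1" . | ssh "$HADOOP_HOST" "$(remote_put_cmd "$1.tar")"
}

# Example calls (paths are made up):
#   push_all /data/big1.log /data/big2.log /data/big3.log
#   push_dir_as_tar /data/small_files
```

Batching with 'wait' keeps the script portable, though a slow file holds up its whole batch; a job queue would keep all slots busy.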
Any comments about this or thoughts about a better solution?

