hadoop-common-user mailing list archives

From Jameel Al-Aziz <jam...@6sense.com>
Subject Re: Unable to transfer data using distcp between EC2-Classic cluster and VPC cluster
Date Sat, 20 Sep 2014 20:11:34 GMT
Hi Ankit,

We originally tried copying to S3 and back; in fact, that's still our fallback plan. We
were having issues with the copy to S3 not preserving the directory layout, so we decided
to try a direct copy instead.

I'll give it another shot though!

Jameel Al-Aziz

From: Ankit Singhal <ankitsinghal59@gmail.com>
Sent: Sep 20, 2014 8:25 AM
To: user@hadoop.apache.org
Subject: Re: Unable to transfer data using distcp between EC2-Classic cluster and VPC cluster

Hi Jameel,

As Peyman said, the best approach is to distcp from your old cluster to S3 and have the MR
jobs on the new cluster read directly from S3.
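Roughly, that copy would look like the following (a sketch only; the bucket name, paths,
and s3n credentials are placeholders, and the keys can also go in core-site.xml instead):

hadoop distcp hdfs://old-namenode:8020/data s3n://AWS_KEY:AWS_SECRET@my-bucket/data

The new cluster's jobs can then read the s3n:// paths directly, or pull the data back down:

hadoop distcp s3n://AWS_KEY:AWS_SECRET@my-bucket/data hdfs://new-namenode:8020/data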

But if you still need to do an HDFS-to-HDFS distcp, then update /etc/hosts (or DNS) on all
the nodes of your old cluster with "publicIp   internalAWSDNSName" entries for all the nodes
of the new cluster. For example, /etc/hosts on every node of the old cluster should have an
entry for each node of the new cluster in the format below:
54.xxx.xxx.xx1   ip-10-xxx-xxx-xx1.ec2.internal
54.xxx.xxx.xx2   ip-10-xxx-xxx-xx2.ec2.internal
54.xxx.xxx.xx3   ip-10-xxx-xxx-xx3.ec2.internal
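With those entries in place, the copy itself would run on the old cluster along these lines
(a sketch; the hostnames, port, and paths are placeholders, and ip-10-xxx-xxx-xx1.ec2.internal
is assumed to be the new name node, now resolving to its public IP via the entries above):

hadoop distcp hdfs://old-namenode:8020/data \
              hdfs://ip-10-xxx-xxx-xx1.ec2.internal:8020/data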

Regards,
Ankit Singhal

On Sat, Sep 20, 2014 at 8:36 PM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
It may be easier to copy the data to S3 and then from S3 to the new cluster.

On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz <jameel@6sense.com> wrote:
Hi all,

We're in the process of migrating from EC2-Classic to VPC and needed to transfer our HDFS
data. We set up a new cluster inside the VPC and assigned the name node and data nodes
temporary public IPs. Initially, we had a lot of trouble getting the name node to redirect to
the public hostnames instead of the private IPs. After some fiddling around, we finally got
webhdfs and dfs -cp working with public hostnames. However, distcp simply refuses to use the
public hostnames when connecting to the data nodes.

We're running distcp on the old cluster, copying data into the new cluster.

The old Hadoop cluster is running 1.0.4 and the new one is running 1.2.1.

So far, on the new cluster, we've tried:
- Using public DNS hostnames in the master and slaves files (on both the name node and data
nodes)
- Setting the hostname of all the boxes to their public DNS name
- Setting "fs.default.name<http://fs.default.name>" to the public DNS name of the new
name node.

And on both clusters:
- Setting the "dfs.datanode.use.datanode.hostname" and "dfs.client.use.datanode.hostname"
to "true" on both the old and new cluster.

Even though webhdfs is finally redirecting to the data nodes using their public hostnames, we
keep seeing errors when running distcp. The errors are all similar to: http://pastebin.com/ZYR07Fvm

What do we need to do to get distcp to use the public hostname of the new machines? I haven't
tried running distcp in the other direction (I'm about to), but I suspect I'll run into the
same problem.

Thanks!
Jameel


