hadoop-common-user mailing list archives

From Joydeep Sen Sarma <jssa...@facebook.com>
Subject RE: public IP for datanode on EC2
Date Thu, 14 May 2009 04:37:33 GMT
Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line
operations work just fine.
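(To be concrete - with fs.default.name on my workstation pointed at the master's public hostname, plain FsShell stuff now works, e.g. something like:

	hadoop dfs -ls /
	hadoop dfs -copyFromLocal foo.txt /tmp/foo.txt

- the file names above are just examples.)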

However - I am hitting a new error during job submission (running hadoop-0.19.0):

2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission
failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal
	at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
	at org.apache.hadoop.ipc.Client.call(Client.java:686)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
	at $Proxy0.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
	at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
	at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
	at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)


Looking at the stack trace and the code - it seems this is happening because the jobclient
asks the jobtracker for the mapred system directory, and the jobtracker replies with a path
name that's qualified against its own fs.default.name setting. Unfortunately the standard
EC2 scripts assign this to the internal hostname of the hadoop master.

Is there any downside to using public hostnames instead of the private ones in the ec2 starter
scripts?
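Concretely, I'm picturing the generated master config carrying the public DNS name instead of the internal one - something like the sketch below (the hostname and ports are placeholders, not what the scripts actually emit):

	<property>
	  <name>fs.default.name</name>
	  <value>hdfs://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:50001/</value>
	</property>
	<property>
	  <name>mapred.job.tracker</name>
	  <value>ec2-xx-xx-xx-xx.compute-1.amazonaws.com:50002</value>
	</property>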

Thanks for the help,

Joydeep


-----Original Message-----
From: Philip Zeyliger [mailto:philip@cloudera.com] 
Sent: Wednesday, May 13, 2009 2:40 PM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <jssarma@facebook.com> wrote:
> (raking up real old thread)
>
> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2
> from outside ec2 is not possible.
>
> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if
> datanode ports are authorized in ec2 and we set the public hostname via slave.host.name -
> the namenode uses the internal IP address of the datanodes for block locations. DFS clients
> outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>
> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses
> (and use datanode host names as pre-985)?
>
> I really like the idea of being able to use an external node (my personal workstation)
> to do job submission (which typically requires interacting with HDFS in order to push files
> into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all
> the custom software is on the external node). Without the above option - this is not possible
> currently.

You could use SSH to set up a SOCKS proxy between your machine and
EC2, and set org.apache.hadoop.net.SocksSocketFactory as the
socket factory.
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
has more information.
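Roughly (gateway hostname and local port below are placeholders; the property names are the ones that post describes): open a dynamic tunnel with

	ssh -D 6666 <user>@<ec2-master-public-hostname>

and then point your client-side hadoop-site.xml at it:

	<property>
	  <name>hadoop.rpc.socket.factory.class.default</name>
	  <value>org.apache.hadoop.net.SocksSocketFactory</value>
	</property>
	<property>
	  <name>hadoop.socks.server</name>
	  <value>localhost:6666</value>
	</property>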

-- Philip
