hadoop-common-user mailing list archives

From jiang licht <licht_ji...@yahoo.com>
Subject How namenode/datanode and jobtracker/tasktracker negotiate connection?
Date Tue, 09 Mar 2010 10:17:13 GMT
What are the exact packets and steps used to establish a namenode/datanode connection and 
jobtracker/tasktracker connection?

I am asking this due to a weird problem related to starting datanodes and tasktrackers. 

In my case, the namenode box has two Ethernet interfaces bonded as bond0 with IP address
IP_A, and an IP alias IP_B configured on the local loopback interface as lo:1. All slave
boxes sit on the same network segment as IP_B.

The network is configured so that no slave box can reach the namenode box at IP_A, but the
namenode box can reach the slave boxes (its traffic can only be routed out through bond0).
So the slave boxes always use "hdfs://IP_B:50001" as "fs.default.name" in "core-site.xml"
and "IP_B:50002" for the job tracker in "mapred-site.xml" to reach the namenode box.
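
To double-check which value a slave box actually reads from its config file, a small
script along these lines could help. This is only a sketch: the sample XML is made up to
mirror the setup described above, and read_property is a hypothetical helper name, not
part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Made-up sample mirroring the slave-side core-site.xml described above.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://IP_B:50001</value>
  </property>
</configuration>"""


def read_property(xml_text, name):
    """Return the <value> of the named Hadoop property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


print(read_property(SAMPLE, "fs.default.name"))
```

On a real slave box the text would come from the actual core-site.xml (and mapred-site.xml
for "mapred.job.tracker") instead of SAMPLE.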

There are two cases, depending on how the namenode (or jobtracker) is configured on the namenode box.

Case #1: If I set "fs.default.name" to "hdfs://IP_B:50001", no slave box can join the cluster
as a data node because the request to IP_B:50001 fails. "telnet IP_B 50001" on the slave boxes
results in connection refused. So, on the namenode box, I ran "tcpdump -i bond0 tcp port 50001",
then did a "telnet IP_B 50001" from a slave box and watched the incoming and outgoing packets
on the namenode box.
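
The telnet test above can also be scripted, which makes it easier to tell "connection
refused" apart from a silent timeout. A refusal usually means something answered the SYN
with a reset at the TCP level, before any Hadoop-level exchange could start, while a
timeout suggests the packets were dropped. This is a minimal sketch using only the
standard socket module; probe is a hypothetical helper name.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # TCP handshake completed
    except ConnectionRefusedError:
        return "refused"         # the connect attempt was actively rejected
    except socket.timeout:
        return "timeout"         # no reply within the timeout
    except OSError as exc:
        return "error: %s" % exc  # e.g. no route to host, name lookup failure


# Example use from a slave box (IP_B stands for the real address):
# print(probe("IP_B", 50001))
```

Running it from a slave box against IP_B:50001 in both cases gives a simple result to
compare next to the tcpdump capture.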

Case #2: If I set "fs.default.name" to "hdfs://IP_A:50001", slave boxes can join the cluster
as data nodes. I again used tcpdump and telnet to watch the traffic, compared the two cases,
and found some differences. So I want to know whether there is a handshake stage in which the
namenode and datanode establish a connection, and which packets are exchanged in it, so that
I can tell whether the packets exchanged in case #1 are correct. That may reveal why the
connection request from data node to name node fails.

Also in Case #2, although all slave boxes can join the cluster as datanodes, no slave box
can start as a tasktracker. At startup the tasktracker uses IP_A:50001 to request a connection
to the namenode, and as mentioned above this cannot succeed (slaves are not allowed to reach
the namenode at IP_A, though the reverse direction is fine). What confuses me is that
"fs.default.name" is set to IP_B:50001 on all slave boxes, so how does the tasktracker end
up contacting the namenode at IP_A:50001?

A bit complicated. But any thoughts?
