hadoop-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: DataNode and TaskTracker communication
Date Mon, 13 Aug 2012 13:36:03 GMT
Based on your /etc/hosts output, why aren't you using DNS? 

Outside of MapR, multihomed machines can be problematic. Hadoop doesn't generally work well
when you're not using the FQDN or its alias. 

The issue isn't SSH. Go to the node that is having trouble connecting to another
node and try to ping that node, or try some other general communication. If that succeeds,
your issue is that the port you're trying to communicate with is blocked. Then it's more
than likely an IP configuration or firewall issue.
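
A minimal sketch of that port check in Python, assuming Python 3 is available on the node that is having trouble. The function name probe_port is made up for illustration; the host and port in the usage line are the ones from the DataNode log quoted below.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Attempt a TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <detail>".
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and similar.
        return "error: %s" % exc


if __name__ == "__main__":
    print(probe_port("its-cs131", 35554))
```

Run it from the node that logs the retries. A "refused" result usually means the host was reached but nothing is listening on that port (or a firewall actively rejects the connection), while a "timeout" more often points to packets being silently dropped on the way.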

On Aug 13, 2012, at 8:17 AM, Björn-Elmar Macek <ema@cs.uni-kassel.de> wrote:

> Hi Michael,
> 
> Well, I can ssh from any node to any other without being prompted. The reason for this is that my home directory is mounted on every server in the cluster.
> 
> Whether the machines are multihomed, I don't know. I could ask if this is important.
> 
> Shall I?
> 
> Regards,
> Elmar
> 
> On 13.08.12 at 14:59, Michael Segel wrote:
>> If the nodes can communicate and distribute data, then the odds are that the issue isn't going to be in his /etc/hosts.
>> 
>> A more relevant question is whether he's running a firewall on each of these machines.
>> 
>> A simple test... ssh to one node, ping other nodes and the control nodes at random to see if they can see one another. Then check to see if there is a firewall running which would limit the types of traffic between nodes.
>> 
>> One other side note... are these machines multi-homed?
>> 
>> On Aug 13, 2012, at 7:51 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>> 
>>> Hello there,
>>> 
>>>      Could you please share your /etc/hosts file, if you don't mind?
>>> 
>>> Regards,
>>>     Mohammad Tariq
>>> 
>>> 
>>> 
>>> On Mon, Aug 13, 2012 at 6:01 PM, Björn-Elmar Macek <macek@cs.uni-kassel.de> wrote:
>>> Hi,
>>> 
>>> I am currently trying to run my Hadoop program on a cluster. Sadly, my datanodes and tasktrackers seem to have difficulties communicating, as their logs show:
>>> * Some datanodes and tasktrackers seem to have port problems of some kind, as can be seen in the logs below. I wondered if this might be related to the localhost entry in /etc/hosts, as you can read in a lot of posts with similar errors, but I checked the file: neither localhost nor 127.0.0.1/127.0.1.1 is bound there. (Although you can ping localhost... the technician of the cluster said he'd look into the mechanism that resolves localhost.)
>>> * The other nodes cannot talk to the namenode and jobtracker (its-cs131), although it is not at all clear why this is happening: the "dfs -put" I do directly before the job runs fine, which seems to imply that communication between those servers is working flawlessly.
>>> 
>>> Is there any reason why this might happen?
>>> 
>>> 
>>> Regards,
>>> Elmar
>>> 
>>> LOGS BELOW:
>>> 
>>> \____Datanodes
>>> 
>>> After successfully putting the data into HDFS (at this point I thought the namenode and datanodes have to communicate), I get the following errors when starting the job:
>>> 
>>> There are 2 kinds of logs I found: the first one is big (about 12MB) and looks like this:
>>> ############################### LOG TYPE 1 ############################################################
>>> 2012-08-13 08:23:27,331 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 0 time(s).
>>> 2012-08-13 08:23:28,332 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 1 time(s).
>>> 2012-08-13 08:23:29,332 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 2 time(s).
>>> 2012-08-13 08:23:30,332 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 3 time(s).
>>> 2012-08-13 08:23:31,333 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 4 time(s).
>>> 2012-08-13 08:23:32,333 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 5 time(s).
>>> 2012-08-13 08:23:33,334 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 6 time(s).
>>> 2012-08-13 08:23:34,334 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 7 time(s).
>>> 2012-08-13 08:23:35,334 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 8 time(s).
>>> 2012-08-13 08:23:36,335 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 9 time(s).
>>> 2012-08-13 08:23:36,335 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
java.net.ConnectException: Call to its-cs131/141.51.205.41:35554 failed on connection exception:
java.net.ConnectException: Connection refused
>>>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>>     at $Proxy5.sendHeartbeat(Unknown Source)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:904)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1458)
>>>     at java.lang.Thread.run(Thread.java:619)
>>> Caused by: java.net.ConnectException: Connection refused
>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
>>>     at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
>>>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1046)
>>>     ... 5 more
>>> 
>>> ... (this continues until the end of the log)
>>> 
>>> The second kind is short:
>>> ########################### LOG TYPE 2 ############################################################
>>> 2012-08-13 00:59:19,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
STARTUP_MSG:
>>> /************************************************************
>>> STARTUP_MSG: Starting DataNode
>>> STARTUP_MSG:   host = its-cs133.its.uni-kassel.de/141.51.205.43
>>> STARTUP_MSG:   args = []
>>> STARTUP_MSG:   version = 1.0.2
>>> STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2
-r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012
>>> ************************************************************/
>>> 2012-08-13 00:59:19,203 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded
properties from hadoop-metrics2.properties
>>> 2012-08-13 00:59:19,216 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source MetricsSystem,sub=Stats registered.
>>> 2012-08-13 00:59:19,217 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Scheduled snapshot period at 10 second(s).
>>> 2012-08-13 00:59:19,218 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
DataNode metrics system started
>>> 2012-08-13 00:59:19,306 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source ugi registered.
>>> 2012-08-13 00:59:19,346 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded
the native-hadoop library
>>> 2012-08-13 00:59:20,482 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35554. Already tried 0 time(s).
>>> 2012-08-13 00:59:21,584 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage
directory /home/work/bmacek/hadoop/hdfs/slave is not formatted.
>>> 2012-08-13 00:59:21,584 INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting
...
>>> 2012-08-13 00:59:21,787 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Registered FSDatasetStatusMBean
>>> 2012-08-13 00:59:21,897 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
Shutting down all async disk service threads...
>>> 2012-08-13 00:59:21,897 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
All async disk service threads have been shut down.
>>> 2012-08-13 00:59:21,898 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
java.net.BindException: Problem binding to /0.0.0.0:50010 : Address already in use
>>>     at org.apache.hadoop.ipc.Server.bind(Server.java:227)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:404)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:299)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1582)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1521)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1539)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1665)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1682)
>>> Caused by: java.net.BindException: Address already in use
>>>     at sun.nio.ch.Net.bind(Native Method)
>>>     at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
>>>     at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
>>>     at org.apache.hadoop.ipc.Server.bind(Server.java:225)
>>>     ... 7 more
>>> 
>>> 2012-08-13 00:59:21,899 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
SHUTDOWN_MSG:
>>> /************************************************************
>>> SHUTDOWN_MSG: Shutting down DataNode at its-cs133.its.uni-kassel.de/141.51.205.43
>>> ************************************************************/
>>> 
>>> 
>>> 
>>> 
>>> 
>>> \_____TaskTracker
>>> With TaskTrackers it is the same: there are 2 kinds.
>>> ############################### LOG TYPE 1 ############################################################
>>> 2012-08-13 02:09:54,645 INFO org.apache.hadoop.mapred.TaskTracker: Resending
'status' to 'its-cs131' with reponseId '879
>>> 2012-08-13 02:09:55,646 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 0 time(s).
>>> 2012-08-13 02:09:56,646 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 1 time(s).
>>> 2012-08-13 02:09:57,647 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 2 time(s).
>>> 2012-08-13 02:09:58,647 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 3 time(s).
>>> 2012-08-13 02:09:59,648 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 4 time(s).
>>> 2012-08-13 02:10:00,648 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 5 time(s).
>>> 2012-08-13 02:10:01,649 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 6 time(s).
>>> 2012-08-13 02:10:02,649 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 7 time(s).
>>> 2012-08-13 02:10:03,650 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 8 time(s).
>>> 2012-08-13 02:10:04,650 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 9 time(s).
>>> 2012-08-13 02:10:04,651 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception:
java.net.ConnectException: Call to its-cs131/141.51.205.41:35555 failed on connection exception:
java.net.ConnectException: Connection refused
>>>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>>     at org.apache.hadoop.mapred.$Proxy5.heartbeat(Unknown Source)
>>>     at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1857)
>>>     at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1653)
>>>     at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2503)
>>>     at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3744)
>>> Caused by: java.net.ConnectException: Connection refused
>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
>>>     at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
>>>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1046)
>>>     ... 6 more
>>> 
>>> 
>>> ########################### LOG TYPE 2 ############################################################
>>> 2012-08-13 00:59:24,376 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG:
>>> /************************************************************
>>> STARTUP_MSG: Starting TaskTracker
>>> STARTUP_MSG:   host = its-cs133.its.uni-kassel.de/141.51.205.43
>>> STARTUP_MSG:   args = []
>>> STARTUP_MSG:   version = 1.0.2
>>> STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2
-r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012
>>> ************************************************************/
>>> 2012-08-13 00:59:24,569 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded
properties from hadoop-metrics2.properties
>>> 2012-08-13 00:59:24,626 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source MetricsSystem,sub=Stats registered.
>>> 2012-08-13 00:59:24,627 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Scheduled snapshot period at 10 second(s).
>>> 2012-08-13 00:59:24,627 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
TaskTracker metrics system started
>>> 2012-08-13 00:59:24,950 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source ugi registered.
>>> 2012-08-13 00:59:25,146 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log)
via org.mortbay.log.Slf4jLog
>>> 2012-08-13 00:59:25,206 INFO org.apache.hadoop.http.HttpServer: Added global
filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
>>> 2012-08-13 00:59:25,232 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing
logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
>>> 2012-08-13 00:59:25,237 INFO org.apache.hadoop.mapred.TaskTracker: Starting tasktracker
with owner as bmacek
>>> 2012-08-13 00:59:25,239 INFO org.apache.hadoop.mapred.TaskTracker: Good mapred
local directories are: /home/work/bmacek/hadoop/hdfs/tmp/mapred/local
>>> 2012-08-13 00:59:25,244 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded
the native-hadoop library
>>> 2012-08-13 00:59:25,255 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source jvm registered.
>>> 2012-08-13 00:59:25,256 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source TaskTrackerMetrics registered.
>>> 2012-08-13 00:59:25,279 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
>>> 2012-08-13 00:59:25,282 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source RpcDetailedActivityForPort54850 registered.
>>> 2012-08-13 00:59:25,282 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source RpcActivityForPort54850 registered.
>>> 2012-08-13 00:59:25,287 INFO org.apache.hadoop.ipc.Server: IPC Server Responder:
starting
>>> 2012-08-13 00:59:25,288 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 54850: starting
>>> 2012-08-13 00:59:25,288 INFO org.apache.hadoop.ipc.Server: IPC Server handler
0 on 54850: starting
>>> 2012-08-13 00:59:25,288 INFO org.apache.hadoop.ipc.Server: IPC Server handler
1 on 54850: starting
>>> 2012-08-13 00:59:25,289 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker
up at: localhost/127.0.0.1:54850
>>> 2012-08-13 00:59:25,289 INFO org.apache.hadoop.ipc.Server: IPC Server handler
3 on 54850: starting
>>> 2012-08-13 00:59:25,289 INFO org.apache.hadoop.ipc.Server: IPC Server handler
2 on 54850: starting
>>> 2012-08-13 00:59:25,289 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker
tracker_its-cs133.its.uni-kassel.de:localhost/127.0.0.1:54850
>>> 2012-08-13 00:59:26,321 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: its-cs131/141.51.205.41:35555. Already tried 0 time(s).
>>> 2012-08-13 00:59:38,104 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread:
Map-events fetcher for all reduce tasks on tracker_its-cs133.its.uni-kassel.de:localhost/127.0.0.1:54850
>>> 2012-08-13 00:59:38,120 INFO org.apache.hadoop.util.ProcessTree: setsid exited
with exit code 0
>>> 2012-08-13 00:59:38,134 INFO org.apache.hadoop.mapred.TaskTracker: Using ResourceCalculatorPlugin
: org.apache.hadoop.util.LinuxResourceCalculatorPlugin@445e228
>>> 2012-08-13 00:59:38,137 WARN org.apache.hadoop.mapred.TaskTracker: TaskTracker's
totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled.
>>> 2012-08-13 00:59:38,145 INFO org.apache.hadoop.mapred.IndexCache: IndexCache
created with max memory = 10485760
>>> 2012-08-13 00:59:38,158 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter:
MBean for source ShuffleServerMetrics registered.
>>> 2012-08-13 00:59:38,161 INFO org.apache.hadoop.http.HttpServer: Port returned
by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on
50060
>>> 2012-08-13 00:59:38,161 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start
task tracker because java.net.BindException: Address already in use
>>>     at sun.nio.ch.Net.bind(Native Method)
>>>     at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
>>>     at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
>>>     at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
>>>     at org.apache.hadoop.http.HttpServer.start(HttpServer.java:581)
>>>     at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1502)
>>>     at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3742)
>>> 
>>> 2012-08-13 00:59:38,163 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
>>> /************************************************************
>>> SHUTDOWN_MSG: Shutting down TaskTracker at its-cs133.its.uni-kassel.de/141.51.205.43
>>> ************************************************************/
>>> 
>> 
> 

