ambari-dev mailing list archives

From "Jonathan Hurley (JIRA)" <>
Subject [jira] [Updated] (AMBARI-10464) Ambari Agent holding socket open on 50070 prevents NN from starting
Date Tue, 14 Apr 2015 14:10:12 GMT


Jonathan Hurley updated AMBARI-10464:
    Attachment: AMBARI-10464.patch

> Ambari Agent holding socket open on 50070 prevents NN from starting
> -------------------------------------------------------------------
>                 Key: AMBARI-10464
>                 URL:
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.1.0
>         Attachments: AMBARI-10464.patch
> The Ambari Agent process appears to be listening on port 50070 and holding it open. This
is causing the NN to fail to start until the Ambari Agent is restarted. A netstat -natp reveals
that the agent process has this port open.
> {noformat}
> [root@hdp2-02-01 hdfs]# netstat -anp | grep 50070
> tcp 0 0 ESTABLISHED 1630/python2.6
> {noformat}
> After digging some more through sockets and Linux, I think it's entirely possible that
the agent could be assigned a source port that matches the destination port. Anything in the
ephemeral port range is up for grabs. Essentially what is happening here is that the NN is down
and when the agent tries to check it via a socket connection to 50070, the source (client)
side of the socket connection binds to 50070, since that port is free and within the range specified
by {{/proc/sys/net/ipv4/ip_local_port_range}}.
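To make the port-range argument concrete, here is a minimal sketch of how one might check whether a service port such as 50070 falls inside the range the kernel uses for client source ports. The helper names (`parse_port_range`, `is_ephemeral`, `read_local_port_range`) are hypothetical and are not part of the Ambari Agent; the sketch is written for Python 3 rather than the agent's Python 2.6.

```python
def parse_port_range(text):
    """Parse the whitespace-separated 'low high' pair stored in ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high


def is_ephemeral(port, port_range):
    """Return True if the kernel may hand out `port` as a client source port."""
    low, high = port_range
    return low <= port <= high


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the current range from procfs (Linux only)."""
    with open(path) as handle:
        return parse_port_range(handle.read())
```

With an assumed example range of `32768 61000` (a value chosen for illustration, not taken from the reporter's host), `is_ephemeral(50070, parse_port_range("32768 61000"))` is true, which is the condition that allows the collision described above.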
> The client essentially connects to itself. The WEB alert connection timeout is set to
10 seconds, which means that after 10 seconds it releases the connection automatically.
The METRIC alerts, however, use a slightly different mechanism for opening the socket and don't
specify a socket timeout. For a METRIC alert, when both the source and destination ports
are the same, it will connect and hold that connection for as long as {{socket._GLOBAL_DEFAULT_TIMEOUT}}
allows, which could be a very long time.
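The self-connection described here can be recognized after the fact, because the local and remote endpoints of the socket are identical. The following is only a sketch under that assumption, in Python 3; `is_self_connected` and `check_port` are hypothetical names and do not reflect how the agent's alerts are actually implemented.

```python
import socket


def is_self_connected(sock):
    """Return True if the socket's local endpoint equals its remote endpoint."""
    return sock.getsockname() == sock.getpeername()


def check_port(host, port, timeout=10.0):
    """Return True only if a peer other than this client accepted the connection.

    The explicit timeout bounds the connect call instead of relying on the
    global default, which means no timeout unless one has been set.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out: nothing usable is listening.
        return False
    try:
        return not is_self_connected(sock)
    finally:
        # Always release the port, even when the check reports a failure.
        sock.close()
```

Closing the socket in the `finally` clause is the important part for this issue: even if the client did connect to itself, the port is released as soon as the check finishes.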
> - I believe that we need to change the METRIC alert to pass a timeout value to the socket
(between 5 and 10 seconds, just like WEB alerts).
> - Since the Hadoop components seem to use ephemeral ports that the OS considers fair
game for any client, this will still end up being a problem. The proposed fix above makes
the agent release the socket after a while, which removes the need to restart
the agent after fixing the problem. But it's still possible that the agent could bind to that
port when making its check.
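As an illustration of the first proposal, the sketch below passes an explicit timeout when fetching a metric over HTTP. It is not the attached patch: the function name `fetch_metric`, the constant `ALERT_TIMEOUT_SECONDS`, and the use of Python 3's `urllib.request` are all assumptions made for the example, and the issue does not say which call the METRIC alerts actually use.

```python
import json
import urllib.request

# Assumed value, chosen from the 5 to 10 second range suggested above.
ALERT_TIMEOUT_SECONDS = 5.0

# Hypothetical opener that ignores proxy environment variables, since the
# check is meant to reach the monitored host directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_metric(url, timeout=ALERT_TIMEOUT_SECONDS):
    """Fetch and decode a JSON document, or return None on failure.

    `timeout` applies to each blocking socket operation, so a peer that
    never answers cannot hold the connection open indefinitely.
    """
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers URLError, refused connections, and timeouts;
        # ValueError covers undecodable or malformed JSON.
        return None
```

A call such as `fetch_metric("http://namenode.example:50070/jmx")` (a hypothetical host name) would then give up after about five seconds of silence and close its socket. As the second point notes, this limits how long the port is held but does not prevent the collision itself.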

This message was sent by Atlassian JIRA
