ambari-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-8768) Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper re-initialization of agent upon restart
Date Wed, 20 May 2015 19:13:02 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552893#comment-14552893 ]

Hudson commented on AMBARI-8768:
--------------------------------

FAILURE: Integrated in Ambari-trunk-Commit #2660 (See [https://builds.apache.org/job/Ambari-trunk-Commit/2660/])
AMBARI-8768 Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper re-initialization of agent upon restart (additional patch) (dsen) (dsen: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=832f3b9c3dc64997e1a5dbccc585d9acb3e3591c)
* ambari-agent/src/main/python/ambari_agent/Hardware.py


> Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper re-initialization of agent upon restart
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-8768
>                 URL: https://issues.apache.org/jira/browse/AMBARI-8768
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>         Environment: HDP 2.1
>            Reporter: Hari Sekhon
>            Assignee: Dmytro Sen
>         Attachments: AMBARI-8768.patch
>
>
> Ambari agent is susceptible to hanging when the 'df' command blocks. This causes loss of heartbeat and manageability. I've seen this happen when the NFS gateway's HDFS mount point blocks because HDFS isn't available (we had set the NFS soft option on the mount point but then realized that wasn't a good idea, as not everyone's processes and scripts will handle failure gracefully and retry properly).
> Restarting the agent also leaves the df process bound to port 8670, so that process has to be killed manually before the Ambari agent can restart and bind successfully. Even then, you'll see a hang after connecting to the 8440 ca, and the agent never fully initializes, so the heartbeat still never comes back.
> The df command should either run in a separate thread, so that it does not block the main heartbeat and management functions, or have a timeout set on its execution to prevent this issue.
> Regards,
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
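
The timeout approach suggested in the description could look roughly like the sketch below. This is not the committed patch to Hardware.py. It assumes Python 3.7+ (for `communicate(timeout=...)` and `text=True`), and the helper name `run_with_timeout` is hypothetical. It kills the child on timeout and waits only briefly for it to exit, because a process stuck in uninterruptible I/O on a hung NFS mount may not die immediately.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stdout), or None if it timed out.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort: a process blocked on a hung mount may ignore the
        # kill until the I/O returns, so do not wait for it indefinitely.
        proc.kill()
        try:
            proc.wait(timeout=1)
        except subprocess.TimeoutExpired:
            pass
        proc.stdout.close()
        return None
    return proc.returncode, out


if __name__ == "__main__":
    result = run_with_timeout(["df", "-P"], 10)
    if result is None:
        print("df timed out; skipping disk info for this heartbeat")
    else:
        print(result[1])
```

With this shape, the caller can treat a `None` result as "disk information unavailable" and carry on with the heartbeat instead of blocking on df.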



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
