hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
Date Thu, 21 Feb 2013 14:56:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583245#comment-13583245 ]

Daryn Sharp commented on HDFS-4222:
-----------------------------------

bq. Not sure how to make this work. When does thread local variable get initialized and when
is it cleared, given a thread gets used for different current users?

Perhaps it could be initialized in the same places where {{getPermissionChecker}} is being invoked,
or ideally at a higher level so that individual command methods don't each have to "do the right thing".

bq. bq. Another thought might be an option to tell a UGI to "lock-in" its group list. Something
earlier on at a high level, maybe the NN's RPC server, could call UserGroupInformation.getCurrentUser().lockGroups().
bq. Not sure I understood this.

"lockGroups" would internally fetch the groups and then make them immutable in the UGI.  It
could be invoked where {{getPermissionChecker}} is being invoked, or ideally at a higher-level
chokepoint for calls so it's a one-line change.  Maybe in the RPC call's doAs, since a call
shouldn't be running long enough that the groups will change.  This would inoculate future
methods or overlooked methods against taking the lookup penalty within a lock.
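
To make the "lockGroups" idea concrete, here is a rough sketch. It assumes a hypothetical
{{LockableUser}} class standing in for a UGI and a pluggable lookup function standing in for the
LDAP-backed group mapping; neither is an existing Hadoop API, and {{lockGroups()}} does not exist
in UserGroupInformation today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for a UGI; not the real UserGroupInformation. */
final class LockableUser {
    private final String name;
    // Stand-in for the group mapping service; may block on LDAP.
    private final Function<String, List<String>> groupLookup;
    // Null until lockGroups() is called; immutable afterwards.
    private volatile List<String> lockedGroups;

    LockableUser(String name, Function<String, List<String>> groupLookup) {
        this.name = name;
        this.groupLookup = groupLookup;
    }

    /** Fetch the groups once and freeze them for the rest of the call. */
    synchronized LockableUser lockGroups() {
        if (lockedGroups == null) {
            lockedGroups = Collections.unmodifiableList(
                new ArrayList<>(groupLookup.apply(name)));
        }
        return this;
    }

    /** Uses the frozen list if present; otherwise falls back to a live lookup. */
    List<String> getGroups() {
        List<String> locked = lockedGroups;
        return locked != null ? locked : groupLookup.apply(name);
    }
}

final class LockGroupsDemo {
    /** Counts the lookups that happen while the (simulated) namesystem lock is held. */
    static int lookupsWhileLocked() {
        AtomicInteger lookups = new AtomicInteger();
        LockableUser user = new LockableUser("alice", n -> {
            lookups.incrementAndGet();
            return Arrays.asList("staff", "hdfs");
        });

        user.lockGroups();              // done once, before any lock is taken
        int before = lookups.get();

        Object namesystemLock = new Object();
        synchronized (namesystemLock) { // simulated FSNamesystem lock
            user.getGroups();
            user.getGroups();
        }
        return lookups.get() - before;
    }

    public static void main(String[] args) {
        System.out.println(lookupsWhileLocked()); // prints 0
    }
}
```

With the groups locked up front, any later {{getGroups()}} call made under the lock is a plain
field read, so a method that forgets to resolve groups early still cannot stall on LDAP.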

In either case, I'm just trying to think of how to simplify the change and future-proof against
similar issues.  Again though, I really like this change.
                
> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and
LDAP has issues
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4222
>                 URL: https://issues.apache.org/jira/browse/HDFS-4222
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 1.0.0, 0.23.3, 2.0.0-alpha
>            Reporter: Xiaobo Peng
>            Assignee: Xiaobo Peng
>            Priority: Minor
>         Attachments: hdfs-4222-branch-0.23.3.patch, HDFS-4222.patch, HDFS-4222.patch,
hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information via LDAP, the FSNamesystem
calls made on behalf of DFS clients might hang due to LDAP issues (including LDAP access issues
caused by networking issues) while holding the single FSNamesystem lock. That will result
in the NN becoming unresponsive and losing the heartbeats from DNs.
> FSNamesystem calls access LDAP when they instantiate FSPermissionChecker. That instantiation
could be moved out of the lock scope, since it does not need the FSNamesystem
lock. After the move, a hung DFS client call will no longer affect other threads by hogging the single
lock. This is especially helpful when we use separate RPC servers for ClientProtocol and DatanodeProtocol,
since the calls for DatanodeProtocol do not need to access LDAP. So even if DFS client calls hang
due to LDAP issues, the NN will still be able to process the requests (including heartbeats)
from DNs.
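
The change described in the issue can be illustrated with a simplified sketch. The classes below
({{SketchPermissionChecker}}, {{SketchNamesystem}}) are hypothetical and only mimic the shape of
FSPermissionChecker and FSNamesystem; they are not the code from the attached patches. The point
is the ordering: the checker, whose constructor performs the group lookup, is built before the
lock is acquired.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical, simplified checker: resolves the caller's groups in its constructor. */
final class SketchPermissionChecker {
    final String user;
    final List<String> groups;

    SketchPermissionChecker(String user, Function<String, List<String>> groupLookup) {
        this.user = user;
        this.groups = groupLookup.apply(user); // potentially slow (e.g. LDAP)
    }

    boolean inGroup(String group) {
        return groups.contains(group);
    }
}

/** Hypothetical namesystem with a single global lock. */
final class SketchNamesystem {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    private final Function<String, List<String>> groupLookup;
    // Set to true if a group lookup ever runs while this thread holds the write lock.
    volatile boolean lookupRanUnderLock = false;

    SketchNamesystem(Function<String, List<String>> realLookup) {
        // Wrap the lookup so the sketch can detect lookups made under the lock.
        this.groupLookup = user -> {
            if (fsLock.isWriteLockedByCurrentThread()) {
                lookupRanUnderLock = true;
            }
            return realLookup.apply(user);
        };
    }

    boolean checkAccess(String user, String requiredGroup) {
        // Build the checker BEFORE taking the lock: a hung lookup blocks only this caller.
        SketchPermissionChecker pc = new SketchPermissionChecker(user, groupLookup);
        fsLock.writeLock().lock();
        try {
            return pc.inGroup(requiredGroup); // only in-memory work under the lock
        } finally {
            fsLock.writeLock().unlock();
        }
    }
}
```

If the checker were constructed inside the {{try}} block instead, a stalled lookup would hold the
write lock and block every other caller, including DN heartbeat processing.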

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
