hadoop-common-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10442) Group look-up can cause segmentation fault when certain JNI-based mapping module is used.
Date Mon, 31 Mar 2014 19:38:16 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955576#comment-13955576 ]

Kihwal Lee commented on HADOOP-10442:

[~cmccabe]:  I also think the version of nslcd we used is buggy.  The return code handling
before your change was just masking the bug, which likely had other side effects as well.  I observed
many lookup timeouts in the NN prior to the crashes, while my own program, calling the same libc
functions on the same box at the same time, had no issues.  The nslcd lookup timeout was configured
to be 20 seconds in /etc/nslcd.conf.

12:15:21,106  WARN security.Groups: Potential performance problem: getGroups(user=xxxx) took 20020 milliseconds.
12:15:21,107  WARN security.UserGroupInformation: No groups available for user xxxx

bq. Also, looking at this more closely, I believe we mishandle the case where the user is
a member of no groups. This would be a pretty odd configuration (I wonder if it's possible?).

Getting no groups after a successful getpwnam() can probably happen only when the user was
removed between the two calls. All other cases should probably be treated as errors.  I saw cases
of an admin user being denied permission for certain operations, which went away after the
refresh command was issued.  The lookup must have hit the no-group error while building the ACL, and
the empty result was negatively cached. Without negative caching, user-level retries would
have worked.

So the solution might be to let the native code return 0 even on error conditions, as you
suggested, but to make the netgroup modules not do negative caching, that is, not cache the
empty result when a valid user name has no netgroups.
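
To illustrate the negative-caching point, here is a minimal sketch in C of a cache that stores only non-empty results, so a transient lookup failure is retried on the next request. It is not Hadoop's caching code; `group_cache`, `cached_lookup`, `resolver_fn`, and `flaky_resolver` are hypothetical names made up for this example, and the resolver simply returns a group count.

```c
#include <assert.h>
#include <string.h>

#define CACHE_SLOTS  8
#define USER_NAME_LEN 32

/* Hypothetical resolver: returns the number of groups found,
 * or 0 when none were found (including on lookup errors). */
typedef int (*resolver_fn)(const char *user);

struct cache_entry {
    char user[USER_NAME_LEN];
    int  ngroups;
    int  used;
};

struct group_cache {
    struct cache_entry slots[CACHE_SLOTS];
    int next;                       /* next slot to overwrite (round-robin) */
};

/* Returns the group count for `user`, consulting the cache first.
 * Empty results are returned to the caller but never stored. */
int cached_lookup(struct group_cache *cache, const char *user, resolver_fn resolve)
{
    assert(cache != NULL && user != NULL && resolve != NULL);

    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct cache_entry *e = &cache->slots[i];
        if (e->used && strcmp(e->user, user) == 0)
            return e->ngroups;      /* positive hit */
    }

    int n = resolve(user);
    if (n <= 0)
        return 0;                   /* no negative caching: retry next time */

    struct cache_entry *e = &cache->slots[cache->next];
    cache->next = (cache->next + 1) % CACHE_SLOTS;
    /* Over-long names are truncated in the cache; they simply never match
     * a later lookup, which costs a cache miss but not a wrong answer. */
    strncpy(e->user, user, USER_NAME_LEN - 1);
    e->user[USER_NAME_LEN - 1] = '\0';
    e->ngroups = n;
    e->used = 1;
    return n;
}

/* Demo resolver: fails on the first call, then reports 3 groups. */
int resolver_calls = 0;
int flaky_resolver(const char *user)
{
    (void)user;
    resolver_calls++;
    return resolver_calls == 1 ? 0 : 3;
}
```

With `flaky_resolver`, the first lookup for a user reports no groups, the second one reaches the resolver again and succeeds, and only then is the result served from the cache. The trade-off is that users who genuinely have no groups trigger a fresh lookup on every request.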

> Group look-up can cause segmentation fault when certain JNI-based mapping module is used.
> -----------------------------------------------------------------------------------------
>                 Key: HADOOP-10442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10442
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>             Fix For: 3.0.0, 2.4.0, 2.5.0
>         Attachments: HADOOP-10442.patch
> When JniBasedUnixGroupsNetgroupMapping or JniBasedUnixGroupsMapping is used, we get segmentation
> faults very often. The same system ran 2.2 for months without any problem, but as soon as it was
> upgraded to 2.3, it started crashing.  This resulted in multiple name node crashes per day.
> The server was running nslcd (nss-pam-ldapd-0.7.5-15.el6_3.2). We did not see this problem
> on the servers running sssd.
> There was one change in the C code, and it modified the return code handling after the getgrouplist()
> call. If the function returns 0 or a negative value less than -1, it will do realloc() instead
> of returning failure.
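
For reference, the documented getgrouplist() contract is that a return of -1 means the buffer was too small, with the required size stored in *ngroups. A defensive caller might look like the sketch below. This is not the Hadoop native code or the attached patch; `fetch_groups` is a hypothetical helper, and the sketch assumes the glibc/Linux signature that takes a `gid_t *` buffer.

```c
#define _DEFAULT_SOURCE   /* exposes getgrouplist() in glibc's <grp.h> */
#include <assert.h>
#include <errno.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper. On success returns 0, stores a malloc'd array in
 * *out (caller frees) and its length in *count. On failure returns an
 * errno-style code and leaves *out NULL and *count 0.
 */
int fetch_groups(const char *user, gid_t primary, gid_t **out, int *count)
{
    assert(user != NULL && out != NULL && count != NULL);
    *out = NULL;
    *count = 0;

    int capacity = 16;
    gid_t *groups = malloc((size_t)capacity * sizeof(*groups));
    if (groups == NULL)
        return ENOMEM;

    /* Bounded retries: membership could change between two calls. */
    for (int attempt = 0; attempt < 4; attempt++) {
        int ngroups = capacity;
        int ret = getgrouplist(user, primary, groups, &ngroups);

        /* Success only if the reported count fits the buffer we passed. */
        if (ret >= 0 && ngroups >= 0 && ngroups <= capacity) {
            *out = groups;
            *count = ngroups;
            return 0;
        }

        /* Only -1 with a larger size request means "buffer too small".
         * Anything else is treated as a failure, not as a reason to realloc. */
        if (ret != -1 || ngroups <= capacity) {
            free(groups);
            return EIO;
        }

        gid_t *bigger = realloc(groups, (size_t)ngroups * sizeof(*bigger));
        if (bigger == NULL) {
            free(groups);
            return ENOMEM;
        }
        groups = bigger;
        capacity = ngroups;
    }

    free(groups);
    return EIO;
}
```

The bounds check on `ngroups` is there so that an inconsistent answer from an NSS module cannot drive a realloc() or a later read with a bogus size.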
