hadoop-common-issues mailing list archives

From "Stephen O'Donnell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13263) Reload cached groups in background after expiry
Date Fri, 24 Jun 2016 21:07:16 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348631#comment-15348631 ]

Stephen O'Donnell commented on HADOOP-13263:

[~jojochuang] Thanks for the review.

What's the purpose of getBackgroundRefreshSuccess(), getBackgroundRefreshException, getBackgroundRefreshQueued,
getBackgroundRefreshRunning in Group class?

[~arpitagarwal] suggested that we put some counters in that can be exposed as Namenode metrics
in a further Jira. I think it makes sense, otherwise it will be impossible to know in a running
system if the refresh queue is getting very large, or if refreshes are hitting an exception.

I also wonder if the new properties should be defined in CommonConfigurationKeys instead.

I am happy to move these if you want. I found the existing group cache parameters in `CommonConfigurationKeysPublic`,
so I kept them together. Let me know if you want me to move them and I can submit another
patch version.

Do you want me to update the GroupsMapping.md and core-default.xml within this Jira so it
all gets committed together, or should we do docs separately? I've got the following ready
to go:

    Whether to reload expired user->group mappings using a background thread
    pool. If set to true, a pool of
    hadoop.security.groups.cache.background.reload.threads is created to
    update the cache in the background.

    Only relevant if hadoop.security.groups.cache.background.reload is true.
    Controls the number of concurrent background user->group cache entry
    refreshes. Pending refresh requests beyond this value are queued and
    processed when a thread is free.
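
To illustrate the queueing behaviour described for the thread-count property, here is a minimal sketch using only `java.util.concurrent`. It is not the code from the patch: the class `RefreshPoolSketch` and the method `runningAndQueued` are hypothetical names, and the blocked tasks simply stand in for slow group lookups. It assumes a fixed-size pool backed by an unbounded queue, which is one plausible way to get "requests beyond this value are queued".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from the HADOOP-13263 patch.
final class RefreshPoolSketch {

    /**
     * Submits `requests` blocking tasks to a fixed pool of `threads` workers
     * and returns {running, queued} as observed while every worker is busy.
     */
    static int[] runningAndQueued(int threads, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(Math.min(threads, requests));
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < requests; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // stand-in for a slow group lookup
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // wait until all workers have picked up a task
            return new int[] {pool.getActiveCount(), pool.getQueue().size()};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] counts = runningAndQueued(3, 5);
        System.out.println("running=" + counts[0] + " queued=" + counts[1]);
    }
}
```

With 3 threads and 5 pending refreshes, 3 run and 2 wait in the queue. These are the kind of values that running and queued counters could expose as metrics.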

And for the GroupsMapping.md:

With the default caching implementation, a cache entry expires `hadoop.security.groups.cache.secs`
seconds after it was loaded. The next thread to request group membership then queries the group mapping
service provider to look up the current groups for the user. While this lookup is running,
the thread that initiated it blocks, while any other threads requesting groups for the
same user retrieve the previously cached values. If the refresh fails, the thread performing
the refresh throws an exception and the process repeats for the next thread that requests
a lookup for that value. If the lookup repeatedly fails and the cache is not updated, after
`hadoop.security.groups.cache.secs * 10` seconds the cached entry is evicted and all
threads block until a successful reload is performed.

To avoid any threads blocking when the cached entry expires, set `hadoop.security.groups.cache.background.reload`
to true. This enables a small thread pool of `hadoop.security.groups.cache.background.reload.threads`
threads (3 by default). With this setting, when the cache is queried for an expired
entry, the expired result is returned immediately and a task is queued to refresh the cache
in the background. If the background refresh fails, a new refresh operation will be queued
by the next request to the cache, until `hadoop.security.groups.cache.secs * 10` seconds have
passed, when the cached entry will be evicted and all threads will block for that user until
a successful reload is performed.
If you give this a quick review and let me know if it should be in this patch I can get a
new version pushed up pretty quickly.

> Reload cached groups in background after expiry
> -----------------------------------------------
>                 Key: HADOOP-13263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13263
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>         Attachments: HADOOP-13263.001.patch, HADOOP-13263.002.patch, HADOOP-13263.003.patch,
HADOOP-13263.004.patch, HADOOP-13263.005.patch, HADOOP-13263.006.patch
> In HADOOP-11238 the Guava cache was introduced to allow refreshes on the Namenode group
cache to run in the background, avoiding many slow group lookups. Even with this change, I
have seen quite a few clusters with issues due to slow group lookups. The problem is most
prevalent in HA clusters, where a slow group lookup on the hdfs user can fail to return for
over 45 seconds causing the Failover Controller to kill it.
> The way the current Guava cache implementation works is approximately:
> 1) On initial load, the first thread to request groups for a given user blocks until
it returns. Any subsequent threads requesting that user block until that first thread populates
the cache.
> 2) When the key expires, the first thread to hit the cache after expiry blocks. While
it is blocked, other threads will return the old value.
> I feel it is this blocking thread that still gives the Namenode issues on slow group
lookups. If the call from the FC is the one that blocks and lookups are slow, it can cause
the NN to be killed.
> Guava has the ability to refresh expired keys completely in the background, where the
first thread that hits an expired key schedules a background cache reload, but still returns
the old value. Then the cache is eventually updated. This patch introduces this background
reload feature. There are two new parameters:
> 1) hadoop.security.groups.cache.background.reload - default false to keep the current
behaviour. Set to true to enable a small thread pool and background refresh for expired keys
> 2) hadoop.security.groups.cache.background.reload.threads - only relevant if the above
is set to true. Controls how many threads are in the background refresh pool. Default is 1,
which is likely to be enough.
