hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Arpit Agarwal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13263) Reload cached groups in background after expiry
Date Tue, 14 Jun 2016 22:55:30 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330817#comment-15330817

Arpit Agarwal commented on HADOOP-13263:

The patch looks good, thanks for contributing it.

A few comments:
# We should set keepAliveTime for thread pool threads so we don't always pay the fixed cost.
e.g. [via this constructor|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html#ThreadPoolExecutor(int,%20int,%20long,%20java.util.concurrent.TimeUnit,%20java.util.concurrent.BlockingQueue)].
I'd suggest [newCachedThreadPool|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()]
but it doesn't appear to support maximum number of threads.
# Can you avoid using the static block in {{GroupCacheLoader}}? The thread pool can be instantiated
at the first async reload.
# Perhaps the default should be 3-5 threads instead of 1. With keepAliveTime these threads
won't be using up memory all the time.
# I don't think there can be a generic way for {{reload}} to timeout the {{Future}} since
we don't require the {{GroupMappingServiceProvider}} implementations to support one so we
can remove this comment.
     // call? If so, do we need a way of timing out the future
     // so all the threads don't get blocked?

Couple of counters Groups can expose now or later in another. These could be NameNode metrics
# The number of pending cache reloads blocked on executors i.e. counter incremented just before
{{executorService#submit}} and decremented when the call is invoked.
# Number of load operations currently in progress.

> Reload cached groups in background after expiry
> -----------------------------------------------
>                 Key: HADOOP-13263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13263
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>         Attachments: HADOOP-13263.001.patch
> In HADOOP-11238 the Guava cache was introduced to allow refreshes on the Namenode group
cache to run in the background, avoiding many slow group lookups. Even with this change, I
have seen quite a few clusters with issues due to slow group lookups. The problem is most
prevalent in HA clusters, where a slow group lookup on the hdfs user can fail to return for
over 45 seconds causing the Failover Controller to kill it.
> The way the current Guava cache implementation works is approximately:
> 1) On initial load, the first thread to request groups for a given user blocks until
it returns. Any subsequent threads requesting that user block until that first thread populates
the cache.
> 2) When the key expires, the first thread to hit the cache after expiry blocks. While
it is blocked, other threads will return the old value.
> I feel it is this blocking thread that still gives the Namenode issues on slow group
lookups. If the call from the FC is the one that blocks and lookups are slow, if can cause
the NN to be killed.
> Guava has the ability to refresh expired keys completely in the background, where the
first thread that hits an expired key schedules a background cache reload, but still returns
the old value. Then the cache is eventually updated. This patch introduces this background
reload feature. There are two new parameters:
> 1) hadoop.security.groups.cache.background.reload - default false to keep the current
behaviour. Set to true to enable a small thread pool and background refresh for expired keys
> 2) hadoop.security.groups.cache.background.reload.threads - only relevant if the above
is set to true. Controls how many threads are in the background refresh pool. Default is 1,
which is likely to be enough.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

View raw message