hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9409) DataNode shutdown does not guarantee full shutdown of all threads due to race condition.
Date Tue, 10 Nov 2015 20:14:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999257#comment-14999257 ]

Chris Nauroth commented on HDFS-9409:

{{DataNode#shutdown}} calls {{BlockPoolManager#getAllNamenodeThreads}} to get every {{BPOfferService}}.
Later in {{shutdown}}, these are passed to {{BlockPoolManager#shutDownAll}}, which
eventually stops and joins each {{BPServiceActor}} thread.  There are two problems:

# {{BlockPoolManager#getAllNamenodeThreads}} returns an unmodifiable wrapper over its underlying
list, so callers can't mutate the list, but it's still the same shared backing list.  Later
during shutdown, the {{BPServiceActor}} is told that it can exit its main loop.  Part of that
is a call on the {{BPServiceActor}} thread to {{BlockPoolManager#remove}}.  This effectively
removes it from the backing list returned by {{BlockPoolManager#getAllNamenodeThreads}}, so
it will appear to vanish from the list before the call to {{BlockPoolManager#shutDownAll}}.
# Even if point 1 is fixed by changing {{BlockPoolManager#getAllNamenodeThreads}} to return
a deep copy, there is a similar problem in that {{BPOfferService#shutdownActor}} will remove
the actor from its internal tracking list.
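
To illustrate point 1 in isolation, here is a minimal sketch of how an unmodifiable view differs from a copy. It is not the Hadoop code: {{ViewVersusSnapshot}}, its methods, and the string entries are hypothetical stand-ins for {{BlockPoolManager}} and its list of {{BPOfferService}} instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a manager that tracks services in an internal list.
final class ViewVersusSnapshot {
    private final List<String> services = new ArrayList<>();

    public ViewVersusSnapshot(List<String> initial) {
        services.addAll(initial);
    }

    /** Read-only view: callers cannot mutate it, but it reflects later removals. */
    public List<String> getView() {
        return Collections.unmodifiableList(services);
    }

    /** Independent copy of the list contents at the time of the call. */
    public List<String> getSnapshot() {
        return new ArrayList<>(services);
    }

    /** Mimics a service removing itself from the manager while shutting down. */
    public void remove(String service) {
        services.remove(service);
    }

    public static void main(String[] args) {
        ViewVersusSnapshot manager =
            new ViewVersusSnapshot(Arrays.asList("bpos-1", "bpos-2"));
        List<String> view = manager.getView();
        List<String> snapshot = manager.getSnapshot();

        manager.remove("bpos-1");

        System.out.println("view size = " + view.size());         // prints "view size = 1"
        System.out.println("snapshot size = " + snapshot.size()); // prints "snapshot size = 2"
    }
}
```

The view loses the entry as soon as the backing list changes, which matches the "vanishing" behavior described in point 1. The copy keeps it, but as point 2 notes, that alone would not help if the copied objects drop their own internal references.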

Because of these two problems, {{DataNode#shutdown}} might no longer have a reference to the
{{BPServiceActor}} threads when it tries to stop and join on them.  Therefore, those threads
might still be alive even after completion of {{DataNode#shutdown}}.  I noticed this while
trying to write a test that asserts a particular thread has exited after DataNode shutdown.
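
One possible way to avoid losing the references, sketched under assumptions rather than taken from any patch: capture the thread references before signaling them to stop, then join on the captured set. All names here ({{SnapshotThenJoin}}, {{Actor}}, {{requestStop}}) are hypothetical; {{Actor}} only imitates a worker that removes itself from a shared registry when its loop exits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class SnapshotThenJoin {

    /** Hypothetical worker thread that deregisters itself when its loop exits. */
    public static final class Actor extends Thread {
        private final List<Actor> registry;
        private volatile boolean running = true;

        Actor(List<Actor> registry) {
            this.registry = registry;
            setDaemon(true);
        }

        public void requestStop() {
            running = false;
            interrupt();
        }

        @Override
        public void run() {
            try {
                while (running) {
                    Thread.sleep(10);
                }
            } catch (InterruptedException e) {
                // Interrupted by requestStop; fall through to cleanup.
            } finally {
                registry.remove(this); // self-removal, as described above
            }
        }
    }

    /** Starts n actors and returns the shared registry that tracks them. */
    public static List<Actor> startActors(int n) {
        List<Actor> registry = new CopyOnWriteArrayList<>();
        for (int i = 0; i < n; i++) {
            Actor actor = new Actor(registry);
            registry.add(actor);
            actor.start();
        }
        return registry;
    }

    /** Copies the references first, then stops and joins every copied thread. */
    public static List<Actor> shutdown(List<Actor> registry) {
        List<Actor> snapshot = new ArrayList<>(registry);
        for (Actor actor : snapshot) {
            actor.requestStop();
        }
        try {
            for (Actor actor : snapshot) {
                actor.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining actors", e);
        }
        return snapshot;
    }
}
```

Because the snapshot is taken before any actor is told to stop, self-removal from the registry no longer affects which threads get joined. A test can then check {{isAlive()}} on each returned thread after {{shutdown}} returns.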

> DataNode shutdown does not guarantee full shutdown of all threads due to race condition.
> ----------------------------------------------------------------------------------------
>                 Key: HDFS-9409
>                 URL: https://issues.apache.org/jira/browse/HDFS-9409
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Chris Nauroth
> {{DataNode#shutdown}} is documented to return "only after shutdown is complete".  Even
> after completion of this method, it's possible that threads started by the DataNode are still
> running.  Race conditions in the shutdown sequence may cause it to skip stopping and joining
> the {{BPServiceActor}} threads.
> This is likely not a big problem in normal operations, because these are daemon threads
> that won't block overall process exit.  It is more of a problem for tests, because it makes
> it impossible to write reliable assertions that these threads exited cleanly.  For large test
> suites, it can also cause an accumulation of unneeded threads, which might harm test performance.

This message was sent by Atlassian JIRA
