hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-11802) DomainSocketWatcher thread terminates sometimes after there is an I/O error during requestShortCircuitShm
Date Tue, 14 Apr 2015 23:07:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394675#comment-14394675 ]

Colin Patrick McCabe edited comment on HADOOP-11802 at 4/14/15 11:06 PM:
-------------------------------------------------------------------------

In the main {{finally}} block of {{DomainSocketWatcher#watcherThread}}, the call to {{sendCallback}}
can throw an {{IllegalStateException}}, which leaves some cleanup tasks undone.

{code}
      } finally {
        lock.lock();
        try {
          kick(); // allow the handler for notificationSockets[0] to read a byte
          for (Entry entry : entries.values()) {
            // We do not remove from entries as we iterate, because that can
            // cause a ConcurrentModificationException.
            sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
          }
          entries.clear();
          fdSet.close();
        } finally {
          lock.unlock();
        }
      }
{code}

The exception causes {{watcherThread}} to skip the calls to {{entries.clear()}} and {{fdSet.close()}}.
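
One way to keep the cleanup from being skipped is sketched below. This is a hypothetical illustration and not the patch for this issue: the class {{GuardedCleanupSketch}}, the {{Handler}} interface, and the {{failedFds}} and {{fdSetClosed}} fields are all invented stand-ins for the watcher's state. The idea is to catch a failure from each individual callback and to put the final cleanup in its own {{finally}}, so that one misbehaving handler cannot prevent the equivalent of {{entries.clear()}} and {{fdSet.close()}}.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the watcher's shutdown path. None of these
// names are taken from the Hadoop source.
final class GuardedCleanupSketch {
  // Assumed callback shape: invoked once per watched fd at shutdown.
  interface Handler {
    void onClose(int fd);
  }

  final ReentrantLock lock = new ReentrantLock();
  final Map<Integer, Handler> entries = new LinkedHashMap<>();
  final List<Integer> failedFds = new ArrayList<>();
  boolean fdSetClosed = false; // stands in for fdSet.close()

  void shutdown() {
    lock.lock();
    try {
      try {
        for (Map.Entry<Integer, Handler> e : entries.entrySet()) {
          try {
            e.getValue().onClose(e.getKey());
          } catch (RuntimeException ex) {
            // Record the failure and continue with the remaining entries.
            failedFds.add(e.getKey());
          }
        }
      } finally {
        // Runs even if something other than a RuntimeException escapes the loop.
        entries.clear();
        fdSetClosed = true;
      }
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    GuardedCleanupSketch w = new GuardedCleanupSketch();
    w.entries.put(3, fd -> {
      throw new IllegalStateException("failed to remove " + fd);
    });
    w.entries.put(4, fd -> { });
    w.shutdown();
    System.out.println("entries left: " + w.entries.size()
        + ", fdSet closed: " + w.fdSetClosed
        + ", failed fds: " + w.failedFds);
  }
}
```

Whether swallowing the handler's exception is acceptable depends on what the handler was guarding. The sketch only shows that the cleanup steps can be made unconditional.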

{code}
2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
        at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
        at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
        at java.lang.Thread.run(Thread.java:722)
{code}

Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or HADOOP-10404. The
cluster installation is running code with all of these fixes.

The exception is thrown from this line in {{sendCallback}}:
{code}
    if (entry.getHandler().handle(sock)) {
{code}

Once the {{IllegalStateException}} occurs, I see 4069 datanode threads stuck in
{{DomainSocketWatcher#add}}, each blocked while {{DataXceiver}} tries to request a new
short-circuit read. These symptoms resemble HADOOP-11333 but, as mentioned above, the
cluster is already running with that fix.

Here is the stack trace from the stuck threads, for reference:
{noformat}
"DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]" daemon prio=10 tid=0x00007fcbbcae1000 nid=0x498a waiting on condition [0x00007fcb61132000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000d06c3a78> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:323)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:322)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:403)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:722)
{noformat}
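
The trace above shows the threads parked in {{ConditionObject.await}} inside {{add}}. A plausible reading is that they are waiting for a signal that only the watcher thread would send, and that thread has already terminated. The following is a simplified model of that situation, not the Hadoop implementation: the class {{StuckWaiterSketch}}, the {{watcherDead}} flag, and the {{watcherDied}} method are all assumptions made for illustration. It shows a waiter that checks a "watcher is gone" flag and uses a bounded wait, so that it returns instead of parking forever.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a producer waiting on a watcher thread.
// None of these names are taken from the Hadoop source.
final class StuckWaiterSketch {
  final ReentrantLock lock = new ReentrantLock();
  final Condition processed = lock.newCondition();
  final Set<Integer> pending = new HashSet<>();
  boolean watcherDead = false;

  // Returns true if the watcher processed the fd, false if the watcher
  // died or the timeout expired before that happened.
  boolean add(int fd, long timeoutMs) throws InterruptedException {
    lock.lock();
    try {
      if (watcherDead) {
        return false;
      }
      pending.add(fd);
      long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      while (pending.contains(fd)) {
        if (watcherDead || nanos <= 0) {
          pending.remove(fd);
          return false;
        }
        nanos = processed.awaitNanos(nanos);
      }
      return true;
    } finally {
      lock.unlock();
    }
  }

  // Called on the watcher's exit path: mark it dead and wake all waiters.
  void watcherDied() {
    lock.lock();
    try {
      watcherDead = true;
      processed.signalAll();
    } finally {
      lock.unlock();
    }
  }

  // Runs one waiter against a watcher that dies without processing anything.
  static boolean demo() {
    StuckWaiterSketch w = new StuckWaiterSketch();
    AtomicBoolean result = new AtomicBoolean(true);
    Thread t = new Thread(() -> {
      try {
        result.set(w.add(7, 5000));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    t.start();
    w.watcherDied();
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return result.get();
  }

  public static void main(String[] args) {
    System.out.println("add() returned: " + demo());
  }
}
```

In this model the waiter returns {{false}} whether {{watcherDied}} runs before or after it starts waiting, because the flag is checked both on entry and after every wakeup.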


was (Author: eepayne):
The place in {{sendCallback}} where it is encountering the exception is
{code}
    if (entry.getHandler().handle(sock)) {
{code}

Once the {{IllegalStateException}} occurs, I am seeing 4069 datanode threads getting stuck
in {{DomainSocketWatcher#add}} when {{DataXceiver}} is trying to request a new short circuit
read. This is similar to the symptoms seen in HADOOP-11333, but, as I mentioned above, the
cluster is already running with that fix.

Here is the stack trace from the stuck threads, for reference:
{noformat}
"DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]" daemon prio=10 tid=0x00007fcbbcae1000 nid=0x498a waiting on condition [0x00007fcb61132000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000d06c3a78> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:323)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:322)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:403)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:722)
{noformat}

> DomainSocketWatcher thread terminates sometimes after there is an I/O error during requestShortCircuitShm
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the call to {{sendCallback}} can encounter an {{IllegalStateException}}, and leave some cleanup tasks undone.
> {code}
>       } finally {
>         lock.lock();
>         try {
>           kick(); // allow the handler for notificationSockets[0] to read a byte
>           for (Entry entry : entries.values()) {
>             // We do not remove from entries as we iterate, because that can
>             // cause a ConcurrentModificationException.
>             sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>           }
>           entries.clear();
>           fdSet.close();
>         } finally {
>           lock.unlock();
>         }
>       }
> {code}
> The exception causes {{watcherThread}} to skip the calls to {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or HADOOP-10404. The cluster installation is running code with all of these fixes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
