hadoop-common-issues mailing list archives

From "Eric Payne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11802) DomainSocketWatcher#watcherThread can encounter IllegalStateException in finally block when calling sendCallback
Date Mon, 13 Apr 2015 22:11:13 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493182#comment-14493182 ]

Eric Payne commented on HADOOP-11802:

Thanks again, [~cmccabe], for your comments and taking time on this issue.

One thing to note is that just prior to these problems, a 195-second GC was taking place on
the DN.

I added a catch of {{Throwable}} in the main thread of the {{DomainSocketWatcher}} and reproduced
the problem. AFAICT, the following represents what is happening:

- Request for short circuit read is received
- {{DataXceiver#requestShortCircuitShm}} calls {{ShortCircuitRegistry#createNewMemorySegment}},
which creates a shared memory segment and associates it with the passed domain socket in the
{{DomainSocketWatcher}}. Then, in that thread, {{createNewMemorySegment}} waits on that socket/shm
entry in {{DomainSocketWatcher#add}}.
{code}
  public NewShmInfo createNewMemorySegment(String clientName,
      ...
    watcher.add(sock, shm);
{code}
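The blocking handshake in {{DomainSocketWatcher#add}} can be sketched roughly as follows. This is a minimal standalone sketch of the pattern, not the Hadoop implementation; all class and method names here are hypothetical. The calling thread (the DataXceiver in the real code) queues an entry and then waits on a condition until the watcher thread acknowledges it:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of the add/wait handshake, not the Hadoop code. */
public class WatcherAddSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition processed = lock.newCondition();
  private boolean pendingAdd = false;
  private boolean added = false;

  /** Called by the requesting thread; blocks until the watcher picks it up. */
  public void add() throws InterruptedException {
    lock.lock();
    try {
      pendingAdd = true;
      while (!added) {          // loop guards against spurious wakeups
        processed.await();
      }
    } finally {
      lock.unlock();
    }
  }

  /** One iteration of the watcher thread's loop. */
  public void watcherIteration() {
    lock.lock();
    try {
      if (pendingAdd) {
        added = true;           // the real code registers the fd for poll(2) here
        processed.signalAll();
      }
    } finally {
      lock.unlock();
    }
  }

  public boolean isAdded() {
    lock.lock();
    try { return added; } finally { lock.unlock(); }
  }

  public static boolean demo() {
    WatcherAddSketch w = new WatcherAddSketch();
    Thread watcher = new Thread(() -> {
      while (!w.isAdded()) {
        w.watcherIteration();
      }
    });
    watcher.start();
    try {
      w.add();                  // blocks until the watcher acknowledges
      watcher.join();
    } catch (InterruptedException e) {
      return false;
    }
    return w.isAdded();
  }
}
```

The important property for this bug is that the requesting thread is parked on a condition inside add() while a completely different thread owns the socket's lifecycle.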
- It's at this point that things get confusing, and I'm still working out why this happens.
The wait wakes up, but things are not normal; yet it was not woken by an exception, either.
You can tell that no exception was thrown inside {{createNewMemorySegment}} to wake it up,
because the code goes on to call {{sendShmSuccessResponse}}, which is where the next bad
thing happens:
{code}
    public void requestShortCircuitShm(String clientName) throws IOException {
      try {
        shmInfo = datanode.shortCircuitRegistry.
            createNewMemorySegment(clientName, sock);
        // After calling #{ShortCircuitRegistry#createNewMemorySegment}, the
        // socket is managed by the DomainSocketWatcher, not the DataXceiver.
      } catch (UnsupportedOperationException e) {
        ...
            "This datanode has not been configured to support " +
            "short-circuit shared memory segments.");
      } catch (IOException e) {
        ...
            "Failed to create shared file descriptor: " + e.getMessage());
      }
      ...
      sendShmSuccessResponse(sock, shmInfo);
{code}
- At this point, the call to {{sendShmSuccessResponse}} gets an exception:
{code}
2015-04-04 13:12:30,973 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]]
      INFO DataNode.clienttrace: cliID: DFSClient_attempt_1427231924849_569269_m_002116_0_-161414780_1,
      src:, dest:, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a,
      srvID: a2d3bac0-e98b-4b73-a5a1-82c7eb557a7a, success: false
2015-04-04 13:12:30,984 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]]
      ERROR datanode.DataNode: host.domain.com:1004:DataXceiver error processing
      REQUEST_SHORT_CIRCUIT_SHM operation  src: unix:/home/gs/var/run/hdfs/dn_socket dst:
java.net.SocketException: write(2) error: Broken pipe
        at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
        at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:601)
        at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
        at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
        at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.sendShmSuccessResponse(DataXceiver.java:380)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:418)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:722)
{code}
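The broken pipe here is consistent with the client having given up and closed its end of the socket during the 195-second GC pause. A minimal sketch of that failure mode (hypothetical class name; this uses a loopback TCP pair rather than a UNIX domain socket, but the behavior of writing after the peer closes is the same):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical sketch: writing to a socket whose peer has already closed. */
public class BrokenPipeSketch {
  public static boolean writeAfterPeerClose() {
    try (ServerSocket server =
             new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
         Socket client =
             new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
      server.accept().close();      // the peer goes away (the "client timed out")
      OutputStream out = client.getOutputStream();
      // The first writes may land in local buffers; once the peer's reset
      // is processed, a write fails just as the DataNode's write(2) did.
      for (int i = 0; i < 1000; i++) {
        out.write(new byte[1024]);
        out.flush();
      }
      return false;                 // no failure observed (unexpected)
    } catch (IOException expected) {
      return true;                  // e.g. "Broken pipe" / "Connection reset"
    }
  }
}
```

In other words, the DataNode did nothing wrong at this step; it was simply the first party to touch a connection that had already died.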
- At this point, it bubbles back up to {{DataXceiver#requestShortCircuitShm}}, which cleans
up, closing the socket:
{code}
      if ((!success) && (peer == null)) {
        // If we failed to pass the shared memory segment to the client,
        // close the UNIX domain socket now.  This will trigger the
        // DomainSocketWatcher callback, cleaning up the segment.
        IOUtils.cleanup(null, sock);
{code}
- Then, the main {{DomainSocketWatcher}} thread wakes up (after the regular timeout interval
has expired) and tries to call {{sendCallbackAndRemove}}, which encounters the following {{IllegalArgumentException}}:
{code}
  final Thread watcherThread = new Thread(new Runnable() {
    ...
        while (true) {
          try {
            for (int fd : fdSet.getAndClearReadableFds()) {
              sendCallbackAndRemove("getAndClearReadableFds", entries, fdSet,
              ...
{code}
{code}
ERROR unix.DomainSocketWatcher: org.apache.hadoop.net.unix.DomainSocketWatcher$2@76845081
      terminating on Throwable
java.lang.IllegalArgumentException: DomainSocketWatcher(103231254): file descriptor 249 was
      closed while still in the poll(2) loop.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:421)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallbackAndRemove(DomainSocketWatcher.java:448)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.access$500(DomainSocketWatcher.java:52)
        at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:470)
        at java.lang.Thread.run(Thread.java:745)
{code}
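So the sequence is a double-cleanup race: the DataXceiver's failure path already closed and cleaned up the socket, and when the watcher thread later runs its own callback for the same entry, the state it expects is gone. A stripped-down sketch of that shape (hypothetical names; not the actual Hadoop registry code, which uses Guava's Preconditions):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of two cleanup paths racing over one registry entry. */
public class DoubleRemoveSketch {
  private final Map<String, Object> segments = new HashMap<>();

  public void register(String shmId) {
    segments.put(shmId, new Object());
  }

  /** Mirrors a checkState-style guard: removing a missing entry is fatal. */
  public void removeShm(String shmId) {
    if (segments.remove(shmId) == null) {
      throw new IllegalStateException("failed to remove " + shmId);
    }
  }

  public static boolean demo() {
    DoubleRemoveSketch registry = new DoubleRemoveSketch();
    registry.register("shm-1");
    registry.removeShm("shm-1");      // first cleanup path (DataXceiver) wins
    try {
      registry.removeShm("shm-1");    // second path (watcher callback) fires later
      return false;
    } catch (IllegalStateException expected) {
      return true;                    // the precondition failure seen in the logs
    }
  }
}
```

Whichever path runs second trips the precondition, which matches both the {{IllegalArgumentException}} above and the {{IllegalStateException}} in the issue description below.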

I welcome your suggestions and look forward to your feedback.

> DomainSocketWatcher#watcherThread can encounter IllegalStateException in finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the call to {{sendCallback}} can encounter an {{IllegalStateException}} and leave some cleanup tasks undone.
> {code}
>       } finally {
>         lock.lock();
>         try {
>           kick(); // allow the handler for notificationSockets[0] to read a byte
>           for (Entry entry : entries.values()) {
>             // We do not remove from entries as we iterate, because that can
>             // cause a ConcurrentModificationException.
>             sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>           }
>           entries.clear();
>           fdSet.close();
>         } finally {
>           lock.unlock();
>         }
>       }
> {code}
> The exception causes {{watcherThread}} to skip the calls to {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src:, dest:, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or HADOOP-10404. The cluster installation is running code with all of these fixes.

This message was sent by Atlassian JIRA
