hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11957) if an IOException error is thrown in DomainSocket.close we go into infinite loop.
Date Tue, 19 May 2015 19:24:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551062#comment-14551062

Anu Engineer commented on HADOOP-11957:

bq. Thanks for looking at this, Anu Engineer. The patch is not correct because right now it
unconditionally breaks out of the loop without waiting for the socket reference count to drop
to 0. This is very dangerous because it means that we may call close on a file descriptor
while another thread is still using it. If this happens, a third thread may open another file,
which will get the same file descriptor number as the one which was just closed. And then
the second thread is doing some arbitrary operation on the wrong file descriptor.

if we got a socket error, then waiting for the count to go to zero will put you on an infinite
loop. What would you like to do ? Fail fast or wait forever ? 

I was seeing calls to shutdown0(native) would fail, in this specific case shutdown0 failed
with socket not connected error.  if you look at the change that I proposed(the test failed)
-- was to try to call shutdown with the counter getting decremented, instead of the current
call where we call into shutdown0. I have not investigated why the specific test case failed.

bq. To be honest, I do not see any way of handling this better than we currently do. If shutdown
fails, we simply have to block until the socket is no longer in use. 

if the shutdown fails, we should bail since current call is to the OS native call layer, then
the assumption that failure happened due to someone else is using socket - AFAIK is not correct.
We should be able to shutdown a socket even if someone is using it , that user will get an
error is a different problem. 

bq. There is no possible shortcut that I can see.

we are in a bad situation when this happens, in the specific case I was seeing  ENOTCONN.
You would be absolutely right, if the error was due to calling socket-Shutdown in java code
in DomainSocket.java, but this call it is to  OS layer and if the OS says it is an invalid
socket, no one else can use it either. That is having a reference count is not going to help

bq. Did you encounter this problem in production? If so, what was the shutdown error?

Production - No, I don't own any clusters unfortunately. During bug bash I was trying to shepherd
one of the bugs from [~cnauroth] , where he had fixed an issue to get the native code compiling
on mac HDFS-3296. But did not get complete it because socket calls were failing. We found
3 separate bugs in that path. I will file those JIRAs ASAP. 

* On Mac - Domain Sockets Path cannot be more then 103 bytes and for some reason getting /tmp
+ date time stamp will make the path longer , because /tmp is a symbolic link. - I have a
simple fix for this.

* On Mac - This failure happens in a specific test , where OS returns socket is not connected
error and we get stuck, we will eventually come out since the test will time out. - This is
the one that I was fixing, In fact I will create a simple wrapper around shutdown0(so I can
use mockito to inject errors) and then call into the same call as you have proposed. But would
like to break out if we get an error like ENOTCONN - Socket not connected. At that point no
one can be using that socket anyway. From looking at the man pages - http://man7.org/linux/man-pages/man2/shutdown.2.html
- I get a feeling that if a socket shutdown has failed, that socket will not be in use by
anyone. They would be getting errors while communicating is my feeling. Do you agree ? if
so, then breaking out seems to be the right thing to do.

* On Mac - There is another bug, where we ask the native layer to accept a connection with
timeout and that call also fails, I looked at the code and did not see anything really wrong.
I need to put it under a debugger. Hopefully once we fix these three bugs, we can run native
library on Mac

> if an IOException error is thrown in DomainSocket.close we go into infinite loop.
> ---------------------------------------------------------------------------------
>                 Key: HADOOP-11957
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11957
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: net
>    Affects Versions: 2.7.1
>            Reporter: Anu Engineer
>            Assignee: Anu Engineer
>         Attachments: HADOOP-11957.001.patch
> if an IOException error is thrown in DomainSocket.close we go into infinite loop.
> Issue : If the shutdown0(fd) call throws an IOException we break out of the if shutdown
call but will continue to loop in the while loop infinitely since we have no way of decrementing
the counter. Please scroll down and see the comment marked with Bug Bug to see where the issue
> {code:title=DomainSocket.java}
>   @Override
>   public void close() throws IOException {
>     // Set the closed bit on this DomainSocket
>     int count = 0;
>     try {
>       count = refCount.setClosed();
>     } catch (ClosedChannelException e) {
>       // Someone else already closed the DomainSocket.
>       return;
>     }
>     // Wait for all references to go away
>     boolean didShutdown = false;
>     boolean interrupted = false;
>     while (count > 0) {
>       if (!didShutdown) {
>         try {
>           // Calling shutdown on the socket will interrupt blocking system
>           // calls like accept, write, and read that are going on in a
>           // different thread.
>           shutdown0(fd);
>         } catch (IOException e) {
>           LOG.error("shutdown error: ", e);
>         }
>         didShutdown = true; 
>         // *BUG BUG* <-- Here the code will never exit the loop
>         // if the count is greater then 0. we need to break out
>         // of the while loop in case of IOException Error
>       }
>       try {
>         Thread.sleep(10);
>       } catch (InterruptedException e) {
>         interrupted = true;
>       }
>       count = refCount.getReferenceCount();
>     }
>     // At this point, nobody has a reference to the file descriptor, 
>     // and nobody will be able to get one in the future either.
>     // We now call close(2) on the file descriptor.
>     // After this point, the file descriptor number will be reused by 
>     // something else.  Although this DomainSocket object continues to hold 
>     // the old file descriptor number (it's a final field), we never use it 
>     // again because this DomainSocket is closed.
>     close0(fd);
>     if (interrupted) {
>       Thread.currentThread().interrupt();
>     }
>   }
> {code}

This message was sent by Atlassian JIRA

View raw message