hadoop-common-issues mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Work logged] (HADOOP-17552) The read method of hadoop.ipc.Client$Connection$PingInputStream may swallow java.net.SocketTimeoutException due to the mistaken usage of the rpcTimeout configuration
Date Mon, 01 Mar 2021 06:30:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=559217&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-559217 ]

ASF GitHub Bot logged work on HADOOP-17552:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Mar/21 06:29
            Start Date: 01/Mar/21 06:29
    Worklog Time Spent: 10m 
      Work Description: functioner commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-787684867


   > > > @functioner see TestRPC#testClientRpcTimeout
   > > > Maybe it is expected that a ping is sent all the time when _ipc.client.rpc-timeout.ms_ is less than 0.
   > > 
   > > 
   > > @ferhui If this is expected, then the client should be able to automatically request a new connection rather than blindly waiting on the problematic connection it currently uses. Otherwise this endless waiting is meaningless, because at that point it is impossible for the server to add the connection to its Reader. In this case a timeout is better, because the user (client) can at least become aware of the issue.
   > 
   > In this case, we can set ipc.client.rpc-timeout.ms to a suitable value and resolve it. Let us hear the author's (@iwasakims) thoughts first and then resolve it carefully.
   
   Agree.
   And just for reference, the explanation of the `ipc.client.rpc-timeout.ms` configuration in [core-default.xml](https://hadoop.apache.org/docs/r3.2.2/hadoop-project-dist/hadoop-common/core-default.xml) also emphasizes that "the effective value of the rpc-timeout is rounded up to multiple of ipc.ping.interval".
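   For intuition, here is a minimal sketch of that rounding rule (plain Java, for illustration only; the helper name is ours, not Hadoop's):
{code:java}
// Illustrative only: the effective timeout the client can actually enforce
// when pings are enabled, per the core-default.xml description quoted above.
static int effectiveRpcTimeout(int rpcTimeout, int pingInterval, boolean doPing) {
  if (!doPing || rpcTimeout <= 0 || rpcTimeout <= pingInterval) {
    return rpcTimeout;  // no rounding applies
  }
  // round up to the next multiple of pingInterval
  return ((rpcTimeout + pingInterval - 1) / pingInterval) * pingInterval;
}
// e.g. rpcTimeout=90000, pingInterval=60000 -> effective timeout 120000
{code}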


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 559217)
    Time Spent: 2h 20m  (was: 2h 10m)

> The read method of hadoop.ipc.Client$Connection$PingInputStream may swallow java.net.SocketTimeoutException due to the mistaken usage of the rpcTimeout configuration
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-17552
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17552
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 3.2.2
>            Reporter: Haoze Wu
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
>     We are doing some systematic fault-injection testing on Hadoop 3.2.2, and when we run a client (e.g., `bin/hdfs dfs -ls /`) against our HDFS cluster (1 NameNode, 2 DataNodes), the client gets stuck forever. After some investigation, we believe this is a bug in `hadoop.ipc.Client`: the read method of `hadoop.ipc.Client$Connection$PingInputStream` keeps swallowing `java.net.SocketTimeoutException` because the `handleTimeout` method uses the `rpcTimeout` configuration incorrectly.
> *Reproduction*
>     Start HDFS with the default configuration, then execute a client (we used the command `bin/hdfs dfs -ls /` in a terminal). While HDFS is accepting the client's socket, inject a socket error (java.net.SocketException or java.io.IOException), specifically at line 1402 (line 1403 or 1404 also works).
>     We provide reproduction scripts in a gist ([https://gist.github.com/functioner/08bcd86491b8ff32860eafda8c140e24]).
> *Diagnosis*
>     When the NameNode accepts a client's socket, there are basically 4 steps (shown in the code below):
>  # accept the socket (line 1400)
>  # configure the socket (lines 1402-1404)
>  # hand the socket to a Reader (after line 1404)
>  # swallow any possible IOException at line 1350
> {code:java}
> //hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
>     public void run() {
>       while (running) {
>         SelectionKey key = null;
>         try {
>           getSelector().select();
>           Iterator<SelectionKey> iter = getSelector().selectedKeys().iterator();
>           while (iter.hasNext()) {
>             key = iter.next();
>             iter.remove();
>             try {
>               if (key.isValid()) {
>                 if (key.isAcceptable())
>                   doAccept(key);
>               }
>             } catch (IOException e) {                         // line 1350
>             }
>             key = null;
>           }
>         } catch (OutOfMemoryError e) {
>           // ...
>         } catch (Exception e) {
>           // ...
>         }
>       }
>     }
>     void doAccept(SelectionKey key) throws InterruptedException, IOException, 
>         OutOfMemoryError {
>       ServerSocketChannel server = (ServerSocketChannel) key.channel();
>       SocketChannel channel;
>       while ((channel = server.accept()) != null) {           // line 1400
>         channel.configureBlocking(false);                     // line 1402
>         channel.socket().setTcpNoDelay(tcpNoDelay);           // line 1403
>         channel.socket().setKeepAlive(true);                  // line 1404
>         
>         Reader reader = getReader();
>         Connection c = connectionManager.register(channel,
>             this.listenPort, this.isOnAuxiliaryPort);
>         // If the connectionManager can't take it, close the connection.
>         if (c == null) {
>           if (channel.isOpen()) {
>             IOUtils.cleanup(null, channel);
>           }
>           connectionManager.droppedConnections.getAndIncrement();
>           continue;
>         }
>         key.attach(c);  // so closeCurrentConnection can get the object
>         reader.addConnection(c);
>       }
>     }
> {code}
>     When a SocketException occurs at line 1402 (or 1403 or 1404), the server.accept() at line 1400 has already completed, so we expect the following behavior:
>  # The server (NameNode) accepts this connection but will essentially write nothing to it, because the connection is never added to a Reader.
>  # The client sees that the connection has been established and tries to read and write on it. After some time threshold, the client finds that it cannot read anything from this connection and exits with an exception or error.
>     However, we do not observe behavior 2. The client just gets stuck forever (>10 min). We re-examined the default configuration in [https://hadoop.apache.org/docs/r3.2.2/hadoop-project-dist/hadoop-common/core-default.xml] and believe the client should be able to time out within 60 seconds, according to the parameters `ipc.client.rpc-timeout.ms`, `ipc.client.ping` and `ipc.ping.interval`. As a sanity check of behavior 2, the standalone sketch below (plain Java, independent of Hadoop; the class name is ours) shows that a read with a socket timeout on a connection the server accepted but never services does surface a java.net.SocketTimeoutException:
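> {code:java}
> // Minimal standalone sketch (not Hadoop code; class name is ours) of the
> // expected behavior 2: the server accepts the connection but never writes,
> // and a client read with a socket timeout raises SocketTimeoutException.
> import java.io.InputStream;
> import java.net.ServerSocket;
> import java.net.Socket;
> import java.net.SocketTimeoutException;
>
> public class SilentServerDemo {
>   public static void main(String[] args) throws Exception {
>     try (ServerSocket server = new ServerSocket(0)) {
>       Thread acceptor = new Thread(() -> {
>         try {
>           server.accept();               // accept, then ignore the socket
>           Thread.sleep(Long.MAX_VALUE);  // never read or write
>         } catch (Exception ignored) {
>         }
>       });
>       acceptor.setDaemon(true);
>       acceptor.start();
>       try (Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
>         client.setSoTimeout(1000);       // 1-second read timeout
>         InputStream in = client.getInputStream();
>         in.read();                       // blocks: the server never writes
>       } catch (SocketTimeoutException e) {
>         System.out.println("client timed out as expected: " + e);
>       }
>     }
>   }
> }
> {code}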
>     We find where the client gets stuck from the jstack dump:
> {code:java}
> "main" #1 prio=5 os_prio=0 tid=0x00007f5554019800 nid=0x4312 in Object.wait() [0x00007f555d62e000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x000000071cc19ff0> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1551)
>         - locked <0x000000071cc19ff0> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1508)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:910)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         - locked <0x000000071cb435b8> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1671)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1602)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1599)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1614)
>         at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:65)
>         at org.apache.hadoop.fs.Globber.doGlob(Globber.java:294)
>         at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
>         at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2050)
>         at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:353)
>         at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:250)
>         at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:233)
>         at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:104)
>         at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:327)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:390)
> {code}
>     According to org.apache.hadoop.ipc.Client.call(Client.java:1508), the runtime value of timeout in org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1551) is -1, meaning that it waits forever. The only way to notify it is through the callComplete method of Client$Call (Client.java:367), and only two methods in Client$Call invoke callComplete: setException and setRpcResponse.
>     Our expectation is that once the timeout occurs in the client, setException is eventually invoked. Below we trace the workflow that should handle this timeout and point out where the bug is.
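>     For reference, a negative timeout at this level corresponds to an untimed Object.wait(); a minimal sketch of the pattern (simplified for illustration, not the actual AsyncGet$Util code):
> {code:java}
> // Sketch only: how a negative timeout becomes an unbounded wait.
> static void waitForCompletion(Object call, long timeoutMs)
>     throws InterruptedException {
>   synchronized (call) {
>     if (timeoutMs < 0) {
>       call.wait();          // no timeout: blocks until notify()
>     } else {
>       call.wait(timeoutMs); // bounded wait
>     }
>   }
> }
> {code}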
>     The setException method should be invoked by the Client$Connection thread:
> {code:java}
>     @Override
>     public void run() {
>       // ...
>       try {
>         while (waitForWork()) {                       // line 1086
>           receiveRpcResponse();                       // line 1087
>         }
>       } catch (Throwable t) {
>         // ...
>         markClosed(new IOException("Error reading responses", t));
>       }
>       close();                                        // line 1097
>       // ...
>     }
>
>     private void receiveRpcResponse() {
>       // ...
>       try {
>         ByteBuffer bb = ipcStreams.readResponse();    // line 1191
>         // ...
>       } catch (IOException e) {
>         markClosed(e);                                // line 1235
>       }
>     }
> {code}
>     When the timeout happens, the correct workflow is:
>  # Before the I/O, the run method invokes receiveRpcResponse (line 1087), which then invokes readResponse (line 1191)
>  # After the timeout, the readResponse invocation (line 1191) throws java.net.SocketTimeoutException
>  # This exception is caught and handed to markClosed (line 1235)
>  # waitForWork (line 1086) returns false due to markClosed's handling
>  # The close method (line 1097) runs and finally invokes setException, which unblocks the org.apache.hadoop.fs.FsShell.main thread that is currently stuck
>     The bug is within step 2. We confirmed that this Client$Connection thread runs the readResponse invocation (line 1191) forever, without throwing any exception. Here is the jstack dump of this Client$Connection thread:
> {code:java}
> "IPC Client (1390869998) connection to /127.0.0.1:9000 from whz" #16 daemon prio=5 os_prio=0
tid=0x00007f555551e000 nid=0x432b run
> nable [0x00007f5524126000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>         - locked <0x000000071ce57d28> (a sun.nio.ch.Util$3)
>         - locked <0x000000071ce57ca0> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x000000071ce578d8> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x000000071ce85708> (a java.io.BufferedInputStream)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:562)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1881)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1191)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1087)
> {code}
>     By investigating each level of this stack trace, we confirmed that there should be a java.net.SocketTimeoutException thrown by org.apache.hadoop.net.SocketIOWithTimeout.doIO ([https://github.com/apache/hadoop/blob/rel/release-3.2.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/SocketIOWithTimeout.java#L164]), but this exception is swallowed by org.apache.hadoop.ipc.Client$Connection$PingInputStream.read:
> {code:java}
>       public int read() throws IOException {
>         int waiting = 0;
>         do {
>           try {
>             return super.read();                // appear in stack trace
>           } catch (SocketTimeoutException e) {
>             waiting += soTimeout;
>             handleTimeout(e, waiting);          // line 565
>           }
>         } while (true);                         // line 567
>       }
> {code}
>     The handleTimeout invocation (line 565) should rethrow this SocketTimeoutException; otherwise the infinite loop at line 567 runs forever.
>     However, the handleTimeout method fails to enter the branch that throws the exception:
> {code:java}
>      /* Process timeout exception
>        * if the connection is not going to be closed or 
>        * the RPC is not timed out yet, send a ping.
>        */
>       private void handleTimeout(SocketTimeoutException e, int waiting)
>           throws IOException {
>         if (shouldCloseConnection.get() || !running.get() ||
>             (0 < rpcTimeout && rpcTimeout <= waiting)) {        // line 545
>           throw e;
>         } else {
>           sendPing();                                               // line 548
>         }
>       }
> {code}
>     Execution reaches line 548 and sends a ping. The ping succeeds because the connection itself is not broken. In the correct behavior, however, the condition at line 545 would evaluate to true and the exception would be rethrown.
>     The condition at line 545 evaluates to false because the rpcTimeout variable (default value: 0, set by `ipc.client.rpc-timeout.ms` in the default configuration [https://hadoop.apache.org/docs/r3.2.2/hadoop-project-dist/hadoop-common/core-default.xml]) seems to be used incorrectly in handleTimeout. The explanation of `ipc.client.rpc-timeout.ms` in that configuration says: "If ipc.client.ping is set to true and this rpc-timeout is greater than the value of ipc.ping.interval, the effective value of the rpc-timeout is rounded up to multiple of ipc.ping.interval."
>     The rpcTimeout value is used correctly in the constructor of the Client$Connection class:
> {code:java}
>     Connection(ConnectionId remoteId, int serviceClass,
>         Consumer<Connection> removeMethod) {
>       // ...
>       if (rpcTimeout > 0) {
>         // effective rpc timeout is rounded up to multiple of pingInterval
>         // if pingInterval < rpcTimeout.
>         this.soTimeout = (doPing && pingInterval < rpcTimeout) ?
>             pingInterval : rpcTimeout;
>       } else {
>         this.soTimeout = pingInterval;
>       }
>       // ...
>     }
> {code}
>     We have confirmed that if the handleTimeout method uses the soTimeout value prepared by this constructor, the bug is fixed. With the default configuration (rpcTimeout = 0, doPing = true), soTimeout equals pingInterval (60000 ms), so the first SocketTimeoutException makes the condition at line 545 true and the exception is rethrown after roughly 60 seconds.
> *Fix*
>     We propose modifying line 545 of the handleTimeout method of the Client$Connection class so that rpcTimeout is used in the same way the constructor of Client$Connection handles it.
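>     A minimal sketch of one reading of this proposal (hedged; the actual patch is in the linked pull request and may differ):
> {code:java}
>       // Sketch only, not the committed fix: compare the accumulated wait
>       // against soTimeout, the per-read timeout that the constructor
>       // derived from rpcTimeout and pingInterval, instead of rpcTimeout.
>       private void handleTimeout(SocketTimeoutException e, int waiting)
>           throws IOException {
>         if (shouldCloseConnection.get() || !running.get() ||
>             (0 < soTimeout && soTimeout <= waiting)) {    // was: rpcTimeout
>           throw e;
>         } else {
>           sendPing();
>         }
>       }
> {code}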



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

