avro-dev mailing list archives

From "James Baldassari (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AVRO-1292) NettyTransceiver: Client threads can block under certain connection failure scenarios
Date Tue, 09 Apr 2013 22:00:16 GMT

     [ https://issues.apache.org/jira/browse/AVRO-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Baldassari updated AVRO-1292:
-----------------------------------

    Attachment: AVRO-1292-Part2-v2.patch

Noticed one minor issue with Part 2 of the patch.  The correct keep-alive option name for the client bootstrap is {{keepAlive}} rather than {{child.keepAlive}}; the {{child.}}-prefixed form is the corresponding option for the server bootstrap.
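For context, here's a minimal sketch of how the option names differ between the two Netty 3 bootstrap types; the bootstrap setup shown is illustrative and not taken from the patch:

{code}
import java.util.concurrent.Executors;
import org.jboss.netty.bootstrap.ClientBootstrap;
import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

public class KeepAliveOptions {
  public static void main(String[] args) {
    // Client bootstrap: options apply directly to the connecting channel,
    // so TCP keep-alive is enabled with plain "keepAlive".
    ClientBootstrap client = new ClientBootstrap(
        new NioClientSocketChannelFactory(
            Executors.newCachedThreadPool(), Executors.newCachedThreadPool()));
    client.setOption("keepAlive", true);

    // Server bootstrap: options for accepted (child) channels take the
    // "child." prefix, so the equivalent option is "child.keepAlive".
    ServerBootstrap server = new ServerBootstrap(
        new NioServerSocketChannelFactory(
            Executors.newCachedThreadPool(), Executors.newCachedThreadPool()));
    server.setOption("child.keepAlive", true);
  }
}
{code}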
                
> NettyTransceiver: Client threads can block under certain connection failure scenarios
> -------------------------------------------------------------------------------------
>
>                 Key: AVRO-1292
>                 URL: https://issues.apache.org/jira/browse/AVRO-1292
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.4
>            Reporter: James Baldassari
>            Assignee: James Baldassari
>              Labels: avro, ipc, netty
>         Attachments: AVRO-1292-Part1.patch, AVRO-1292-Part2.patch, AVRO-1292-Part2-v2.patch
>
>
> I've recently found a couple of different failure scenarios with NettyTransceiver that result in:
> * Client threads blocking for long periods of time (uninterruptibly, at that) while holding the {{stateLock}} write lock
> * RPCs (either sync or async) never returning because a failure in sending the RPC was not propagated back up to the caller
> The patch I'm going to submit will probably be a lot easier to understand, but I'll try to explain the main problems I found.  There is a single type of underlying connectivity issue that seems to trigger both of these problems in NettyTransceiver: a failure at the network layer causes all packets to be dropped somewhere between the RPC client and server.  You might think this would be a rare scenario, but it has happened several times in our production environment and usually occurs after the RPC server machine becomes unresponsive and needs to be physically rebooted.  The only way I've been able to reproduce this scenario for testing purposes has been to set up an iptables rule on the RPC server that simply drops all incoming packets from the client.  For example, if the client's IP is 10.0.0.1, I would use the following iptables rule on the server to reproduce the failure:
> {code}
> iptables -t mangle -A INPUT --source 10.0.0.1 -j DROP
> {code}
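> To restore connectivity after testing, the same rule can be removed with iptables' {{-D}} flag:
> {code}
> iptables -t mangle -D INPUT --source 10.0.0.1 -j DROP
> {code}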
> After looking through a lot of stack traces I think I've identified 2 main problems:
> *Problem 1:* NettyTransceiver calls {{ChannelFuture#awaitUninterruptibly(long)}} in a couple places, {{getChannel()}} and {{disconnect(boolean,boolean,Throwable)}}.  Under the dropped packet scenario I outlined above, the client thread ends up blocking uninterruptibly for the entire connection timeout duration while holding the {{stateLock}} write lock.  The stack trace for this situation looks like this:
> {code}
> "RPC Executor - 11 - 1363627762930" daemon prio=10 tid=0x00002aaad005f000 nid=0x56cf
in Object.wait() [0x0000000049344000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:443)
>         at org.jboss.netty.channel.DefaultChannelFuture.await0(DefaultChannelFuture.java:265)
>         - locked <0x0000000703acfa00> (a org.jboss.netty.channel.DefaultChannelFuture)
>         at org.jboss.netty.channel.DefaultChannelFuture.awaitUninterruptibly(DefaultChannelFuture.java:237)
>         at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:248)
>         at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:199)
>         at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:148)
> {code}
> At a minimum it should be possible to interrupt these connection attempts.
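> As a sketch of that direction (illustrative only, not the attached patch; the helper name and parameters are hypothetical stand-ins for the fields NettyTransceiver actually uses), the wait could go through {{ChannelFuture#await(long)}}, which throws {{InterruptedException}}:
> {code}
> import java.io.IOException;
> import org.jboss.netty.channel.ChannelFuture;
>
> // Hypothetical helper class, illustrative only.
> final class InterruptibleWaits {
>   static void awaitInterruptibly(ChannelFuture future, long timeoutMillis)
>       throws IOException {
>     try {
>       // Unlike awaitUninterruptibly(long), await(long) throws
>       // InterruptedException, so a stuck connection attempt can be aborted.
>       if (!future.await(timeoutMillis)) {
>         throw new IOException("Timed out after " + timeoutMillis + " ms");
>       }
>       if (!future.isSuccess()) {
>         throw new IOException("Channel operation failed", future.getCause());
>       }
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();  // preserve the interrupt status
>       throw new IOException("Interrupted while waiting on channel", e);
>     }
>   }
> }
> {code}
> This alone doesn't shorten how long the {{stateLock}} write lock is held, but it does make a blocked client thread recoverable.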
> *Problem 2:* When an error occurs writing to the Netty channel, the error is not passed back up the stack or callback chain (whether it's a sync or async RPC), so the client can end up waiting indefinitely for an RPC that will never return because an error occurred sending the Netty packet (i.e. it was never sent to the server in the first place).  This scenario might yield a stack trace like the following:
> {code}
> "main" prio=10 tid=0x00007f9400008800 nid=0x379b waiting on condition [0x00007f9406bc6000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007af677960> (a java.util.concurrent.CountDownLatch$Sync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
>         at org.apache.avro.ipc.CallFuture.await(CallFuture.java:141)
>         at org.apache.avro.ipc.Requestor.request(Requestor.java:150)
>         at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
>         at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
>         at $Proxy9.send(Unknown Source)
> {code}
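> One way to surface such write failures (again a sketch under assumptions, not the attached patch; the class and method names are hypothetical) is to register a {{ChannelFutureListener}} on the write and fail the pending {{org.apache.avro.ipc.Callback}} when the write does not succeed, so {{CallFuture#await()}} completes instead of waiting forever:
> {code}
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.util.List;
> import org.apache.avro.ipc.Callback;
> import org.jboss.netty.channel.Channel;
> import org.jboss.netty.channel.ChannelFuture;
> import org.jboss.netty.channel.ChannelFutureListener;
>
> // Hypothetical helper class, illustrative only.
> final class FailFastWriter {
>   // Write a request and route any write failure to the per-call callback.
>   static void writeWithErrorPropagation(Channel channel, Object request,
>       final Callback<List<ByteBuffer>> callback) {
>     channel.write(request).addListener(new ChannelFutureListener() {
>       public void operationComplete(ChannelFuture future) {
>         if (future.isSuccess()) {
>           return;  // normal path: the response handler completes the call
>         }
>         Throwable cause = future.getCause();
>         callback.handleError(cause != null
>             ? cause
>             : new IOException("Error writing RPC request to channel"));
>       }
>     });
>   }
> }
> {code}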
> It's difficult to provide a unit test for these issues because a connection refused error alone will not trigger them.  The only way I've been able to reliably reproduce them is by setting the iptables rule I mentioned above.  Hopefully a code review will be sufficient, but if necessary I can try to find a way to create a unit test.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
