avro-dev mailing list archives

From "Gareth Davis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1407) NettyTransceiver can cause a infinite loop when slow to connect
Date Fri, 19 Sep 2014 15:43:36 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140754#comment-14140754 ]

Gareth Davis commented on AVRO-1407:

10 months to respond doesn't seem too bad.... sorry.

The channel needs to be closed only on an exception, hence the catch Throwable.  The
core problem is that the constructor allocates resources that aren't reachable if
the constructor fails.
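
To illustrate the shape of the fix, here is a minimal sketch of the pattern. It is not the attached patch and it does not use Netty: FakeChannel and SketchTransceiver are hypothetical stand-ins, and the boolean flag simply simulates a connect that times out. The point is only that the constructor closes what it allocated before rethrowing, since the caller never gets a reference it could close itself.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a channel; not a Netty type.
final class FakeChannel implements Closeable {
    // Counts channels that are currently open, so a leak is observable.
    static final AtomicInteger OPEN = new AtomicInteger();

    FakeChannel() {
        OPEN.incrementAndGet();
    }

    @Override
    public void close() {
        OPEN.decrementAndGet();
    }
}

// Hypothetical transceiver whose constructor both allocates and connects.
final class SketchTransceiver implements Closeable {
    private final FakeChannel channel;

    // connectSucceeds is an assumption used to simulate a connect timeout.
    SketchTransceiver(boolean connectSucceeds) throws IOException {
        FakeChannel ch = new FakeChannel();
        try {
            if (!connectSucceeds) {
                throw new IOException("connect timed out");
            }
        } catch (Throwable t) {
            // The caller never receives this object, so nobody else can
            // close the channel. Release it here, then rethrow unchanged.
            ch.close();
            throw t;
        }
        this.channel = ch;
    }

    @Override
    public void close() {
        channel.close();
    }
}
```

With this in place a failed construction leaves FakeChannel.OPEN at zero, whereas without the catch block the channel would stay open with no owner.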

> NettyTransceiver can cause a infinite loop when slow to connect
> ---------------------------------------------------------------
>                 Key: AVRO-1407
>                 URL: https://issues.apache.org/jira/browse/AVRO-1407
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.5, 1.7.6
>            Reporter: Gareth Davis
>         Attachments: AVRO-1407-1.patch
> When a new {{NettyTransceiver}} is created, it forces the channel to be allocated and
> connected to the remote host. It waits up to connectTimeout ms on the
> [connect channel future|https://github.com/apache/avro/blob/1579ab1ac95731630af58fc303a07c9bf28541d6/lang/java/ipc/src/main/java/org/apache/avro/ipc/NettyTransceiver.java#L271].
> This is obviously a good thing; the problem is that when the connect is unsuccessful,
> i.e. {{!channelFuture.isSuccess()}}, an exception is thrown and the call to the constructor
> fails with an {{IOException}}, but it can leave an active channel associated with the
> {{ChannelFactory}}.
> The problem is that a Netty {{NioClientSocketChannelFactory}} will not shut down if there
> are active channels still around, and if you have supplied the {{ChannelFactory}} to the
> {{NettyTransceiver}} then you will not be able to cancel it by calling
> {{ChannelFactory.releaseExternalResources()}}, as the
> [Flume Avro RPC client does|https://github.com/apache/flume/blob/b8cf789b8509b1e5be05dd0b0b16c5d9af9698ae/flume-ng-sdk/src/main/java/org/apache/flume/api/NettyAvroRpcClient.java#L158].
> In order to recreate this you need a very laggy network, where the connect attempt takes
> longer than the connect timeout but does actually succeed. This is very hard to organise
> in a test case, although I do have a test setup using Vagrant VMs that recreates this
> every time, using the Flume RPC client and server.
> The following stack is from a production system. It won't ever recover until the
> channel is disconnected (by forcing a disconnect at the remote host) or the JVM is restarted.
> {noformat:title=Production stack trace}
> "TLOG-0" daemon prio=10 tid=0x00007f581c7be800 nid=0x39a1 waiting on condition [0x00007f57ef9f2000]
>   java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   parking to wait for <0x00000007218b16e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
>   at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
>   at org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:103)
>   at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.releaseExternalResources(AbstractNioWorkerPool.java:80)
>   at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:181)
>   at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:142)
>   at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:101)
>   at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:564)
>   locked <0x00000006c30ae7b0> (a org.apache.flume.api.NettyAvroRpcClient)
>   at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
>   at org.apache.flume.api.LoadBalancingRpcClient.createClient(LoadBalancingRpcClient.java:214)
>   at org.apache.flume.api.LoadBalancingRpcClient.getClient(LoadBalancingRpcClient.java:205)
>   locked <0x00000006a97b18e8> (a org.apache.flume.api.LoadBalancingRpcClient)
>   at org.apache.flume.api.LoadBalancingRpcClient.appendBatch(LoadBalancingRpcClient.java:95)
>   at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:45)
>   at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:43)
> {noformat}
> The solution is very simple, and a patch should be along in a moment.

This message was sent by Atlassian JIRA
