avro-dev mailing list archives

From "James Baldassari (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1013) NettyTransceiver can hang after server restart
Date Mon, 30 Jan 2012 00:33:10 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195894#comment-13195894 ]

James Baldassari commented on AVRO-1013:
----------------------------------------

The second change I described to NettyTransceiver#isConnected() actually broke a bunch of
unit tests, so I'm just going to leave that method unchanged.
                
> NettyTransceiver can hang after server restart
> ----------------------------------------------
>
>                 Key: AVRO-1013
>                 URL: https://issues.apache.org/jira/browse/AVRO-1013
>             Project: Avro
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: James Baldassari
>            Priority: Blocker
>
> I ran into a very specific scenario today which can lead to NettyTransceiver hanging
> indefinitely:
> # Start up a NettyServer
> # Initialize a NettyTransceiver and SpecificRequestor
> # Execute an RPC to establish the connection/handshake with the server
> # Shut down the server
> # Immediately execute another RPC (a reproduction sketch follows this list)
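
For reference, a minimal reproduction sketch of the five steps above. The Avro IPC classes
(NettyServer, NettyTransceiver, SpecificResponder, SpecificRequestor) are real, but the
MyProtocol/MyProtocolImpl protocol, its echo() message, and the port are hypothetical
placeholders:

    import java.net.InetSocketAddress;
    import org.apache.avro.ipc.NettyServer;
    import org.apache.avro.ipc.NettyTransceiver;
    import org.apache.avro.ipc.Server;
    import org.apache.avro.ipc.specific.SpecificRequestor;
    import org.apache.avro.ipc.specific.SpecificResponder;

    public class Avro1013Repro {
        public static void main(String[] args) throws Exception {
            // Step 1: start a NettyServer (MyProtocol/MyProtocolImpl are hypothetical)
            Server server = new NettyServer(
                new SpecificResponder(MyProtocol.class, new MyProtocolImpl()),
                new InetSocketAddress(65111));

            // Step 2: initialize a NettyTransceiver and SpecificRequestor
            NettyTransceiver client =
                new NettyTransceiver(new InetSocketAddress(65111));
            MyProtocol proxy = SpecificRequestor.getClient(MyProtocol.class, client);

            proxy.echo("hello");  // Step 3: first RPC performs the handshake
            server.close();       // Step 4: shut down the server
            proxy.echo("again");  // Step 5: may hang forever instead of failing fast
        }
    }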
> After Step 4, NettyTransceiver will detect that the connection has been closed and call
> NettyTransceiver#disconnect(boolean, boolean, Throwable), which sets 'remote' to null,
> indicating to Requestor that the NettyTransceiver is now disconnected. However, if an RPC
> is executed just after the server has closed its socket (Step 5) and before disconnect()
> has been called, NettyTransceiver may still try to send this RPC because 'remote' has not
> yet been set to null. This race condition is normally ok because
> NettyTransceiver#getChannel() will detect that the socket has been closed and then try to
> reestablish the connection. Unfortunately, in this scenario getChannel() blocks forever
> when it attempts to acquire the write lock because the read lock has been acquired twice
> rather than once as getChannel() expects. The read lock is acquired once by
> transceive(List<ByteBuffer>, Callback<List<ByteBuffer>>) and again by
> writeDataPack(NettyDataPack).
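
The deadlock can be reproduced in isolation with java.util.concurrent.locks. The sketch
below only mimics the lock discipline described above; the comments map each step to the
NettyTransceiver code path it stands in for:

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class LockUpgradeDeadlock {
        public static void main(String[] args) {
            ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();

            stateLock.readLock().lock();   // transceive() acquires the read lock
            stateLock.readLock().lock();   // writeDataPack() acquires it again

            // getChannel() expects a single read-lock hold: it releases once
            // and then upgrades to the write lock to reconnect...
            stateLock.readLock().unlock(); // ...but this releases only one of two holds

            stateLock.writeLock().lock();  // blocks forever: this thread still
                                           // holds the read lock, and
                                           // ReentrantReadWriteLock cannot upgrade
            System.out.println("never reached");
        }
    }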
> The fix is fairly simple. The writeDataPack(NettyDataPack) method (which is private) no
> longer acquires the read lock itself but instead specifies in its contract that the read
> lock must be acquired before calling it. This change prevents the read lock from being
> acquired more than once by any single thread. Another change is to have
> NettyTransceiver#isConnected() perform two checks instead of one:
> remote != null && isChannelReady(channel). This second change should allow
> NettyTransceiver to detect disconnect events more quickly.
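
Sketched below are the two changes, assuming the transceiver guards its state with a
ReentrantReadWriteLock. The names 'stateLock', 'remote', 'channel', and isChannelReady()
are taken from the description above; the method bodies are illustrative rather than the
actual patch. Note that per the comment at the top of this message, the second change was
later dropped because it broke unit tests:

    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import org.apache.avro.Protocol;
    import org.apache.avro.ipc.NettyTransportCodec.NettyDataPack;
    import org.jboss.netty.channel.Channel;

    abstract class PatchedTransceiverSketch {
        private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
        private volatile Protocol remote;  // null means disconnected
        private volatile Channel channel;

        // Change 1: writeDataPack() no longer takes the read lock itself.
        // By contract the caller must already hold stateLock.readLock(),
        // so no single thread ever holds the read lock more than once.
        void writeDataPack(NettyDataPack dataPack) {
            channel.write(dataPack);
        }

        // Change 2 (later abandoned): isConnected() also checks the channel,
        // so a dropped connection is noticed before disconnect() has run.
        public boolean isConnected() {
            stateLock.readLock().lock();
            try {
                return remote != null && isChannelReady(channel);
            } finally {
                stateLock.readLock().unlock();
            }
        }

        abstract boolean isChannelReady(Channel c);
    }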

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
