giraph-dev mailing list archives

From Avery Ching <ach...@apache.org>
Subject Re: [jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning
Date Wed, 15 Aug 2012 22:54:41 GMT
Yes, this will happen, but should be okay, since the connect retries 
will take care of it (hopefully). This already happened with the old 
code (as you mentioned).

I'm also working on a more robust implementation that will retry failed
requests going forward (and re-establish broken connections).
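
For illustration only, here is a minimal sketch of the "try multiple
connection attempts up to n failures" idea. It is not the GIRAPH-300 patch
and does not use Netty: the class ConnectRetry, the method retry, and the
parameters maxAttempts and backoffMs are hypothetical names, and the
connect step is modeled as a plain Callable.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not Giraph code): retry an action up to maxAttempts times. */
final class ConnectRetry {
    private ConnectRetry() { }

    /**
     * Runs the attempt until it succeeds or maxAttempts failures have occurred.
     * Sleeps backoffMs milliseconds between failed attempts.
     */
    static <T> T retry(Callable<T> attempt, int maxAttempts, long backoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Remember the failure (e.g. a ConnectException) and try again.
                last = e;
                if (i < maxAttempts) {
                    Thread.sleep(backoffMs);
                }
            }
        }
        // Every attempt failed; keep the last failure as the cause.
        throw new IllegalStateException(
            "Gave up after " + maxAttempts + " attempts", last);
    }
}
```

A caller would wrap its own connect logic in the Callable, so a timed-out
connect like the one in the trace below is retried instead of failing the
worker on the first attempt. Once the attempts are exhausted, the last
exception is preserved as the cause.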

Avery

On 8/15/12 3:04 PM, Eli Reisman (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435564#comment-13435564 ]
>
> Eli Reisman commented on GIRAPH-300:
> ------------------------------------
>
> Getting errors like this during input superstep on about 20% of my workers,
> happens on small and large jobs. This happened before this patch got
> committed, but seems to be happening now too. Anyone seeing this on your runs?
>
>
> Aug 15, 2012 9:55:25 PM org.jboss.netty.channel.DefaultChannelPipeline
> WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x48433545] EXCEPTION: java.net.ConnectException: Connection timed out)
> java.lang.IllegalStateException: exceptionCaught: Channel failed with remote address null
> 	at org.apache.giraph.comm.ResponseClientHandler.exceptionCaught(ResponseClientHandler.java:107)
> 	at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:244)
> 	at org.apache.giraph.comm.ByteCounter.handleUpstream(ByteCounter.java:61)
> 	at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:426)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:406)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:362)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:284)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.ConnectException: Connection timed out
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:400)
> 	... 5 more
>
>                  
>> Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning
>> ------------------------------------------------------------------------------------------------------------
>>
>>                  Key: GIRAPH-300
>>                  URL: https://issues.apache.org/jira/browse/GIRAPH-300
>>              Project: Giraph
>>           Issue Type: Improvement
>>             Reporter: Avery Ching
>>             Assignee: Avery Ching
>>          Attachments: GIRAPH-300.2.patch, GIRAPH-300.patch
>>
>>
>> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
>> * Try multiple connection attempts up to n failures
>> * Track requests throughout the system by keeping track of the request id and then
>>   matching the request id to the response (minor refactoring of WritableRequest to
>>   make requests simpler and support the request id)
>> * Improved handling of netty exceptions by dumping the exception stack to help debug
>>   failures
>> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes
>>   divide by zero exceptions in real life)
>> Currently, netty connection failures cause issues with more than 75 workers in my
>> setup.  This allows us to reach 200+ workers in a reasonably reliable network that
>> doesn't kill connections.
>> This code passes the local Hadoop regressions and the single node Hadoop instance
>> regressions.  It also succeeded on large runs (200+ workers) on a real Hadoop cluster.
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>          

