Mailing-List: contact dev-help@giraph.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@giraph.apache.org
Date: Mon, 20 Aug 2012 10:28:38 +1100 (NCT)
From: "Avery Ching (JIRA)" <jira@apache.org>
To: giraph-dev@incubator.apache.org
Message-ID: <987336935.28593.1345418918109.JavaMail.jiratomcat@arcas>
In-Reply-To: <2018772605.26354.1345276357966.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (GIRAPH-306) Netty requests should be reliable
 and implement exactly once semantics
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437607#comment-13437607 ] 

Avery Ching commented on GIRAPH-306:
------------------------------------

>Yeah, that was the impression I had too. Just to clarify: as of the recent Netty upgrades + this one, we are in no way attempting to handle worker restarts with any grace, right? This is all purely connection reliability for healthy worker nodes?

Yeah, this is purely for reliability of connections and requests, nothing else.

>I am having a lot more trouble scaling out to more workers than I used to. I know you guys had mentioned this, but I had not been testing again until the last few days and it's definitely gotten trickier, not least because I'm having trouble getting logs to see what happened during a failure. I don't have dumps saved from those jobs, but if I see more I will put them here.

Here's a trick you can try.  Add -Dmapred.map.max.attempts=1 so that no map task is retried and the first task failure fails the job.  Then you can look at the logs of the failed task and try to figure out what the problem is.

>Mostly the logs I get are reconnection logs after reincarnation in which they all fail (of course) and no logs for the failed portion of the run that triggered the worker to reincarnate.

The above should help us narrow down your problem.  =)
                
> Netty requests should be reliable and implement exactly once semantics
> ----------------------------------------------------------------------
>
>                 Key: GIRAPH-306
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Priority: Critical
>         Attachments: GIRAPH-306.patch
>
>
> One of the biggest scalability challenges is getting Giraph to run reliably on a large number of tasks (i.e. > 200).  Several problems exist:
> 1) If the connection fails after the initial connection was made, the job will die.
> 2) Requests must be completed exactly once.  This is difficult to implement, but required since we cannot have multiple retried requests succeed (e.g. a vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> This patch addresses these issues by re-establishing failed connections and keeping track of every request sent to every worker.  If a request fails or passes a timeout, it will be resent.  The server will keep track of requests that succeeded to ensure that the same request won't be processed more than once.  The structure for keeping track of the succeeded requests on the server is efficient for handling increasing request ids (IncreasingBitSet).  For handling unresolved addresses, I added retry logic that keeps trying to resolve the address.
> This patch also adds several unit tests that use fault injection to simulate a lost response or a closed channel exception on the server.  It also has unit tests for IncreasingBitSet to ensure it is working correctly and efficiently.
> This passes all unit tests (including the new ones).  Additionally, I have some experimental results as well.
> Previously, I was unable to run reliably with more than 200 workers.  With this change I can reliably run 500+ workers.  I also ran with 600 workers successfully.  This is a really big reliability win for us.
> I can see the code working to do reconnections and re-issue requests when necessary.  It's very cool.
> For example:
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
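
The server-side duplicate check described in the issue above can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the IncreasingBitSet class from the GIRAPH-306 patch: the class name SeenRequestIds and the methods markSeen and getBase are made up here. It assumes request ids from one client are non-negative and mostly increasing, so the set of already seen ids can be kept as a low-water mark plus a small bit window.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of exactly-once bookkeeping for increasing request ids.
 * Not Giraph code; names and behavior are assumptions for illustration.
 */
class SeenRequestIds {
    /** Every id strictly below this value has already been seen. */
    private long base = 0;
    /** Bit i is set when id (base + i) has been seen. */
    private final BitSet window = new BitSet();

    /**
     * Record a request id.
     *
     * @return true if the id is new (the request should be processed),
     *         false if it is a duplicate (the request should be skipped)
     */
    public synchronized boolean markSeen(long requestId) {
        if (requestId < base) {
            // Already folded into the "everything below base" prefix.
            return false;
        }
        long offset = requestId - base;
        if (offset > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "request id too far ahead of window: " + requestId);
        }
        int bit = (int) offset;
        if (window.get(bit)) {
            return false;
        }
        window.set(bit);
        // Slide the window past the contiguous run of seen ids at its front,
        // so memory stays small when ids arrive roughly in order.
        int firstUnseen = window.nextClearBit(0);
        if (firstUnseen > 0) {
            // BitSet.get(from, to) copies the range re-indexed from zero.
            BitSet shifted = window.get(firstUnseen,
                Math.max(firstUnseen, window.length()));
            window.clear();
            window.or(shifted);
            base += firstUnseen;
        }
        return true;
    }

    /** Lowest id that has not yet been folded into the seen prefix. */
    public synchronized long getBase() {
        return base;
    }

    public static void main(String[] args) {
        SeenRequestIds seen = new SeenRequestIds();
        // Simulated arrivals, including retries of ids 1 and 3.
        long[] arrivals = {0, 1, 1, 3, 2, 3};
        for (long id : arrivals) {
            String outcome = seen.markSeen(id) ? "processed" : "duplicate, skipped";
            System.out.println("request " + id + ": " + outcome);
        }
    }
}
```

In this sketch a server would call markSeen(id) before applying a request. For a duplicate it would skip the work but still acknowledge it, so that the client stops resending.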
