giraph-dev mailing list archives

From Eli Reisman <initialcont...@gmail.com>
Subject Re: [jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics
Date Mon, 20 Aug 2012 18:06:28 GMT
Damn, I wish I'd read this last night! Thanks for the tip. I will try
that. I am finding that when I (optimistically!) test at the low memory
levels, the logs just don't make it to me at all. As I ramp it up a bit, I
finally start to get them again. This is why I didn't know whether the
246-NEW-FIX-2 patch was working or not on Friday. I see now it's Netty
connection errors (timed out, host to connect to == null, etc.) and simple
GC OutOfMemory exceptions from Netty pipeline handlers most of the time.

Still, Giraph is now more resilient: existing MR jobs on the cluster can
come and go without causing us to fail. This is real progress. The Netty
problems will settle out as we find ways to configure the new improvements
to work for us here, I'm sure. In general Giraph is running great now.
Keep it up!

I love the bit set idea too. I have heard the standard Java implementation
is not so hot. Is there an alternate library (or maybe we can build one
directly into the class) that would be lower profile? Anyway, all of this
stuff seems like required pieces for Netty to be reliable. Great work.


On Sun, Aug 19, 2012 at 2:48 PM, Eli Reisman (JIRA) <jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437585#comment-13437585]
>
> Eli Reisman commented on GIRAPH-306:
> ------------------------------------
>
> Yeah, that was the impression I had too. Just to clarify, as of the recent
> Netty upgrades + this one, we are in no way attempting to handle worker
> restarts with any grace, right? This is all purely connection reliability
> for healthy worker nodes?
>
> I am having a lot more trouble scaling out to more workers than I used to.
> I know you guys had mentioned this, but I have not been testing again until
> the last few days and it's definitely gotten trickier, not least because
> I'm having trouble getting logs to see what happened during a fail. I
> didn't save dumps from those jobs, but if I see more I will put them here.
>
> Mostly the logs I get are reconnection logs after reincarnation, in which
> they all fail (of course), with no logs for the failed portion of the run
> that triggered the worker to reincarnate.
>
>
> > Netty requests should be reliable and implement exactly once semantics
> > ----------------------------------------------------------------------
> >
> >                 Key: GIRAPH-306
> >                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
> >             Project: Giraph
> >          Issue Type: Improvement
> >            Reporter: Avery Ching
> >            Priority: Critical
> >         Attachments: GIRAPH-306.patch
> >
> >
> > One of the biggest scalability challenges is getting Giraph to run
> reliably on a large number of tasks (i.e. > 200).  Several problems exist:
> > 1) If the connection fails after the initial connection was made, the
> job will die.
> > 2) Requests must be completed exactly once.  This is difficult to
> implement, but required since we cannot have multiple retried requests
> succeed (i.e. a vertex gets more messages than expected).
> > 3) Sometimes there are unresolved addresses, causing failure.
> > This patch addresses these issues by re-establishing failed connections
> and keeping track of every request sent to every worker.  If the request
> fails or passes a timeout, it will be resent.  The server will keep track
> of requests that succeeded to ensure that the same request won't be
> processed more than once.  The structure for keeping track of the succeeded
> requests on the server is efficient for handling increasing request ids
> (IncreasingBitSet).  For handling unresolved addresses, I added retry logic
> to keep trying to resolve the problem.
> > This patch also adds several unit tests that use fault injection to
> simulate a lost response or a closed channel exception on the server.  It
> also has unit tests for IncreasingBitSet to ensure it is working correctly
> and efficiently.
> > This passes all unit tests (including the new ones).  Additionally, I
> have some experimental results as well.
> > Previously, I was unable to run reliably with more than 200 workers.
>  With this change I can reliably run 500+ workers.  I also ran with 600
> workers successfully.  This is a really big reliability win for us.
> > I can see the code working to do reconnections and re-issue requests
> when necessary.  It's very cool.
> > I.e.
> > 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> > 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> > 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> > 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
>

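The server-side bookkeeping described in the quoted issue (remembering which
request ids already succeeded, so that a resent request is not processed a
second time) can be sketched roughly as follows. This is only an
illustration and is not the IncreasingBitSet from the GIRAPH-306 patch: the
class name SeenRequestIds and its methods are made up for this example. It
assumes each sender hands out non-negative request ids that mostly increase,
so the server would keep one tracker per sender and only needs bits for the
ids that arrived out of order.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of duplicate detection for mostly increasing request
 * ids. Not Giraph's IncreasingBitSet; names and behavior are assumptions.
 */
final class SeenRequestIds {
    /** Every id strictly below this value has already been seen. */
    private long base = 0;
    /** Bit i set means id (base + i) has already been seen. */
    private BitSet window = new BitSet();

    /**
     * Records an id. Returns true if the id is new (the request should be
     * processed) and false if it is a duplicate (a resend to be ignored).
     */
    public synchronized boolean markSeen(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("negative request id: " + id);
        }
        if (id < base) {
            return false; // already covered by the contiguous prefix
        }
        long offset = id - base;
        if (offset >= Integer.MAX_VALUE) {
            throw new IllegalArgumentException("id too far ahead: " + id);
        }
        int bit = (int) offset;
        if (window.get(bit)) {
            return false; // seen earlier, out of order
        }
        window.set(bit);
        // Drop the contiguous run of seen ids at the bottom of the window so
        // memory stays proportional to the out-of-order gap, not to the total
        // number of requests. (Shifting on every call is simple, not fast.)
        int firstClear = window.nextClearBit(0);
        if (firstClear > 0) {
            base += firstClear;
            window = window.get(firstClear, Math.max(firstClear, window.length()));
        }
        return true;
    }

    /** Smallest id that is not yet covered by the contiguous prefix. */
    public synchronized long base() {
        return base;
    }

    /** Number of bits currently needed for out-of-order ids. */
    public synchronized int windowBits() {
        return window.length();
    }
}
```

A request handler would call markSeen(requestId) before applying a request
and skip the work (while still acknowledging) when it returns false. Built
on java.util.BitSet this is not necessarily "lower profile" than the
standard implementation; it only shows the sliding-prefix idea.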