giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eli Reisman <initialcont...@gmail.com>
Subject Re: Review Request: GIRAPH-306: Netty requests should be reliable and implement exactly once semantics
Date Wed, 22 Aug 2012 04:21:11 GMT
I found there was a bit more pressure on memory and a bit more time to
finish the job, but not an outlandish amount of either.


On Tue, Aug 21, 2012 at 9:53 AM, Alessandro Presta <alessandro@fb.com>wrote:

>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6687/#review10573
> -----------------------------------------------------------
>
>
> Really cool! It looks like some network failures are unavoidable scaling
> up, so resending requests on one end and ensuring unicity on the other
> seems a solid approach.
> From your numbers, it's a pretty dramatic step up and I can't wait to try
> it.
>
> Is there any significant impact on jobs that wouldn't require the
> additional precautions? In other words, if you take a job that succeeds
> with current trunk, do the communication-heavy phases take considerably
> longer?
>
> Other than that, only some comments about maintainability.
>
> Thanks!
>
>
>
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
> <https://reviews.apache.org/r/6687/#comment22634>
>
>     Unrelated to your changes, but I think these should go in GiraphJob
> like all other config options, even if Netty-specific.
>
>
>
>
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
> <https://reviews.apache.org/r/6687/#comment22635>
>
>     Can you put a more explanatory comment here? You're just making the
> previously hardcoded WAITING_REQUEST_MSECS parameter configurable, right?
>     So maybe something like "Waiting interval for outstanding requests to
> finish".
>
>
>
>
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
> <https://reviews.apache.org/r/6687/#comment22636>
>
>     I would turn this 13 into a named constant and briefly explain how you
> get to this number, otherwise this is difficult to maintain.
>
>
>
>
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
> <https://reviews.apache.org/r/6687/#comment22642>
>
>     Well spotted! No reason to synchronize the entire while loop.
>
>
>
>
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
> <https://reviews.apache.org/r/6687/#comment22644>
>
>     I normally don't like having test code paths in non-test classes. Is
> there absolutely no way around this?
>
>
>
>
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
> <https://reviews.apache.org/r/6687/#comment22645>
>
>     Typo (missing blank).
>
>
>
>
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java
> <https://reviews.apache.org/r/6687/#comment22649>
>
>     Just a matter of taste, but I think we could use a more meaningful
> name here.
>     Some ideas: GlobalRequestId (as this is a global, unique identifier),
> WorkerRequestIdPair (super-explicit).
>     Also I would specify in the comments that we're referring to the
> destination worker.
>
>
> - Alessandro Presta
>
>
> On Aug. 20, 2012, 6:47 p.m., Avery Ching wrote:
> >
> > -----------------------------------------------------------
> > This is an automatically generated e-mail. To reply, visit:
> > https://reviews.apache.org/r/6687/
> > -----------------------------------------------------------
> >
> > (Updated Aug. 20, 2012, 6:47 p.m.)
> >
> >
> > Review request for giraph.
> >
> >
> > Description
> > -------
> >
> > One of the biggest scalability challenges is getting Giraph to run
> reliably on a large number of tasks (i.e. > 200). Several problems exist:
> >
> > 1) If the connection fails after the initial connection was made, the
> job will die.
> > 2) Requests must be completed exactly once. This is difficult to
> implement, but required since we cannot have multiple retried requests
> succeed (i.e. a vertex gets more messages than expected).
> > 3) Sometimes there are unresolved addresses, causing failure.
> >
> > This patch addresses these issues by re-establishing failed connections
> and keep tracking of every request sent to every worker. If the request
> fails or passes a timeout, it will be resent. The server will keep track of
> requests that succeeded to insure that the same request won't be processed
> more than once. The structure for keeping track of the succeeded requests
> on the server is efficient for handling increasing request ids
> (IncreasingBitSet). For handling unresolved addresses, I added retry logic
> to keep trying to resolve the problem.
> >
> > This patch also adds several unit tests that use fault injection to
> simulate a lost response or a closed channel exception on the server. It
> also has unittests for IncreasingBitSet to insure it is working correctly
> and efficiently.
> >
> > This passes all unittests (including the new ones). Additionally, I have
> some experience results as well.
> >
> > Previously, I was unable to run reliably with more than 200 workers.
> With this change I can reliably run 500+ workers. I also ran with 600
> workers successfully. This is a really big reliability win for us.
> >
> > I can see the code working to do reconnections and re-issue requests
> when necessary. It's very cool.
> >
> > I.e.
> >
> > 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> > 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> > 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> > 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
> >
> >
> > This addresses bug GIRAPH-306.
> >     https://issues.apache.org/jira/browse/GIRAPH-306
> >
> >
> > Diffs
> > -----
> >
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.javaPRE-CREATION
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.javaPRE-CREATION
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.javaPRE-CREATION
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.javaPRE-CREATION
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java1374192
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.javaPRE-CREATION
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.javaPRE-CREATION
> >
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java1374192
> >
> > Diff: https://reviews.apache.org/r/6687/diff/
> >
> >
> > Testing
> > -------
> >
> > mvn clean verify
> > Lots of large test 500-600 workers with PageRankBenchmark
> >
> >
> > Thanks,
> >
> > Avery Ching
> >
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message