giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alessandro Presta" <alessan...@fb.com>
Subject Re: Review Request: GIRAPH-306: Netty requests should be reliable and implement exactly once semantics
Date Tue, 21 Aug 2012 16:53:01 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6687/#review10573
-----------------------------------------------------------


Really cool! It looks like some network failures are unavoidable scaling up, so resending
requests on one end and ensuring unicity on the other seems a solid approach.
>From your numbers, it's a pretty dramatic step up and I can't wait to try it.

Is there any significant impact on jobs that wouldn't require the additional precautions?
In other words, if you take a job that succeeds with current trunk, do the communication-heavy
phases take considerably longer?

Other than that, only some comments about maintainability.

Thanks!


http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
<https://reviews.apache.org/r/6687/#comment22634>

    Unrelated to your changes, but I think these should go in GiraphJob like all other config
options, even if Netty-specific.



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
<https://reviews.apache.org/r/6687/#comment22635>

    Can you put a more explanatory comment here? You're just making the previously hardcoded
WAITING_REQUEST_MSECS parameter configurable, right?
    So maybe something like "Waiting interval for outstanding requests to finish".



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
<https://reviews.apache.org/r/6687/#comment22636>

    I would turn this 13 into a named constant and briefly explain how you get to this number,
otherwise this is difficult to maintain.



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
<https://reviews.apache.org/r/6687/#comment22642>

    Well spotted! No reason to synchronize the entire while loop.



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
<https://reviews.apache.org/r/6687/#comment22644>

    I normally don't like having test code paths in non-test classes. Is there absolutely
no way around this?



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
<https://reviews.apache.org/r/6687/#comment22645>

    Typo (missing blank).



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java
<https://reviews.apache.org/r/6687/#comment22649>

    Just a matter of taste, but I think we could use a more meaningful name here.
    Some ideas: GlobalRequestId (as this is a global, unique identifier), WorkerRequestIdPair
(super-explicit).
    Also I would specify in the comments that we're referring to the destination worker.


- Alessandro Presta


On Aug. 20, 2012, 6:47 p.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6687/
> -----------------------------------------------------------
> 
> (Updated Aug. 20, 2012, 6:47 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> One of the biggest scalability challenges is getting Giraph to run reliably on a large
number of tasks (i.e. > 200). Several problems exist:
> 
> 1) If the connection fails after the initial connection was made, the job will die.
> 2) Requests must be completed exactly once. This is difficult to implement, but required
since we cannot have multiple retried requests succeed (i.e. a vertex gets more messages than
expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> 
> This patch addresses these issues by re-establishing failed connections and keep tracking
of every request sent to every worker. If the request fails or passes a timeout, it will be
resent. The server will keep track of requests that succeeded to insure that the same request
won't be processed more than once. The structure for keeping track of the succeeded requests
on the server is efficient for handling increasing request ids (IncreasingBitSet). For handling
unresolved addresses, I added retry logic to keep trying to resolve the problem.
> 
> This patch also adds several unit tests that use fault injection to simulate a lost response
or a closed channel exception on the server. It also has unittests for IncreasingBitSet to
insure it is working correctly and efficiently.
> 
> This passes all unittests (including the new ones). Additionally, I have some experience
results as well.
> 
> Previously, I was unable to run reliably with more than 200 workers. With this change
I can reliably run 500+ workers. I also ran with 600 workers successfully. This is a really
big reliability win for us.
> 
> I can see the code working to do reconnections and re-issue requests when necessary.
It's very cool.
> 
> I.e.
> 
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel:
Fixing disconnected channel to xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel:
Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel:
Fixing disconnected channel to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel:
Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
> 
> 
> This addresses bug GIRAPH-306.
>     https://issues.apache.org/jira/browse/GIRAPH-306
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java
PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java
PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java
PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java
PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
1374192 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java
PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java
PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java
1374192 
> 
> Diff: https://reviews.apache.org/r/6687/diff/
> 
> 
> Testing
> -------
> 
> mvn clean verify
> Lots of large test 500-600 workers with PageRankBenchmark
> 
> 
> Thanks,
> 
> Avery Ching
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message