giraph-dev mailing list archives

From "Avery Ching" <avery.ch...@gmail.com>
Subject Re: Review Request: GIRAPH-306: Netty requests should be reliable and implement exactly once semantics
Date Tue, 21 Aug 2012 19:47:21 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6687/
-----------------------------------------------------------

(Updated Aug. 21, 2012, 7:47 p.m.)


Review request for giraph.


Changes
-------

Addressed Alessandro's suggestions.


Description
-------

One of the biggest scalability challenges is getting Giraph to run reliably on a large number
of tasks (i.e. > 200). Several problems exist:

1) If the connection fails after the initial connection was made, the job will die.
2) Requests must be completed exactly once. This is difficult to implement, but required since
we cannot let a retried request succeed more than once (e.g. a vertex would get more messages
than expected).
3) Sometimes there are unresolved addresses, causing failure.

This patch addresses these issues by re-establishing failed connections and keeping track
of every request sent to every worker. If a request fails or exceeds a timeout, it will be
resent. The server keeps track of the requests that succeeded to ensure that the same request
won't be processed more than once. The structure for keeping track of the succeeded requests
on the server is efficient for handling increasing request ids (IncreasingBitSet). For handling
unresolved addresses, I added retry logic that keeps trying to resolve the address.
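
To illustrate the idea of tracking mostly increasing request ids compactly, here is a
minimal sketch. It is not the IncreasingBitSet from the patch; the class name SlidingIdSet
and its methods are hypothetical. It assumes ids start at 0 and arrive roughly in order, so
a moving base plus a small bit window is enough to detect duplicates. It compacts on every
add, which keeps the example short but is not tuned for speed.

```java
import java.util.BitSet;

/** Hypothetical sketch of duplicate detection for mostly increasing ids. */
final class SlidingIdSet {
  /** Every id below base has already been seen. */
  private long base = 0;
  /** Bit i set means id (base + i) has been seen. */
  private BitSet window = new BitSet();

  /**
   * Record an id.
   *
   * @return true if the id is new (process the request),
   *         false if it was seen before (skip the request)
   */
  synchronized boolean add(long id) {
    if (id < 0) {
      throw new IllegalArgumentException("id must be >= 0, got " + id);
    }
    if (id < base) {
      return false; // Already covered by the compacted prefix
    }
    long offset = id - base;
    if (offset > Integer.MAX_VALUE - 1) {
      throw new IllegalStateException("id " + id + " is too far ahead of base " + base);
    }
    if (window.get((int) offset)) {
      return false; // Duplicate inside the window
    }
    window.set((int) offset);
    compact();
    return true;
  }

  /** Fold the contiguous run of seen ids at the front of the window into base. */
  private void compact() {
    int firstMissing = window.nextClearBit(0);
    if (firstMissing > 0) {
      // BitSet.get(from, to) returns a copy re-based so that 'from' becomes bit 0
      window = window.get(firstMissing, window.length());
      base += firstMissing;
    }
  }

  synchronized long base() {
    return base;
  }

  synchronized int windowBits() {
    return window.length();
  }
}
```

With in-order ids the window stays empty and only the base advances, so memory use is
bounded by how far out of order the requests arrive rather than by the total request count.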

This patch also adds several unit tests that use fault injection to simulate a lost response
or a closed channel exception on the server. It also has unit tests for IncreasingBitSet to
ensure it is working correctly and efficiently.
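
As a rough illustration of the lost-response case, the following sketch injects dropped
responses and checks that a resent request is applied only once. It is not the
RequestFailureTest from the patch and uses no Netty. All names here (LostResponseSketch,
Server, sendWithRetry) are hypothetical, and a plain HashSet stands in for the server-side
tracking structure.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory model of resend plus server-side de-duplication. */
final class LostResponseSketch {

  /** Server that applies each request id at most once. */
  static final class Server {
    private final Set<Long> done = new HashSet<>();
    private int applied = 0;
    /** Fault injection: number of responses still to be "lost". */
    private int responsesToDrop;

    Server(int responsesToDrop) {
      this.responsesToDrop = responsesToDrop;
    }

    /** @return true if the response reaches the client, false if it is lost */
    boolean handle(long requestId) {
      if (done.add(requestId)) {
        applied++; // First time this id is seen: do the work
      }
      // Duplicates are acknowledged without being applied again
      if (responsesToDrop > 0) {
        responsesToDrop--;
        return false;
      }
      return true;
    }

    int applied() {
      return applied;
    }
  }

  /** Resend until a response arrives; returns the number of attempts used. */
  static int sendWithRetry(Server server, long requestId, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (server.handle(requestId)) {
        return attempt;
      }
    }
    throw new IllegalStateException("no response after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) {
    Server server = new Server(2); // Lose the first two responses
    int attempts = sendWithRetry(server, 7L, 5);
    // Prints: attempts=3, applied=1
    System.out.println("attempts=" + attempts + ", applied=" + server.applied());
  }
}
```

The client needs three attempts because two responses are dropped, but the server applies
the request only on the first attempt.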

This passes all unit tests (including the new ones). Additionally, I have some experimental
results as well.

Previously, I was unable to run reliably with more than 200 workers. With this change I can
reliably run 500+ workers. I also ran with 600 workers successfully. This is a really big
reliability win for us.

I can see the code working to do reconnections and re-issue requests when necessary. It's
very cool.

For example:

2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing
disconnected channel to xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected
to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing
disconnected channel to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected
to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!


This addresses bug GIRAPH-306.
    https://issues.apache.org/jira/browse/GIRAPH-306


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java
PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ClientRequestId.java
PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java
PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java
PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
1375393 
  http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java
PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java
PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java
1375393 

Diff: https://reviews.apache.org/r/6687/diff/


Testing
-------

mvn clean verify
Many large test runs (500-600 workers) with PageRankBenchmark


Thanks,

Avery Ching

