giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Avery Ching" <avery.ch...@gmail.com>
Subject Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning
Date Tue, 14 Aug 2012 07:32:06 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/
-----------------------------------------------------------

Review request for giraph.


Description
-------

* Upgrade to the most recent stable version of Netty (3.5.3.Final)
* Try multiple connection attempts up to n failures
* Track requests throughout the system by keeping track of the request id and then matching
the request id to the response (minor refactoring of WritableRequest to make requests simpler
and support the request id)
* Improved handling of netty exceptions by dumping the exception stack to help debug failures
* Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes divide
by zero exceptions in real life)


This addresses bug GIRAPH-300.
    https://issues.apache.org/jira/browse/GIRAPH-300


Diffs
-----

  http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java
1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java
1372575 

Diff: https://reviews.apache.org/r/6600/diff/


Testing
-------

Currently, netty connection failures causes issues with more than 75 workers in my setup.
This allows us to reach over 200+ in a reasonably reliable network that doesn't kill connections.

This code passes the local Hadoop regressions and the single node Hadoop instance regressions.
It also succeeded on large runs (200+ workers) on a real Hadoop cluster.


Thanks,

Avery Ching


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message