giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Avery Ching" <avery.ch...@gmail.com>
Subject Re: Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning
Date Wed, 15 Aug 2012 18:52:47 GMT


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > This is a big win, thanks!
> > So it was a reliability issue masked as a scalability one: more workers -> increased
probability of network failure -> waiting forever on lost requests.
> > Now is there anything that can be done to minimize those failures in the first place,
or does it just depend on the cluster setup?
> > 
> > I didn't know we had that partitioning bug. When is updatePartitionOwners() called
concurrently with getPartitionOwner()? I guess we might be processing vertex requests while
doing the partition exchange?

Thanks for the review!

Next patch can improve the situation by retrying. =)  The problem with updatePartitionOwners()
is that it is called by the main thread in BspServiceWorker#setup(), but getPartitionOwners()
is called by the netty server threads.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java,
line 389
> > <https://reviews.apache.org/r/6600/diff/1/?file=139832#file139832line389>
> >
> >     This looks a bit weird (using a TimedLogger for the timing but doing the actual
logging on the raw Logger) although I see where the problem is (need to call it multiple times
without waiting for the next deadline).

yeah, we can probably augment TimedLogger to have a level in a future patch.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java,
line 63
> > <https://reviews.apache.org/r/6600/diff/1/?file=139833#file139833line63>
> >
> >     Missing a javadoc here.

Thanks.  Fixed.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java,
line 51
> > <https://reviews.apache.org/r/6600/diff/1/?file=139845#file139845line51>
> >
> >     We could call isPrintable() here to avoid duplication.

Great suggestion.  Fixed.


- Avery


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/#review10330
-----------------------------------------------------------


On Aug. 15, 2012, 6:52 p.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6600/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2012, 6:52 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and then matching
the request id to the response (minor refactoring of WritableRequest to make requests simpler
and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to help debug
failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes
divide by zero exceptions in real life)
> 
> 
> This addresses bug GIRAPH-300.
>     https://issues.apache.org/jira/browse/GIRAPH-300
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java
1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java
1372575 
> 
> Diff: https://reviews.apache.org/r/6600/diff/
> 
> 
> Testing
> -------
> 
> Currently, netty connection failures causes issues with more than 75 workers in my setup.
This allows us to reach over 200+ in a reasonably reliable network that doesn't kill connections.
> 
> This code passes the local Hadoop regressions and the single node Hadoop instance regressions.
It also succeeded on large runs (200+ workers) on a real Hadoop cluster.
> 
> 
> Thanks,
> 
> Avery Ching
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message