Return-Path: X-Original-To: apmail-giraph-dev-archive@www.apache.org Delivered-To: apmail-giraph-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CA72CD958 for ; Wed, 15 Aug 2012 18:54:57 +0000 (UTC) Received: (qmail 2529 invoked by uid 500); 15 Aug 2012 18:54:57 -0000 Delivered-To: apmail-giraph-dev-archive@giraph.apache.org Received: (qmail 2488 invoked by uid 500); 15 Aug 2012 18:54:57 -0000 Mailing-List: contact dev-help@giraph.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@giraph.apache.org Delivered-To: mailing list dev@giraph.apache.org Delivered-To: moderator for dev@giraph.apache.org Received: (qmail 96751 invoked by uid 99); 15 Aug 2012 18:52:48 -0000 Content-Type: multipart/alternative; boundary="===============1293394148028824492==" MIME-Version: 1.0 Subject: Re: Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning From: "Avery Ching" To: "Avery Ching" , "giraph" , "Alessandro Presta" Date: Wed, 15 Aug 2012 18:52:47 -0000 Message-ID: <20120815185247.23839.80592@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org Auto-Submitted: auto-generated Sender: "Avery Ching" X-ReviewGroup: giraph X-ReviewRequest-URL: https://reviews.apache.org/r/6600/ X-Sender: "Avery Ching" References: <20120815140321.23841.98451@reviews.apache.org> In-Reply-To: <20120815140321.23841.98451@reviews.apache.org> Reply-To: "Avery Ching" --===============1293394148028824492== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable > On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote: > > This is a big win, thanks! > > So it was a reliability issue masked as a scalability one: more workers= -> increased probability of network failure -> waiting forever on lost req= uests. > > Now is there anything that can be done to minimize those failures in th= e first place, or does it just depend on the cluster setup? > > = > > I didn't know we had that partitioning bug. When is updatePartitionOwne= rs() called concurrently with getPartitionOwner()? I guess we might be proc= essing vertex requests while doing the partition exchange? Thanks for the review! Next patch can improve the situation by retrying. =3D) The problem with up= datePartitionOwners() is that it is called by the main thread in BspService= Worker#setup(), but getPartitionOwners() is called by the netty server thre= ads. > On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/NettyClient.java, line 389 > > > > > > This looks a bit weird (using a TimedLogger for the timing but doin= g the actual logging on the raw Logger) although I see where the problem is= (need to call it multiple times without waiting for the next deadline). yeah, we can probably augment TimedLogger to have a level in a future patch. > On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/NettyServer.java, line 63 > > > > > > Missing a javadoc here. Thanks. Fixed. > On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/utils/TimedLogger.java, line 51 > > > > > > We could call isPrintable() here to avoid duplication. Great suggestion. Fixed. - Avery ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/6600/#review10330 ----------------------------------------------------------- On Aug. 15, 2012, 6:52 p.m., Avery Ching wrote: > = > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6600/ > ----------------------------------------------------------- > = > (Updated Aug. 15, 2012, 6:52 p.m.) > = > = > Review request for giraph. > = > = > Description > ------- > = > * Upgrade to the most recent stable version of Netty (3.5.3.Final) > * Try multiple connection attempts up to n failures > * Track requests throughout the system by keeping track of the request id= and then matching the request id to the response (minor refactoring of Wri= tableRequest to make requests simpler and support the request id) > * Improved handling of netty exceptions by dumping the exception stack to= help debug failures > * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe = (this causes divide by zero exceptions in real life) > = > = > This addresses bug GIRAPH-300. > https://issues.apache.org/jira/browse/GIRAPH-300 > = > = > Diffs > ----- > = > http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/NettyClient.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/NettyServer.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/NettyWorkerClient.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/RequestInfo.java PRE-CREATION = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/RequestServerHandler.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/ResponseClientHandler.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/SendPartitionMessagesRequest.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/SendPartitionMutationsRequest.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/SendVertexRequest.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/comm/WritableRequest.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/graph/BspServiceMaster.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/graph/GiraphJob.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/graph/partition/HashWorkerPartitioner.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/g= iraph/utils/TimedLogger.java 1372575 = > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/g= iraph/comm/ConnectionTest.java 1372575 = > = > Diff: https://reviews.apache.org/r/6600/diff/ > = > = > Testing > ------- > = > Currently, netty connection failures causes issues with more than 75 work= ers in my setup. This allows us to reach over 200+ in a reasonably reliable= network that doesn't kill connections. > = > This code passes the local Hadoop regressions and the single node Hadoop = instance regressions. It also succeeded on large runs (200+ workers) on a r= eal Hadoop cluster. > = > = > Thanks, > = > Avery Ching > = > --===============1293394148028824492==--