Return-Path: X-Original-To: apmail-incubator-giraph-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-giraph-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 18A86969A for ; Thu, 3 May 2012 18:17:14 +0000 (UTC) Received: (qmail 79943 invoked by uid 500); 3 May 2012 18:17:13 -0000 Delivered-To: apmail-incubator-giraph-dev-archive@incubator.apache.org Received: (qmail 79916 invoked by uid 500); 3 May 2012 18:17:13 -0000 Mailing-List: contact giraph-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: giraph-dev@incubator.apache.org Delivered-To: mailing list giraph-dev@incubator.apache.org Received: (qmail 79906 invoked by uid 99); 3 May 2012 18:17:13 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 May 2012 18:17:13 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 May 2012 18:17:11 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id E39AB42EEC5 for ; Thu, 3 May 2012 18:16:50 +0000 (UTC) Date: Thu, 3 May 2012 18:16:50 +0000 (UTC) From: "Avery Ching (JIRA)" To: giraph-dev@incubator.apache.org Message-ID: <225060534.22957.1336069010933.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <606750719.57622.1327006360677.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Assigned] (GIRAPH-127) Extending the API with a master.compute() function. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/GIRAPH-127?page=3Dcom.atlassia= n.jira.plugin.system.issuetabpanels:all-tabpanel ] Avery Ching reassigned GIRAPH-127: ---------------------------------- Assignee: Semih Salihoglu Looking forward to this. =20 > Extending the API with a master.compute() function. > --------------------------------------------------- > > Key: GIRAPH-127 > URL: https://issues.apache.org/jira/browse/GIRAPH-127 > Project: Giraph > Issue Type: New Feature > Components: bsp, examples, graph > Reporter: Semih Salihoglu > Assignee: Semih Salihoglu > > First of all, sorry for the long explanation to this feature. > I want to expand the API of Giraph with a new function called master.comp= ute(), that would get called at the master before each superstep and I will= try to explain the purpose that it would serve with an example. Let's say = we want to implement the following simplified version of the k-means cluste= ring algorithm. Pseudocode below: > * Input G(V, E), k, numEdgesThreshold, maxIterations > * Algorithm: > * int numEdgesCrossingClusters =3D Integer.MAX_INT; > * int iterationNo =3D 0; > * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < = maxIterations) { > * iterationNo++; > * int[] clusterCenters =3D pickKClusterCenters(k, G); > * findClusterCenters(G, clusterCenters); > * numEdgesCrossingClusters =3D countNumEdgesCrossingClusters(); > * } > The algorithm goes through the following steps in iterations: > 1) Pick k random initial cluster centers > 2) Assign each vertex to the cluster center that it's closest to (in Gira= ph, this can be implemented in message passing similar to how ShortestPaths= is implemented): > 3) Count the nuimber of edges crossing clusters > 4) Go back to step 1, if there are a lot of edges crossing clusters and w= e haven't exceeded maximum number of iterations yet. > In an algorithm like this, step 2 and 3 are where most of the work happen= s and both parts have very neat message-passing implementations. I'll try t= o give an overview without going into the details. Let's say we define a Ve= rtex in Giraph to hold a custom Writable object that holds 2 integer values= and sends a message with upto 2 integer values. > Step 2 is very similar to ShortestPaths algorithm and has two stages: In = the first stage, each vertex checks to see whether or not it's one of the c= luster centers. If so, it assigns itself the value (id, 0), otherwise it as= signs itself (Null, Null). In the 2nd stage, the vertices assign themselves= to the minimum distance cluster center by looking at their neighbors (clus= ter centers, distance) values (received as 2 integer messages) and their cu= rrent values, and changing their values if they find a lower distance clust= er center. This happens in x number of supersteps until every vertex conver= ges. > Step 3, counting the number of edges crossing clusters, is also very easy= to implement in Giraph. Once each vertex has a cluster center, the number = of edges crossing clusters can be counted by an aggregator, let's say calle= d "num-edges-crossing". It would again have two stages: First stage, every = vertex just sends its cluster id to all its neighbors. Second stage, every = vertex looks at their neighbors' cluster ids in the messages, and for each = cluster id that is not equal to its own cluster id, it increments "num-edge= s-crossing" by 1. > The other 2 steps, step 1 and 4, are very simple sequential computations.= Step 1 just picks k random vertex ids and puts it into an aggregator. Step= 4 just compares "num-edges-crossing" by a threshold and also checks whethe= r or not the algorithm has exceeded maxIterations (not supersteps but itera= tions of going through Steps 1-4). With the current API, it's not clear whe= re to do these computations. There is a per worker function preSuperstep() = that can be implemented, but if we decide to pick a special worker, let's s= ay worker 1, to pick the k vertices then we'd waste an entire superstep wh= ere only worker 1 would do work, (by picking k vertices in preSuperstep() = and put them into an aggregator), and all other workers would be idle. Tryi= ng to do this in worker 1 in postSuperstep() would not work either because,= worker 1 needs to know that all the vertices have converged to understand = that it's time to pick k vertices or it's time do check in step 4, which wo= uld only be available to it in the beginning of the next superstep. > A master.compute() extension would run at the master and before the super= step and would modify the aggregator that would keep the k vertices before = the aggregators are broadcast to the workers, which are all very short sequ= ential computations, so they would not waste resources the way a preSuperst= ep() or postSuperstep() approach would do. It would also enable running new= algorithms like kmeans that are composed of very vertex-centric computatio= ns glued together by small sequential ones. It would basically boost Giraph= with sequential computation in a non-wasteful way. > I am a phd student at Stanford and I have been working on my own BSP/Preg= el implementation since last year. It's called GPS. I haven't distributed i= t, mainly because in September I learned about Giraph and I decided to slow= down on working on it :). We have basically been using GPS as our own rese= arch platform. The source code for GPS is here if any one is interested (ht= tps://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the mas= ter.compute() feature in GPS, and here's an example of KMeans implementatio= n in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-pr= ojects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called Gl= obalObjects in GPS). There is another example (https://subversion.assembla.= com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/= ), which I'll skip explaining because it's very detailed and would make the= similar points that I am trying to make with k-means. Master.compute() in = general would make it possible to glue together any graph algorithm that is= composed of multiple stages with different message types and computations = that is conducive to run with vertex.compute(). There are many examples of = such algorithms: recursive partitioning, triangle counting, even much simpl= er things like finding shortests paths for 100 vertices in pieces (first to= 5 vertices, then to another 5, then to another 5, etc..), which would be g= ood because trying to find shortests paths to 100 vertices require a very l= arge messages (would need to store 100 integers per message)). > If the Giraph team approves, I would like to take a similar approach in i= mplementing this feature in Giraph as I've done in GPS. Overall: > Add a Master.java to org.apache.giraph.graph, that is default Master, wit= h a compute function that by default aggregates all aggregators and does th= e check of whether or not the computation has ended (by comparining numVert= ices with numFinishedVertices). This would be a refactoring of org.apache.g= iraph.graph.BspServiceMaster class (as far as I can see). > Extend GiraphJob to have a setMaster() method to set a master class (by d= efault it would be the default master above) > The rest would be sending the custom master class to probably all workers= but only the master would instantiate it with reflection. I need to learn = more on how to do these, I am not familiar with that part of the Giraph cod= e base yet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrato= rs: https://issues.apache.org/jira/secure/ContactAdministrators!default.jsp= a For more information on JIRA, see: http://www.atlassian.com/software/jira