hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "BristolHadoopWorkshop" by SteveLoughran
Date Fri, 14 Aug 2009 17:44:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/BristolHadoopWorkshop

The comment on the change is:
graphs

------------------------------------------------------------------------------
  
  Could you run CMSSW under Hadoop? Probably not. Very slow startup/teardown cost, so you
don't want to just run it for one/two events.
  
- Issue: How to convince physicists to embrace MR? Need to see the benefits, as physicists
don't see/care about the the costs
+ Issue: How to convince physicists to embrace MR? Need to see the benefits, as physicists
don't see/care about the the costs.
  
+ == Graphs ==
+ 
+   * [http://www.slideshare.net/steve_l/graphs-1848617 Graphs] Paolo Castagna, HP
+ 
+ This was a talk by Paolo Castagna on graph work under MR, of which PageRank is classic application
+ * graph topology does not change every iteration, so why ship it around every MR?
+ * the graph defines the other jobs you need to communicate with. 
+ The graph is a massive data structure which, if you are doing inference work, only grows
in relationships. Steve thinks: You may need some graph model which is shared across servers,
which they can all add to. There is a small problem here: keeping the information current
for 4000 servers, but what if you don't have to, what if you treat updates to the graph as
lazy facts to propagate round? 
+ 
+ Google: pregel. what do you need from a language to describe PageRank in 15 lines?
+ 
+ MS: dryad does not do graphs.
+ 
+ Projects
+  * Apache Hamburg -proposed by Edward Yang, of Hamas, is it making progress? We need code.

+  * Apache Common Graph. Dead, pre-MapReduce code.
+ 
+ === Graph Algorithms ===
+ 
+ These are what a graph project should start with
+  * Graph Search
+  * Directed/acyclic graphs
+  * Minimum Spanning Tree
+  * Shortest Path
+  * Network Flow
+ 
+ Paolo claimed that Depth First Search, DFS, doesn't work in this space. if Depth First Search
doesn't work, what to do?
+ 
+ The current best printed work on Graph over MapReduce is the recent paper [http://www2.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
Graph Twiddling in a MapReduce World], by Jonathan Cohen.
+ 
+  * You can't do transitive closure in SQL. 
+  * What would an efficient transitive closure algorithm over MR be?
+ 
+ 
+ === Discussion ===
+ 
+ There was discussion on handing big graphs in the system, ones where the graph itself is
very large. Someone need's to take Paolo's PageRank-over-MapReduce code and test it on bigger
data sets.
+ 
+ There was a good point on what is "efficient" in this world: If you can do it in O(n) on
MR, and there is no single computer in the world that can work on a graph of that scale, then
it's efficient.
+ 
+ The key point here being that yes, something done as a chain of MR jobs on a Hadoop cluster
may seem an inefficient approach, but if there is no other way to store that much data, or
run through it, then graph people will be happy.
+ 

Mime
View raw message