hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "BristolHadoopWorkshop" by SteveLoughran
Date Fri, 14 Aug 2009 17:44:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:

The comment on the change is:

  Could you run CMSSW under Hadoop? Probably not. Very slow startup/teardown cost, so you
don't want to just run it for one/two events.
- Issue: How to convince physicists to embrace MR? Need to see the benefits, as physicists
don't see/care about the the costs
+ Issue: How to convince physicists to embrace MR? Need to see the benefits, as physicists
don't see/care about the the costs.
+ == Graphs ==
+   * [http://www.slideshare.net/steve_l/graphs-1848617 Graphs] Paolo Castagna, HP
+ This was a talk by Paolo Castagna on graph work under MR, of which PageRank is classic application
+ * graph topology does not change every iteration, so why ship it around every MR?
+ * the graph defines the other jobs you need to communicate with. 
+ The graph is a massive data structure which, if you are doing inference work, only grows
in relationships. Steve thinks: You may need some graph model which is shared across servers,
which they can all add to. There is a small problem here: keeping the information current
for 4000 servers, but what if you don't have to, what if you treat updates to the graph as
lazy facts to propagate round? 
+ Google: pregel. what do you need from a language to describe PageRank in 15 lines?
+ MS: dryad does not do graphs.
+ Projects
+  * Apache Hamburg -proposed by Edward Yang, of Hamas, is it making progress? We need code.

+  * Apache Common Graph. Dead, pre-MapReduce code.
+ === Graph Algorithms ===
+ These are what a graph project should start with
+  * Graph Search
+  * Directed/acyclic graphs
+  * Minimum Spanning Tree
+  * Shortest Path
+  * Network Flow
+ Paolo claimed that Depth First Search, DFS, doesn't work in this space. if Depth First Search
doesn't work, what to do?
+ The current best printed work on Graph over MapReduce is the recent paper [http://www2.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
Graph Twiddling in a MapReduce World], by Jonathan Cohen.
+  * You can't do transitive closure in SQL. 
+  * What would an efficient transitive closure algorithm over MR be?
+ === Discussion ===
+ There was discussion on handing big graphs in the system, ones where the graph itself is
very large. Someone need's to take Paolo's PageRank-over-MapReduce code and test it on bigger
data sets.
+ There was a good point on what is "efficient" in this world: If you can do it in O(n) on
MR, and there is no single computer in the world that can work on a graph of that scale, then
it's efficient.
+ The key point here being that yes, something done as a chain of MR jobs on a Hadoop cluster
may seem an inefficient approach, but if there is no other way to store that much data, or
run through it, then graph people will be happy.

View raw message