Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/BristolHadoopWorkshop
The comment on the change is:
graphs
------------------------------------------------------------------------------
Could you run CMSSW under Hadoop? Probably not. Very slow startup/teardown cost, so you don't want to just run it for one/two events.
- Issue: How to convince physicists to embrace MR? Need to see the benefits, as physicists don't see/care about the the costs
+ Issue: How to convince physicists to embrace MR? Need to see the benefits, as physicists don't see/care about the the costs.
+ == Graphs ==
+
+ * [http://www.slideshare.net/steve_l/graphs-1848617 Graphs] Paolo Castagna, HP
+
+ This was a talk by Paolo Castagna on graph work under MR, of which PageRank is classic application
+ * graph topology does not change every iteration, so why ship it around every MR?
+ * the graph defines the other jobs you need to communicate with.
+ The graph is a massive data structure which, if you are doing inference work, only grows in relationships. Steve thinks: You may need some graph model which is shared across servers, which they can all add to. There is a small problem here: keeping the information current for 4000 servers, but what if you don't have to, what if you treat updates to the graph as lazy facts to propagate round?
+
+ Google: pregel. what do you need from a language to describe PageRank in 15 lines?
+
+ MS: dryad does not do graphs.
+
+ Projects
+ * Apache Hamburg -proposed by Edward Yang, of Hamas, is it making progress? We need code.
+ * Apache Common Graph. Dead, pre-MapReduce code.
+
+ === Graph Algorithms ===
+
+ These are what a graph project should start with
+ * Graph Search
+ * Directed/acyclic graphs
+ * Minimum Spanning Tree
+ * Shortest Path
+ * Network Flow
+
+ Paolo claimed that Depth First Search, DFS, doesn't work in this space. if Depth First Search doesn't work, what to do?
+
+ The current best printed work on Graph over MapReduce is the recent paper [http://www2.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120 Graph Twiddling in a MapReduce World], by Jonathan Cohen.
+
+ * You can't do transitive closure in SQL.
+ * What would an efficient transitive closure algorithm over MR be?
+
+
+ === Discussion ===
+
+ There was discussion on handing big graphs in the system, ones where the graph itself is very large. Someone need's to take Paolo's PageRank-over-MapReduce code and test it on bigger data sets.
+
+ There was a good point on what is "efficient" in this world: If you can do it in O(n) on MR, and there is no single computer in the world that can work on a graph of that scale, then it's efficient.
+
+ The key point here being that yes, something done as a chain of MR jobs on a Hadoop cluster may seem an inefficient approach, but if there is no other way to store that much data, or run through it, then graph people will be happy.
+