From commits-return-30027-archive-asf-public=cust-asf.ponee.io@tinkerpop.apache.org Fri Jun 15 14:24:50 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id 19BAB18066C for ; Fri, 15 Jun 2018 14:24:49 +0200 (CEST) Received: (qmail 41644 invoked by uid 500); 15 Jun 2018 12:24:49 -0000 Mailing-List: contact commits-help@tinkerpop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@tinkerpop.apache.org Delivered-To: mailing list commits@tinkerpop.apache.org Received: (qmail 41635 invoked by uid 99); 15 Jun 2018 12:24:49 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 15 Jun 2018 12:24:49 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id B9561E1134; Fri, 15 Jun 2018 12:24:48 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: spmallette@apache.org To: commits@tinkerpop.apache.org Date: Fri, 15 Jun 2018 12:24:49 -0000 Message-Id: <7beb2c1cc5c54e9a967b722bbbbbb928@git.apache.org> In-Reply-To: References: X-Mailer: ASF-Git Admin Mailer Subject: [2/3] tinkerpop git commit: Extended the connected-components recipe Extended the connected-components recipe Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/b0878227 Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/b0878227 Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/b0878227 Branch: refs/heads/TINKERPOP-1967 Commit: b087822708707013f7f0cd3b5abaf6d0f574a72e Parents: 28d4b02 Author: HadoopMarc Authored: Sun Jun 10 15:17:17 2018 +0200 Committer: HadoopMarc Committed: Wed Jun 13 21:43:54 2018 +0200 ---------------------------------------------------------------------- docs/src/recipes/connected-components.asciidoc | 94 ++++++++++++-------- docs/static/images/cc-scale-ratio.png | Bin 0 -> 14393 bytes docs/static/images/cc-scale-size.png | Bin 0 -> 12220 bytes 3 files changed, 58 insertions(+), 36 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/b0878227/docs/src/recipes/connected-components.asciidoc ---------------------------------------------------------------------- diff --git a/docs/src/recipes/connected-components.asciidoc b/docs/src/recipes/connected-components.asciidoc index edbeec5..850c31f 100644 --- a/docs/src/recipes/connected-components.asciidoc +++ b/docs/src/recipes/connected-components.asciidoc @@ -31,11 +31,11 @@ Depending on the size of the graph, three solution regimes can be discriminated: 1. Small graphs that fit in the memory of a single machine -2. Medium graphs backed by storage for which a linear scan is still feasible. This regime is left to third party +2. Medium-sized graphs backed by storage for which an OLTP linear scan is still feasible. This regime is left to third party TinkerPop implementations, since TinkerPop itself has no storage-backed reference implementations. The idea is that component membership is stored in the graph, rather than in memory. -3. Large graphs requiring an OLAP approach to yield results in a reasonable time. +3. Large graphs requiring an approach with `HadoopGraph` and `SparkGraphComputer` to yield results in a reasonable time. These regimes are discussed separately using the following graph with three weakly connected components: @@ -55,16 +55,21 @@ g.addV().property(id, "A").as("a"). addE("link").from("d").to("e").iterate() ---- +==== Small graph traversals -===== Small graphs - -Connected components in a small graph can be determined with both an OLTP traversal and the OLAP +Connected components in a small graph can be determined with either an OLTP traversal or the OLAP `connectedComponent()`-step. The `connectedComponent()`-step is available as of TinkerPop 3.4.0 and is described in more detail in the link:http://tinkerpop.apache.org/docs/x.y.z/reference/#connectedcomponent-step[Reference Documentation]. +The traversal looks like: +[gremlin-groovy,existing] +---- +g.withComputer().V().connectedComponent(). + group().by('gremlin.connectedComponentVertexProgram.component'). + select(values).unfold() +---- A straightforward way to detect the various subgraphs with an OLTP traversal is to do this: - [gremlin-groovy,existing] ---- g.V().emit(cyclicPath().or().not(both())). <1> @@ -73,7 +78,6 @@ g.V().emit(cyclicPath().or().not(both())). <1 by(path().unfold().dedup().fold()). <4> select(values).unfold() <5> ---- - <1> The initial emit() step allows for output of isolated vertices, in addition to the discovery of components as described in (2). @@ -83,7 +87,7 @@ path. Collection `'a'` is used to keep track of visited vertices, for both subt and new traversals resulting from the `g.V()` linear scan. <3> While `'a'` nicely keeps track of vertices already visited, the actual components need to be extracted from the -path information of surviving traversers. The `path().unfold().limit(1)` closure provides the starting vertex +path information. The `path().unfold().limit(1)` closure provides the starting vertex of surviving traversers, which can be used to group the components. <4> This clause collects the unique vertices from all paths with the same starting vertex, thus from the same @@ -91,39 +95,57 @@ weak component. <5> The values of the groupby map contain the lists of vertices making up the requested components. -This algorithm completes in linear time with the number of vertices and edges, because a traversal is started for each -vertex and each edge with its associated out-vertex is visited exactly once. - -==== Large graphs -Large graphs require an OLAP solution with a custom VertexProgram that can be run using a graph implementation's -GraphComputer, in particular `SparkGraphComputer` on a `HadoopGraph`. The OLAP solution also runs faster for most -medium-sized graphs, that is when these graph have a 'natural' structure with a limited maximum path length. +==== Small graph scalability -The TinkerPop library of vertex programs contains the `WeakComponentsVertexProgram` which can be run in the same -way as the link:http://tinkerpop.apache.org/docs/x.y.z/reference/#peerpressurevertexprogram[PeerPressureVertexProgram]: +The scalability of the OLTP traversal and the `connectedComponent()`-step for in-memory graphs is shown in the figures +below. -[gremlin-groovy,existing] ----- -result = g.getGraph().compute(). - program(WeakComponentsVertexProgram.build().maxIterations(100).create()). - mapReduce(ClusterPopulationMapReduce.build().create()). - mapReduce(ClusterCountMapReduce.build().create()). - submit().get() -result.memory().clusterPopulation -gResult = result.graph().traversal() -gResult.V().valueMap(true) ----- +[[cc-scale-size]] +.Run times for finding connected components in a randomly generated graph with 10 components of equal size and with an edge/vertex ratio of 6 +image::cc-scale-size.png[width=600, side=bottom] -The vertex program has interconnected vertices exchange id's and store the lowest id until no vertex receives a -lower id. This algorithm is commonly applied in +In general, the `connectedComponent()`-step is almost a factor two faster than the OLTP traversal. Only, for very +small graphs the overhead of running the ConnectedComponentVertexProgram is larger than that of the OLTP traversal. +The vertex program works by having interconnected vertices exchange id's and store the lowest id until no vertex +receives a lower id. This algorithm is commonly applied in link:https://en.wikipedia.org/wiki/Bulk_synchronous_parallel[bulk synchronous parallel] systems, e.g. in -link:https://spark.apache.org/graphx[Apache Spark GraphX]. +link:https://spark.apache.org/graphx[Apache Spark GraphX]. Overhead for the vertex program arises because it has to run +as many cycles as the largest length of the shortest paths between any two vertices in a component of the graph. In +every cycle each vertex has to be checked for being +"halted". Overhead of the OLTP traversal consists of each traverser having to carry complete path information. For +pure depth-first-search or breadth-first-search implementations, connected-component algotithms should scale +as [.big]##O##(V+E). For the traversals in the figure above this is almost the case. + + +[[cc-scale-ratio]] +.Run times for finding connected components in a randomly generated graph with 10 components, each consisting of 6400 vertices +image::cc-scale-ratio.png[width=600] + +The random graphs used for the scalability tests can be modulated with the edge/vertex ratio. For small ratios the +components generated are more lint-like and harder to process by the `connectedComponent()`-step. For high ratios +the components are more mesh-like and the ConnectedComponentVertexProgram needs few cycles to process the graph. These +characteristics show clearly from the graph. Indeed, for a given number of vertices, the run time of the +`connectedComponent()`-step does not depend on the number of edges, but rather on the maximum shortest path length in +the graph. + + +==== Large graphs + +Large graphs in TinkerPop require distributed processing by `SparkGraphComputer` to get results in a reasonable time (OLAP +approach). This means that the graph must be available as `HadoopGraph` (third party TinkerPop implementations often +allow to make a graph available as an `HadoopGraph` by providing an Hadoop `InputFormat`). Running the +`connectedComponent()`-step on +an `HadoopGraph` works the same as for a small graph, provided that `SparkGraphComputer` is specified as the graph computer, +either with the `gremlin.hadoop.defaultGraphComputer` property or as part of the `withComputer()`-step. + +Scalability of the the `connectedComponent()`-step with `SparkGraphComputer` is high, but note that: + +* the graph should fit in the memory of the Spark cluster to allow the VertexProgram to run its cycles without spilling +intermediate results to disk and loosing most of the gains from the distributed processing -==== Scalability +* as discussed for small graphs, the BSP algorithm does not play well with graphs having a large shortest path between +any pair of vertices. Overcoming this limitation is still a +link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[subject of academic research]. -ToDo: - - limits and run time regime 1 - - test of friendster graph regime 3 - - discuss: link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[http://www.vldb.org/pvldb/vol7/p1821-yan.pdf] \ No newline at end of file http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/b0878227/docs/static/images/cc-scale-ratio.png ---------------------------------------------------------------------- diff --git a/docs/static/images/cc-scale-ratio.png b/docs/static/images/cc-scale-ratio.png new file mode 100644 index 0000000..33a842d Binary files /dev/null and b/docs/static/images/cc-scale-ratio.png differ http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/b0878227/docs/static/images/cc-scale-size.png ---------------------------------------------------------------------- diff --git a/docs/static/images/cc-scale-size.png b/docs/static/images/cc-scale-size.png new file mode 100644 index 0000000..2b08a89 Binary files /dev/null and b/docs/static/images/cc-scale-size.png differ