tinkerpop-commits mailing list archives

From spmalle...@apache.org
Subject [2/3] tinkerpop git commit: Extended the connected-components recipe
Date Fri, 15 Jun 2018 12:24:49 GMT
Extended the connected-components recipe

Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/b0878227
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/b0878227
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/b0878227

Branch: refs/heads/TINKERPOP-1967
Commit: b087822708707013f7f0cd3b5abaf6d0f574a72e
Parents: 28d4b02
Author: HadoopMarc <vtslab@xs4all.nl>
Authored: Sun Jun 10 15:17:17 2018 +0200
Committer: HadoopMarc <vtslab@xs4all.nl>
Committed: Wed Jun 13 21:43:54 2018 +0200

 docs/src/recipes/connected-components.asciidoc |  94 ++++++++++++--------
 docs/static/images/cc-scale-ratio.png          | Bin 0 -> 14393 bytes
 docs/static/images/cc-scale-size.png           | Bin 0 -> 12220 bytes
 3 files changed, 58 insertions(+), 36 deletions(-)

diff --git a/docs/src/recipes/connected-components.asciidoc b/docs/src/recipes/connected-components.asciidoc
index edbeec5..850c31f 100644
--- a/docs/src/recipes/connected-components.asciidoc
+++ b/docs/src/recipes/connected-components.asciidoc
@@ -31,11 +31,11 @@ Depending on the size of the graph, three solution regimes can be discriminated:
 1. Small graphs that fit in the memory of a single machine
-2. Medium graphs backed by storage for which a linear scan is still feasible. This regime is left to third party
+2. Medium-sized graphs backed by storage for which an OLTP linear scan is still feasible. This regime is left to third party
 TinkerPop implementations, since TinkerPop itself has no storage-backed reference implementations. The idea is that
 component membership is stored in the graph, rather than in memory.
-3. Large graphs requiring an OLAP approach to yield results in a reasonable time.
+3. Large graphs requiring an approach with `HadoopGraph` and `SparkGraphComputer` to yield results in a reasonable time.
 These regimes are discussed separately using the following graph with three weakly connected
@@ -55,16 +55,21 @@ g.addV().property(id, "A").as("a").
+==== Small graph traversals
-===== Small graphs
-Connected components in a small graph can be determined with both an OLTP traversal and the
+Connected components in a small graph can be determined with either an OLTP traversal or the OLAP
 `connectedComponent()`-step. The `connectedComponent()`-step is available as of TinkerPop 3.4.0 and is
 described in more detail in the
+The traversal looks like:
+    group().by('gremlin.connectedComponentVertexProgram.component').
+    select(values).unfold()
 A straightforward way to detect the various subgraphs with an OLTP traversal is to do this:
 g.V().emit(cyclicPath().or().not(both())).                                    <1>
@@ -73,7 +78,6 @@ g.V().emit(cyclicPath().or().not(both())).                             
     by(path().unfold().dedup().fold()).                                       <4>
     select(values).unfold()                                                   <5>
 <1> The initial emit() step allows for output of isolated vertices, in addition to the discovery of
 components as described in (2).
@@ -83,7 +87,7 @@ path.  Collection `'a'` is used to keep track of visited vertices, for both
 and new traversals resulting from the `g.V()` linear scan.
 <3> While `'a'` nicely keeps track of vertices already visited, the actual components need to be extracted from the
-path information of surviving traversers. The `path().unfold().limit(1)` closure provides the starting vertex
+path information. The `path().unfold().limit(1)` closure provides the starting vertex
 of surviving traversers, which can be used to group the components.
 <4> This clause collects the unique vertices from all paths with the same starting vertex, thus from the same
@@ -91,39 +95,57 @@ weak component.
 <5> The values of the groupby map contain the lists of vertices making up the requested
-This algorithm completes in linear time with the number of vertices and edges, because a traversal is started for each
-vertex and each edge with its associated out-vertex is visited exactly once.
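This linear-time behaviour is the same as that of a plain breadth-first search over an adjacency list. As a rough illustration outside Gremlin, here is a Python sketch (my own simplification, not part of the recipe): a traversal is started from every not-yet-visited vertex, and each vertex and edge is handled exactly once, giving O(V+E).

```python
from collections import deque

def connected_components(vertices, edges):
    """Find weakly connected components with BFS in O(V+E):
    start a traversal from every not-yet-visited vertex;
    every vertex and edge is touched exactly once."""
    adj = {v: [] for v in vertices}
    for u, v in edges:          # treat edges as undirected
        adj[u].append(v)
        adj[v].append(u)
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue            # vertex already assigned to a component
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for n in adj[v]:
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
        components.append(component)
    return components
```

Isolated vertices come out as singleton components, mirroring what the initial `emit()` achieves in the traversal above.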
-==== Large graphs
-Large graphs require an OLAP solution with a custom VertexProgram that can be run using a graph implementation's
-GraphComputer, in particular `SparkGraphComputer` on a `HadoopGraph`. The OLAP solution also runs faster for most
-medium-sized graphs, that is when these graph have a 'natural' structure with a limited maximum path length.
+==== Small graph scalability
-The TinkerPop library of vertex programs contains the `WeakComponentsVertexProgram` which can be run in the same
-way as the link:http://tinkerpop.apache.org/docs/x.y.z/reference/#peerpressurevertexprogram[PeerPressureVertexProgram]:
+The scalability of the OLTP traversal and the `connectedComponent()`-step for in-memory graphs is shown in the figures
-result = g.getGraph().compute().
-    program(WeakComponentsVertexProgram.build().maxIterations(100).create()).
-    mapReduce(ClusterPopulationMapReduce.build().create()).
-    mapReduce(ClusterCountMapReduce.build().create()).
-    submit().get()
-gResult = result.graph().traversal()
+.Run times for finding connected components in a randomly generated graph with 10 components of equal size and with an edge/vertex ratio of 6
+image::cc-scale-size.png[width=600, side=bottom]
-The vertex program has interconnected vertices exchange id's and store the lowest id until no vertex receives a
-lower id. This algorithm is commonly applied in
+In general, the `connectedComponent()`-step is almost a factor of two faster than the OLTP traversal. Only for
+very small graphs is the overhead of running the ConnectedComponentVertexProgram larger than that of the OLTP traversal.
+The vertex program works by having interconnected vertices exchange ids and store the lowest id until no vertex
+receives a lower id. This algorithm is commonly applied in
 link:https://en.wikipedia.org/wiki/Bulk_synchronous_parallel[bulk synchronous parallel] systems, e.g. in
-link:https://spark.apache.org/graphx[Apache Spark GraphX].
+link:https://spark.apache.org/graphx[Apache Spark GraphX]. Overhead for the vertex program arises because it has to run
+as many cycles as the longest shortest path between any two vertices in a component of the graph. In every cycle
+each vertex has to be checked for being "halted". Overhead of the OLTP traversal consists of each traverser having
+to carry complete path information. For pure depth-first-search or breadth-first-search implementations,
+connected-component algorithms should scale as [.big]##O##(V+E). For the traversals in the figure above this is almost the case.
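The cycle count behind this overhead can be made concrete with a toy version of the lowest-id propagation, sketched here in Python (my own simplified illustration, not the actual ConnectedComponentVertexProgram): in each superstep every vertex adopts the lowest id seen among itself and its neighbours, and the program halts once no vertex changes, so the number of supersteps tracks the longest shortest path within a component.

```python
from collections import defaultdict

def weak_components_bsp(vertices, edges):
    """Toy bulk-synchronous lowest-id propagation: every vertex starts
    with its own id as component label and repeatedly adopts the lowest
    label among its neighbours; the loop stops in the superstep where
    no label changes (all vertices 'halt')."""
    neighbours = defaultdict(set)
    for u, v in edges:          # undirected adjacency
        neighbours[u].add(v)
        neighbours[v].add(u)
    label = {v: v for v in vertices}
    cycles = 0
    changed = True
    while changed:
        changed = False
        cycles += 1
        # one superstep: each vertex reads its neighbours' current labels
        new_label = {}
        for v in vertices:
            best = min([label[v]] + [label[n] for n in neighbours[v]])
            new_label[v] = best
            if best != label[v]:
                changed = True
        label = new_label
    return label, cycles
```

A longer path inside a component forces more supersteps, which is the mesh-like versus lint-like effect discussed below.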
+.Run times for finding connected components in a randomly generated graph with 10 components, each consisting of 6400 vertices
+The random graphs used for the scalability tests can be modulated with the edge/vertex ratio. For small ratios the
+components generated are more lint-like and harder to process by the `connectedComponent()`-step. For high ratios
+the components are more mesh-like and the ConnectedComponentVertexProgram needs few cycles to process the graph.
+These characteristics show clearly in the figure. Indeed, for a given number of vertices, the run time of the
+`connectedComponent()`-step does not depend on the number of edges, but rather on the maximum shortest path length in the graph.
+==== Large graphs
+Large graphs in TinkerPop require distributed processing by `SparkGraphComputer` to get results in a reasonable time
+(OLAP approach). This means that the graph must be available as a `HadoopGraph` (third party TinkerPop implementations
+often allow making a graph available as a `HadoopGraph` by providing a Hadoop `InputFormat`). Running the
+`connectedComponent()`-step on a `HadoopGraph` works the same as for a small graph, provided that `SparkGraphComputer`
+is specified as the graph computer, either with the `gremlin.hadoop.defaultGraphComputer` property or as part of the `withComputer()`-step.
+Scalability of the `connectedComponent()`-step with `SparkGraphComputer` is high, but note that:
+* the graph should fit in the memory of the Spark cluster to allow the VertexProgram to run its cycles without spilling
+intermediate results to disk and losing most of the gains from the distributed processing
-==== Scalability
+* as discussed for small graphs, the BSP algorithm does not play well with graphs that have a large shortest path
+between some pairs of vertices. Overcoming this limitation is still a
+link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[subject of academic research].
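For illustration, a `HadoopGraph` properties file that selects `SparkGraphComputer` via the `gremlin.hadoop.defaultGraphComputer` property might look like the following sketch (the input location and reader are placeholders; the class names are assumed from the TinkerPop distribution, so verify them against your version):

```properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.inputLocation=data/graph.kryo
gremlin.hadoop.outputLocation=output
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
```

With such a configuration, `g.V().connectedComponent()` can be submitted unchanged; only the graph computer differs from the small-graph case.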
- - limits and run time regime 1
- - test of friendster graph regime 3
- - discuss: link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[http://www.vldb.org/pvldb/vol7/p1821-yan.pdf]
\ No newline at end of file

diff --git a/docs/static/images/cc-scale-ratio.png b/docs/static/images/cc-scale-ratio.png
new file mode 100644
index 0000000..33a842d
Binary files /dev/null and b/docs/static/images/cc-scale-ratio.png differ

diff --git a/docs/static/images/cc-scale-size.png b/docs/static/images/cc-scale-size.png
new file mode 100644
index 0000000..2b08a89
Binary files /dev/null and b/docs/static/images/cc-scale-size.png differ
