tinkerpop-commits mailing list archives

From dkupp...@apache.org
Subject [3/5] incubator-tinkerpop git commit: splitted implementations.asciidoc into implementations-hadoop.asciidoc and implementations-neo4j.asciidoc
Date Thu, 18 Feb 2016 22:55:07 GMT
http://git-wip-us.apache.org/repos/asf/incubator-tinkerpop/blob/bbf5b3f4/docs/src/reference/implementations-neo4j.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/reference/implementations-neo4j.asciidoc b/docs/src/reference/implementations-neo4j.asciidoc
new file mode 100644
index 0000000..5602754
--- /dev/null
+++ b/docs/src/reference/implementations-neo4j.asciidoc
@@ -0,0 +1,921 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+////
+[[implementations]]
+Implementations
+===============
+
+image::gremlin-racecar.png[width=325]
+
+[[graph-system-provider-requirements]]
+Graph System Provider Requirements
+----------------------------------
+
+image:tinkerpop-enabled.png[width=140,float=left] At the core of TinkerPop3 is a Java8 API. The implementation of this
+core API and its validation via the `gremlin-test` suite is all that is required of a graph system provider wishing to
+provide a TinkerPop3-enabled graph engine. Once a graph system has a valid implementation, then all the applications
+provided by TinkerPop (e.g. Gremlin Console, Gremlin Server, etc.) and 3rd-party developers (e.g. Gremlin-Scala,
+Gremlin-JS, etc.) will integrate properly. Finally, please feel free to use the logo on the left to promote your
+TinkerPop3 implementation.
+
+Implementing Gremlin-Core
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The classes that a graph system provider should focus on implementing are itemized below. It is a good idea to study
+the <<tinkergraph-gremlin,TinkerGraph>> (in-memory OLTP and OLAP in `tinkergraph-gremlin`), <<neo4j-gremlin,Neo4jGraph>>
+(OLTP w/ transactions in `neo4j-gremlin`) and/or <<hadoop-gremlin,HadoopGraph>> (OLAP in `hadoop-gremlin`)
+implementations for ideas and patterns.
+
+. Online Transactional Processing Graph Systems (*OLTP*)
+ .. Structure API: `Graph`, `Element`, `Vertex`, `Edge`, `Property` and `Transaction` (if transactions are supported).
+ .. Process API: `TraversalStrategy` instances for optimizing Gremlin traversals to the provider's graph system (e.g. `TinkerGraphStepStrategy`).
+. Online Analytical Processing Graph Systems (*OLAP*)
+ .. Everything required of OLTP is required of OLAP (but not vice versa).
+ .. GraphComputer API: `GraphComputer`, `Messenger`, `Memory`.
+
+Please consider the following implementation notes:
+
+* Be sure your `Graph` implementation is named as `XXXGraph` (e.g. TinkerGraph, Neo4jGraph, HadoopGraph, etc.).
+* Use `StringHelper` to ensure that the `toString()` representation of classes is consistent with other implementations.
+* Ensure that your implementation's `Features` (Graph, Vertex, etc.) are correct so that test cases handle particulars accordingly.
+* Use the numerous static method helper classes such as `ElementHelper`, `GraphComputerHelper`, `VertexProgramHelper`, etc.
+* There are a number of default methods on the provided interfaces that are semantically correct. However, if they are
+not efficient for the implementation, override them.
+* Implement the `structure/` package interfaces first and then, if desired, the interfaces in the `process/` package.
+* `ComputerGraph` is a `Wrapper` system that ensures proper semantics during a GraphComputer computation.
+
+[[oltp-implementations]]
+OLTP Implementations
+^^^^^^^^^^^^^^^^^^^^
+
+image:pipes-character-1.png[width=110,float=right] The most important interfaces to implement are in the `structure/`
+package. These include interfaces like Graph, Vertex, Edge, Property, Transaction, etc. The `StructureStandardSuite`
+will ensure that the semantics of the methods implemented are correct. Moreover, there are numerous `Exceptions`
+classes with static exceptions that should be thrown by the graph system so that all the exceptions and their
+messages are consistent amongst all TinkerPop3 implementations.
+
+[[olap-implementations]]
+OLAP Implementations
+^^^^^^^^^^^^^^^^^^^^
+
+image:furnace-character-1.png[width=110,float=right] Implementing the OLAP interfaces may be a bit more complicated.
+Note that before OLAP interfaces are implemented, it is necessary for the OLTP interfaces to be, at a minimum,
+implemented as specified in <<oltp-implementations,OLTP Implementations>>. A summary of each required interface
+implementation is presented below:
+
+. `GraphComputer`: A fluent builder for specifying an isolation level, a VertexProgram, and any number of MapReduce jobs to be submitted.
+. `Memory`: A global blackboard for ANDing, ORing, INCRing, and SETing values for specified keys.
+. `Messenger`: The system that collects and distributes messages being propagated by vertices executing the VertexProgram application.
+. `MapReduce.MapEmitter`: The system that collects key/value pairs being emitted by the MapReduce application's map-phase.
+. `MapReduce.ReduceEmitter`: The system that collects key/value pairs being emitted by the MapReduce application's combine- and reduce-phases.
+
+NOTE: The VertexProgram and MapReduce interfaces in the `process/computer/` package are not required by the graph
+system. Instead, these are interfaces to be implemented by application developers writing VertexPrograms and MapReduce jobs.
+
+IMPORTANT: TinkerPop3 provides three OLAP implementations: <<tinkergraph-gremlin,TinkerGraphComputer>> (TinkerGraph),
+<<giraphgraphcomputer,GiraphGraphComputer>> (HadoopGraph), and <<sparkgraphcomputer,`SparkGraphComputer`>> (HadoopGraph).
+Given the complexity of the OLAP system, it is good to study and copy many of the patterns used in these reference
+implementations.
+
+Implementing GraphComputer
+++++++++++++++++++++++++++
+
+image:furnace-character-3.png[width=150,float=right] The most complex method in GraphComputer is the `submit()`-method. The method must do the following:
+
+. Ensure that the GraphComputer has not already been executed.
+. Ensure that there is at least one VertexProgram or one MapReduce job.
+. If there is a VertexProgram, validate that it can execute on the GraphComputer given the respectively defined features.
+. Create the Memory to be used for the computation.
+. Execute the VertexProgram.setup() method once and only once.
+. Execute the VertexProgram.execute() method for each vertex.
+. Execute the VertexProgram.terminate() method once and if false, repeat VertexProgram.execute().
+. When VertexProgram.terminate() returns true, move to MapReduce job execution.
+. MapReduce jobs are not required to be executed in any specified order.
+. For each Vertex, execute MapReduce.map(). Then (if defined) execute MapReduce.combine() and MapReduce.reduce().
+. Update Memory with runtime information.
+. Construct a new `ComputerResult` containing the compute Graph and Memory.
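The control flow of the VertexProgram portion of the steps above can be sketched in plain Java. This is a minimal illustration and not the TinkerPop API: `SketchProgram`, `FixedRoundsProgram` and `SubmitSketch` are hypothetical names, vertices are reduced to strings, and Memory handling, feature validation and the MapReduce stage are left out.

```java
import java.util.List;

// Hypothetical stand-in for a vertex program; this is NOT the TinkerPop VertexProgram interface.
interface SketchProgram {
    void setup(List<String> log);
    void execute(String vertex, List<String> log);
    boolean terminate(int completedRounds);
}

// Example program that asks to stop after a fixed number of rounds.
class FixedRoundsProgram implements SketchProgram {
    private final int maxRounds;
    FixedRoundsProgram(int maxRounds) { this.maxRounds = maxRounds; }
    public void setup(List<String> log) { log.add("setup"); }
    public void execute(String vertex, List<String> log) { log.add("execute:" + vertex); }
    public boolean terminate(int completedRounds) { return completedRounds >= maxRounds; }
}

// Hypothetical computer that mirrors the ordering constraints of submit().
class SubmitSketch {
    private boolean executed = false;

    // Runs the program to completion and returns the number of rounds that were executed.
    int submit(SketchProgram program, List<String> vertices, List<String> log) {
        if (executed) throw new IllegalStateException("This computer has already been executed");
        if (program == null) throw new IllegalStateException("No program was provided");
        executed = true;
        program.setup(log);                       // once and only once
        int completedRounds = 0;
        do {
            for (String vertex : vertices) {      // execute() for each vertex
                program.execute(vertex, log);
            }
            completedRounds++;
        } while (!program.terminate(completedRounds)); // false: run another round
        return completedRounds;
    }
}
```

Calling `submit()` a second time on the same `SubmitSketch` instance throws, which corresponds to the first requirement in the list.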
+
+Implementing Memory
++++++++++++++++++++
+
+image:gremlin-brain.png[width=175,float=left] The Memory object is initially defined by `VertexProgram.setup()`.
+The memory data is available in the first round of the `VertexProgram.execute()` method. Each Vertex, when executing
+the VertexProgram, can update the Memory in its round. However, the update is not seen by the other vertices until
+the next round. At the end of the first round, all the updates are aggregated and the new memory data is available
+on the second round. This process repeats until the VertexProgram terminates.
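The round-based visibility described above can be sketched with two maps, one holding the values readable in the current round and one collecting the updates written during it. This is a single-threaded illustration only: `RoundMemory` and its methods are hypothetical names, not the TinkerPop `Memory` interface, and only an increment-style update is shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical double-buffered memory; NOT the TinkerPop Memory interface.
class RoundMemory {
    private final Map<String, Long> visible = new HashMap<>(); // readable in the current round
    private final Map<String, Long> pending = new HashMap<>(); // updates written in the current round

    // Initial values, as a setup phase would define them.
    void set(String key, long value) { visible.put(key, value); }

    // Reads only see the state aggregated at the end of the previous round.
    long get(String key) { return visible.getOrDefault(key, 0L); }

    // Updates are buffered and stay invisible until the round completes.
    void incr(String key, long delta) { pending.merge(key, delta, Long::sum); }

    // Aggregates all buffered updates so that they become readable in the next round.
    void completeRound() {
        pending.forEach((key, delta) -> visible.merge(key, delta, Long::sum));
        pending.clear();
    }
}
```

A real implementation would also need to be safe for concurrent workers, which this sketch does not attempt.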
+
+Implementing Messenger
+++++++++++++++++++++++
+
+The Messenger object is similar to the Memory object in that a vertex can read and write to the Messenger. However,
+the data it reads are the messages sent to the vertex in the previous round and the data it writes are the messages
+that will be readable by the receiving vertices in the subsequent round.
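The same idea can be sketched for messages with an inbox and an outbox that are swapped between rounds. `RoundMessenger` is a hypothetical, single-threaded illustration keyed by string vertex ids; it is not the TinkerPop `Messenger` interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical double-buffered mailbox; NOT the TinkerPop Messenger interface.
class RoundMessenger<M> {
    private Map<String, List<M>> inbox = new HashMap<>();  // messages sent in the previous round
    private Map<String, List<M>> outbox = new HashMap<>(); // messages sent in the current round

    // Writes go to the outbox and are not readable until the next round.
    void send(String toVertexId, M message) {
        outbox.computeIfAbsent(toVertexId, id -> new ArrayList<>()).add(message);
    }

    // Reads only return what was sent to this vertex in the previous round.
    List<M> receive(String vertexId) {
        return inbox.getOrDefault(vertexId, Collections.emptyList());
    }

    // At the round boundary the outbox becomes the inbox and old messages are dropped.
    void completeRound() {
        inbox = outbox;
        outbox = new HashMap<>();
    }
}
```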
+
+Implementing MapReduce Emitters
++++++++++++++++++++++++++++++++
+
+image:hadoop-logo-notext.png[width=150,float=left] The MapReduce framework in TinkerPop3 is similar to the model
+popularized by link:http://hadoop.apache.org[Hadoop]. The primary difference is that all Mappers process the vertices
+of the graph, not an arbitrary key/value pair. However, the vertices' edges can not be accessed -- only their
+properties. This greatly reduces the amount of data needed to be pushed through the MapReduce engine as any edge
+information required can be computed in the VertexProgram.execute() method. Moreover, at this stage, vertices can
+not be mutated, only their token and property data read. A Gremlin OLAP system needs to provide implementations for
+two particular classes: `MapReduce.MapEmitter` and `MapReduce.ReduceEmitter`. TinkerGraph's implementation is provided
+below which demonstrates the simplicity of the algorithm (especially when the data is all within the same JVM).
+
+[source,java]
+----
+public class TinkerMapEmitter<K, V> implements MapReduce.MapEmitter<K, V> {
+
+    public Map<K, Queue<V>> reduceMap;
+    public Queue<KeyValue<K, V>> mapQueue;
+    private final boolean doReduce;
+
+    public TinkerMapEmitter(final boolean doReduce) { <1>
+        this.doReduce = doReduce;
+        if (this.doReduce)
+            this.reduceMap = new ConcurrentHashMap<>();
+        else
+            this.mapQueue = new ConcurrentLinkedQueue<>();
+    }
+
+    @Override
+    public void emit(K key, V value) {
+        if (this.doReduce)
+            this.reduceMap.computeIfAbsent(key, k -> new ConcurrentLinkedQueue<>()).add(value); <2>
+        else
+            this.mapQueue.add(new KeyValue<>(key, value)); <3>
+    }
+
+    protected void complete(final MapReduce<K, V, ?, ?, ?> mapReduce) {
+        if (!this.doReduce && mapReduce.getMapKeySort().isPresent()) { <4>
+            final Comparator<K> comparator = mapReduce.getMapKeySort().get();
+            final List<KeyValue<K, V>> list = new ArrayList<>(this.mapQueue);
+            Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator));
+            this.mapQueue.clear();
+            this.mapQueue.addAll(list);
+        } else if (mapReduce.getMapKeySort().isPresent()) {
+            final Comparator<K> comparator = mapReduce.getMapKeySort().get();
+            final List<Map.Entry<K, Queue<V>>> list = new ArrayList<>();
+            list.addAll(this.reduceMap.entrySet());
+            Collections.sort(list, Comparator.comparing(Map.Entry::getKey, comparator));
+            this.reduceMap = new LinkedHashMap<>();
+            list.forEach(entry -> this.reduceMap.put(entry.getKey(), entry.getValue()));
+        }
+    }
+}
+----
+
+<1> If the MapReduce job has a reduce, then use one data structure (`reduceMap`), else use another (`mapQueue`). The
+difference being that a reduction requires a grouping by key and therefore, the `Map<K,Queue<V>>` definition. If no
+reduction/grouping is required, then a simple `Queue<KeyValue<K,V>>` can be leveraged.
+<2> If reduce is to follow, then append the value to the queue stored under the key. `computeIfAbsent()` creates
+that queue the first time the key is seen.
+<3> If no reduce is to follow, then simply append a KeyValue to the queue.
+<4> When the map phase is complete, any map-result sorting required can be executed at this point.
+
+[source,java]
+----
+public class TinkerReduceEmitter<OK, OV> implements MapReduce.ReduceEmitter<OK, OV> {
+
+    protected Queue<KeyValue<OK, OV>> reduceQueue = new ConcurrentLinkedQueue<>();
+
+    @Override
+    public void emit(final OK key, final OV value) {
+        this.reduceQueue.add(new KeyValue<>(key, value));
+    }
+
+    protected void complete(final MapReduce<?, ?, OK, OV, ?> mapReduce) {
+        if (mapReduce.getReduceKeySort().isPresent()) {
+            final Comparator<OK> comparator = mapReduce.getReduceKeySort().get();
+            final List<KeyValue<OK, OV>> list = new ArrayList<>(this.reduceQueue);
+            Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator));
+            this.reduceQueue.clear();
+            this.reduceQueue.addAll(list);
+        }
+    }
+}
+----
+
+The method `MapReduce.reduce()` is defined as:
+
+[source,java]
+public void reduce(final OK key, final Iterator<OV> values, final ReduceEmitter<OK, OV> emitter) { ... }
+
+In other words, for the TinkerGraph implementation, iterate through the entrySet of the `reduceMap` and call the
+`reduce()` method on each entry. The `reduce()` method can emit key/value pairs which are simply aggregated into a
+`Queue<KeyValue<OK,OV>>` in an analogous fashion to `TinkerMapEmitter` when no reduce is to follow. These two emitters
+are tied together in `TinkerGraphComputer.submit()`.
+
+[source,java]
+----
+...
+for (final MapReduce mapReduce : mapReducers) {
+    if (mapReduce.doStage(MapReduce.Stage.MAP)) {
+        final TinkerMapEmitter<?, ?> mapEmitter = new TinkerMapEmitter<>(mapReduce.doStage(MapReduce.Stage.REDUCE));
+        final SynchronizedIterator<Vertex> vertices = new SynchronizedIterator<>(this.graph.vertices());
+        workers.setMapReduce(mapReduce);
+        workers.mapReduceWorkerStart(MapReduce.Stage.MAP);
+        workers.executeMapReduce(workerMapReduce -> {
+            while (true) {
+                final Vertex vertex = vertices.next();
+                if (null == vertex) return;
+                workerMapReduce.map(ComputerGraph.mapReduce(vertex), mapEmitter);
+            }
+        });
+        workers.mapReduceWorkerEnd(MapReduce.Stage.MAP);
+
+        // sort results if a map output sort is defined
+        mapEmitter.complete(mapReduce);
+
+        // no need to run combiners as this is single machine
+        if (mapReduce.doStage(MapReduce.Stage.REDUCE)) {
+            final TinkerReduceEmitter<?, ?> reduceEmitter = new TinkerReduceEmitter<>();
+            final SynchronizedIterator<Map.Entry<?, Queue<?>>> keyValues = new SynchronizedIterator((Iterator) mapEmitter.reduceMap.entrySet().iterator());
+            workers.mapReduceWorkerStart(MapReduce.Stage.REDUCE);
+            workers.executeMapReduce(workerMapReduce -> {
+                while (true) {
+                    final Map.Entry<?, Queue<?>> entry = keyValues.next();
+                    if (null == entry) return;
+                    workerMapReduce.reduce(entry.getKey(), entry.getValue().iterator(), reduceEmitter);
+                }
+            });
+            workers.mapReduceWorkerEnd(MapReduce.Stage.REDUCE);
+            reduceEmitter.complete(mapReduce); // sort results if a reduce output sort is defined
+            mapReduce.addResultToMemory(this.memory, reduceEmitter.reduceQueue.iterator()); <1>
+        } else {
+            mapReduce.addResultToMemory(this.memory, mapEmitter.mapQueue.iterator()); <2>
+        }
+    }
+}
+...
+----
+
+<1> Note that the final results of the reducer are provided to the Memory as specified by the application developer's
+`MapReduce.addResultToMemory()` implementation.
+<2> If there is no reduce stage, then the map-stage results are inserted into Memory as specified by the application
+developer's `MapReduce.addResultToMemory()` implementation.
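The overall shape of the map, group-by-key and reduce flow can also be sketched without any TinkerPop classes. In the illustration below each vertex is modeled as a plain property map and the job counts vertices per `lang` value. `MapReduceSketch` and `countByLang` are hypothetical names, and a sorted map stands in for an optional key sort.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-threaded sketch of map -> group by key -> reduce.
class MapReduceSketch {

    // Counts vertices per "lang" property value; vertices without the property emit nothing.
    static Map<String, Integer> countByLang(List<Map<String, String>> vertices) {
        // Map phase: emit (lang, 1) per vertex and group the emitted values by key.
        Map<String, List<Integer>> reduceMap = new TreeMap<>();
        for (Map<String, String> vertex : vertices) {
            String lang = vertex.get("lang");
            if (lang != null) {
                reduceMap.computeIfAbsent(lang, key -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: one call per key, summing the grouped values.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : reduceMap.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }
}
```

Only vertex properties are read, which matches the restriction that edges are not accessible during the MapReduce stage.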
+
+[[io-implementations]]
+IO Implementations
+^^^^^^^^^^^^^^^^^^
+
+If a `Graph` requires custom serializers for IO to work properly, implement the `Graph.io` method.  A typical example
+of where a `Graph` would require such custom serializers is if its identifier system uses non-primitive values,
+such as OrientDB's `Rid` class.  From basic serialization of a single `Vertex` all the way up the stack to Gremlin
+Server, the need to know how to handle these complex identifiers is an important requirement.
+
+The first step to implementing custom serializers is to implement the `IoRegistry` interface and register the
+custom classes and serializers to it. Each `Io` implementation has different requirements for what it expects from the
+`IoRegistry`:
+
+* *GraphML* - No custom serializers expected/allowed.
+* *GraphSON* - Register a Jackson `SimpleModule`.  The `SimpleModule` encapsulates specific classes to be serialized,
+so it does not need to be registered to a specific class in the `IoRegistry` (use `null`).
+* *Gryo* - Expects registration of one of three objects:
+** Register just the custom class with a `null` Kryo `Serializer` implementation - this class will use default "field-level" Kryo serialization.
+** Register the custom class with a specific Kryo `Serializer` implementation.
+** Register the custom class with a `Function<Kryo, Serializer>` for those cases where the Kryo `Serializer` requires the `Kryo` instance to get constructed.
+
+This implementation should provide a zero-arg constructor as the stack may require instantiation via reflection.
+Consider extending `AbstractIoRegistry` for convenience as follows:
+
+[source,java]
+----
+public class MyGraphIoRegistry extends AbstractIoRegistry {
+    public MyGraphIoRegistry() {
+        register(GraphSONIo.class, null, new MyGraphSimpleModule());
+        register(GryoIo.class, MyGraphIdClass.class, new MyGraphIdSerializer());
+    }
+}
+----
+
+In the `Graph.io` method, provide the `IoRegistry` object to the supplied `Builder` and call the `create` method to
+return that `Io` instance as follows:
+
+[source,java]
+----
+public <I extends Io> I io(final Io.Builder<I> builder) {
+    return (I) builder.graph(this).registry(myGraphIoRegistry).create();
+}
+----
+
+In this way, `Graph` implementations can pre-configure custom serializers for IO interactions and users will not need
+to know about those details. Following this pattern will ensure proper execution of the test suite as well as
+simplified usage for end-users.
+
+IMPORTANT: Proper implementation of IO is critical to successful `Graph` operations in Gremlin Server.  The Test Suite
+does have "serialization" tests that provide some assurance that an implementation is working properly, but those
+tests cannot make assertions against any specifics of a custom serializer.  It is the responsibility of the
+implementer to test the specifics of their custom serializers.
+
+TIP: Consider separating serializer code into its own module, if possible, so that clients that use the `Graph`
+implementation remotely don't need a full dependency on the entire `Graph` - just the IO components and related
+classes being serialized.
+
+[[validating-with-gremlin-test]]
+Validating with Gremlin-Test
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+image:gremlin-edumacated.png[width=225]
+
+[source,xml]
+<dependency>
+  <groupId>org.apache.tinkerpop</groupId>
+  <artifactId>gremlin-test</artifactId>
+  <version>x.y.z</version>
+</dependency>
+<dependency>
+  <groupId>org.apache.tinkerpop</groupId>
+  <artifactId>gremlin-groovy-test</artifactId>
+  <version>x.y.z</version>
+</dependency>
+
+The operational semantics of any OLTP or OLAP implementation are validated by `gremlin-test` and functional
+interoperability with the Groovy environment is ensured by `gremlin-groovy-test`. To implement these tests, provide
+test case implementations as shown below, where `XXX` below denotes the name of the graph implementation (e.g.
+TinkerGraph, Neo4jGraph, HadoopGraph, etc.).
+
+[source,java]
+----
+// Structure API tests
+@RunWith(StructureStandardSuite.class)
+@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
+public class XXXStructureStandardTest {}
+
+// Process API tests
+@RunWith(ProcessComputerSuite.class)
+@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
+public class XXXProcessComputerTest {}
+
+@RunWith(ProcessStandardSuite.class)
+@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
+public class XXXProcessStandardTest {}
+
+@RunWith(GroovyEnvironmentSuite.class)
+@GraphProviderClass(provider = XXXProvider.class, graph = TinkerGraph.class)
+public class XXXGroovyEnvironmentTest {}
+
+@RunWith(GroovyProcessStandardSuite.class)
+@GraphProviderClass(provider = XXXGraphProvider.class, graph = TinkerGraph.class)
+public class XXXGroovyProcessStandardTest {}
+
+@RunWith(GroovyProcessComputerSuite.class)
+@GraphProviderClass(provider = XXXGraphComputerProvider.class, graph = TinkerGraph.class)
+public class XXXGroovyProcessComputerTest {}
+----
+
+The above set of tests represents the minimum test suite set to implement.  There are other "integration" and
+"performance" tests that should be considered optional.  Implementing those tests requires the same pattern as shown above.
+
+IMPORTANT: It is as important to look at "ignored" tests as it is to look at ones that fail.  The `gremlin-test`
+suite utilizes the `Feature` implementation exposed by the `Graph` to determine which tests to execute.  If a test
+utilizes features that are not supported by the graph, the suite will ignore that test.  While that may be fine, implementers
+should validate that the ignored tests are appropriately bypassed and that there are no mistakes in their feature
+definitions.  Moreover, implementers should consider filling gaps in their own test suites, especially when
+IO-related tests are being ignored.
+
+The only test-class that requires any code investment is the `GraphProvider` implementation class. This class is
+used by the test suite to construct `Graph` configurations and instances and provides information about the
+implementation itself.  In most cases, it is best to simply extend `AbstractGraphProvider` as it provides many
+default implementations of the `GraphProvider` interface.
+
+Finally, specify the test suites that will be supported by the `Graph` implementation using the `@Graph.OptIn`
+annotation.  See the `TinkerGraph` implementation below as an example:
+
+[source,java]
+----
+@Graph.OptIn(Graph.OptIn.SUITE_STRUCTURE_STANDARD)
+@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD)
+@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER)
+@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_STANDARD)
+@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_COMPUTER)
+@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_ENVIRONMENT)
+public class TinkerGraph implements Graph {
+----
+
+Only include annotations for the suites the implementation will support.  Note that implementing the suite, but
+not specifying the appropriate annotation will prevent the suite from running (an obvious error message will appear
+in this case when running the mis-configured suite).
+
+There are times when there may be a specific test in the suite that the implementation cannot support (despite the
+features it implements) or should not otherwise be executed.  It is possible for implementers to "opt-out" of a test
+by using the `@Graph.OptOut` annotation.  The following is an example of this annotation usage as taken from
+`HadoopGraph`:
+
+[source, java]
+----
+@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD)
+@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER)
+@Graph.OptOut(
+        test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals",
+        method = "g_V_matchXa_hasXname_GarciaX__a_inXwrittenByX_b__a_inXsungByX_bX",
+        reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.")
+@Graph.OptOut(
+        test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals",
+        method = "g_V_matchXa_inXsungByX_b__a_inXsungByX_c__b_outXwrittenByX_d__c_outXwrittenByX_e__d_hasXname_George_HarisonX__e_hasXname_Bob_MarleyXX",
+        reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.")
+@Graph.OptOut(
+        test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest",
+        method = "shouldNotAllowBadMemoryKeys",
+        reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.")
+@Graph.OptOut(
+        test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest",
+        method = "shouldRequireRegisteringMemoryKeys",
+        reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.")
+public class HadoopGraph implements Graph {
+----
+
+The above examples show how to ignore individual tests.  It is also possible to:
+
+* Ignore an entire test case (i.e. all the methods within the test) by setting the `method` to "*".
+* Ignore a "base" test class such that tests that extend from those classes will all be ignored.  This style of
+ignoring is useful for Gremlin "process" tests that have base classes that are extended by various Gremlin flavors (e.g. groovy).
+* Ignore a `GraphComputer` test based on the type of `GraphComputer` being used.  Specify the "computer" attribute on
+the `OptOut` (which is an array specification) which should have a value of the `GraphComputer` implementation class
+that should ignore that test. This attribute should be left empty for "standard" execution and by default all
+`GraphComputer` implementations will be included in the `OptOut` so if there are multiple implementations, explicitly
+specify the ones that should be excluded.
+
+Also note that some of the tests in the Gremlin Test Suite are parameterized tests and require an additional level of
+specificity to be properly ignored.  To ignore these types of tests, examine the name template of the parameterized
+tests.  It is defined by a Java annotation that looks like this:
+
+[source, java]
+@Parameterized.Parameters(name = "expect({0})")
+
+The annotation above shows that the name of each parameterized test will be prefixed with "expect" and have
+parentheses wrapped around the first parameter (at index 0) value supplied to each test.  This information can
+only be garnered by studying the test set up itself.  Once the pattern is determined and the specific unique name of
+the parameterized test is identified, add it to the `specific` property on the `OptOut` annotation in addition to
+the other arguments.
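To see what a concrete test name looks like, the template can be expanded by hand. The sketch below assumes the test runner fills in positional placeholders such as `{0}` in the manner of `java.text.MessageFormat`. `ParameterizedNameSketch` and the parameter value `someValue` are hypothetical and only serve the illustration.

```java
import java.text.MessageFormat;

// Hypothetical helper that expands a parameterized test name template.
class ParameterizedNameSketch {

    // Replaces {0}, {1}, ... in the template with the supplied parameter values.
    static String expand(String template, Object... parameters) {
        return MessageFormat.format(template, parameters);
    }
}
```

With the template shown above, `ParameterizedNameSketch.expand("expect({0})", "someValue")` yields `expect(someValue)`, which is the kind of string that would be supplied to the `specific` property.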
+
+These annotations help provide users a level of transparency into test suite compliance (via the
+xref:describe-graph[describeGraph()] utility function). They also allow implementers to have a lot of flexibility in
+terms of how they wish to support TinkerPop.  For example, maybe there is a single test case that prevents an
+implementer from claiming support of a `Feature`.  The implementer could choose to either not support the `Feature`
+or to support it but "opt-out" of the test with a "reason" as to why so that users understand the limitation.
+
+IMPORTANT: Before using `OptOut`, be sure that the reason for using it is sound and treat it as a last resort.
+It is possible that a test from the suite doesn't properly represent the expectations of a feature, is too broad or
+narrow for the semantics it is trying to enforce or simply contains a bug.  Please consider raising issues on the
+developer mailing list with such concerns before assuming `OptOut` is the only answer.
+
+IMPORTANT: There are no tests that specifically validate complete compliance with Gremlin Server.  Generally speaking,
+a `Graph` that passes the full Test Suite, should be compliant with Gremlin Server.  The one area where problems can
+occur is in serialization.  Always ensure that IO is properly implemented, that custom serializers are tested fully
+and ultimately integration test the `Graph` with an actual Gremlin Server instance.
+
+CAUTION: Configuring tests to run in parallel might result in errors that are difficult to debug as there is some
+shared state in test execution around graph configuration.  It is therefore recommended that parallelism be turned
+off for the test suite (the Maven Surefire Plugin is configured this way by default).  It may also be important to
+include this setting, `<reuseForks>false</reuseForks>`, in the Surefire configuration if tests are failing in an
+unexplainable way.
+
+Accessibility via GremlinPlugin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+image:gremlin-plugin.png[width=100,float=left] The applications distributed with TinkerPop3 do not distribute with
+any graph system implementations besides TinkerGraph. If your implementation is stored in a Maven repository (e.g.
+Maven Central Repository), then it is best to provide a `GremlinPlugin` implementation so the respective jars can be
+downloaded accordingly and when required by the user. Neo4j's GremlinPlugin is provided below for reference.
+
+[source,java]
+----
+public class Neo4jGremlinPlugin implements GremlinPlugin {
+
+    private static final String IMPORT = "import ";
+    private static final String DOT_STAR = ".*";
+
+    private static final Set<String> IMPORTS = new HashSet<String>() {{
+        add(IMPORT + Neo4jGraph.class.getPackage().getName() + DOT_STAR);
+    }};
+
+    @Override
+    public String getName() {
+        return "neo4j";
+    }
+
+    @Override
+    public void pluginTo(final PluginAcceptor pluginAcceptor) {
+        pluginAcceptor.addImports(IMPORTS);
+    }
+}
+----
+
+With the above plugin implementations, users can now download respective binaries for Gremlin Console, Gremlin Server, etc.
+
+[source,groovy]
+gremlin> g = Neo4jGraph.open('/tmp/neo4j')
+No such property: Neo4jGraph for class: groovysh_evaluate
+Display stack trace? [yN]
+gremlin> :install org.apache.tinkerpop neo4j-gremlin x.y.z
+==>loaded: [org.apache.tinkerpop, neo4j-gremlin, …]
+gremlin> :plugin use tinkerpop.neo4j
+==>tinkerpop.neo4j activated
+gremlin> g = Neo4jGraph.open('/tmp/neo4j')
+==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
+
+In-Depth Implementations
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+image:gremlin-painting.png[width=200,float=right] The graph system implementation details presented thus far are
+minimum requirements necessary to yield a valid TinkerPop3 implementation. However, there are other areas that a
+graph system provider can tweak to provide an implementation more optimized for their underlying graph engine. Typical
+areas of focus include:
+
+* Traversal Strategies: A <<traversalstrategy,TraversalStrategy>> can be used to alter a traversal prior to its
+execution. A typical example is converting a pattern of `g.V().has('name','marko')` into a global index lookup for
+all vertices with name "marko". In this way, an `O(|V|)` lookup becomes an `O(log(|V|))` lookup. Please review
+`TinkerGraphStepStrategy` for ideas.
+* Step Implementations: Every <<graph-traversal-steps,step>> is ultimately referenced by the `GraphTraversal`
+interface. It is possible to extend `GraphTraversal` to use a graph system specific step implementation.
+
+
+[[tinkergraph-gremlin]]
+TinkerGraph-Gremlin
+-------------------
+
+[source,xml]
+----
+<dependency>
+   <groupId>org.apache.tinkerpop</groupId>
+   <artifactId>tinkergraph-gremlin</artifactId>
+   <version>x.y.z</version>
+</dependency>
+----
+
+image:tinkerpop-character.png[width=100,float=left] TinkerGraph is a single machine, in-memory (with optional
+persistence), non-transactional graph engine that provides both OLTP and OLAP functionality. It is deployed with
+TinkerPop3 and serves as the reference implementation for other providers to study in order to understand the
+semantics of the various methods of the TinkerPop3 API. Constructing a simple graph in Java8 is presented below.
+
+[source,java]
+Graph g = TinkerGraph.open();
+Vertex marko = g.addVertex("name","marko","age",29);
+Vertex lop = g.addVertex("name","lop","lang","java");
+marko.addEdge("created",lop,"weight",0.6d);
+
+The above graph creates two vertices named "marko" and "lop" and connects them via a created-edge with a weight=0.6
+property. Next, the graph can be queried as such.
+
+[source,java]
+g.V().has("name","marko").out("created").values("name")
+
+The `g.V().has("name","marko")` part of the query can be executed in two ways.
+
+ * A linear scan of all vertices filtering out those vertices that don't have the name "marko"
+ * An `O(log(|V|))` index lookup for all vertices with the name "marko"
+
+Given the initial graph construction in the first code block, no index was defined and thus, a linear scan is executed.
+However, if the graph was constructed as such, then an index lookup would be used.
+
+[source,java]
+TinkerGraph g = TinkerGraph.open();
+g.createIndex("name",Vertex.class);
+
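+The difference between the two execution paths can be sketched in plain Java. This is an illustrative sketch, not
+TinkerGraph's index code: the class `NameIndexSketch` and its methods are hypothetical, vertices are reduced to a
+list of names whose position stands in for the vertex id, and a `TreeMap` is assumed only because it gives the
+logarithmic lookup cost described above.
+
+[source,java]
+----
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.TreeMap;
+
+// Hypothetical sketch: contrasts a linear scan with a sorted-index lookup.
+public class NameIndexSketch {
+
+    // Linear scan: every name is inspected, so the cost grows with the number of vertices.
+    static List<Integer> scan(List<String> names, String wanted) {
+        List<Integer> hits = new ArrayList<>();
+        for (int id = 0; id < names.size(); id++) {
+            if (names.get(id).equals(wanted)) {
+                hits.add(id);
+            }
+        }
+        return hits;
+    }
+
+    // Build a sorted index from name to the ids of the vertices carrying that name.
+    static Map<String, List<Integer>> buildIndex(List<String> names) {
+        Map<String, List<Integer>> index = new TreeMap<>();
+        for (int id = 0; id < names.size(); id++) {
+            index.computeIfAbsent(names.get(id), k -> new ArrayList<>()).add(id);
+        }
+        return index;
+    }
+
+    // Index lookup: a single tree descent, logarithmic in the number of distinct names.
+    static List<Integer> lookup(Map<String, List<Integer>> index, String wanted) {
+        return index.getOrDefault(wanted, Collections.emptyList());
+    }
+
+    public static void main(String[] args) {
+        List<String> names = Arrays.asList("marko", "lop", "marko");
+        System.out.println(scan(names, "marko"));
+        System.out.println(lookup(buildIndex(names), "marko"));
+    }
+}
+----
+
+Both calls return the same ids; only the amount of work per lookup differs. The index is paid for once, when it is
+built, and again on every write that must keep it current.
+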
+The execution times for a vertex lookup by property are provided below for both the non-indexed and indexed versions
+of TinkerGraph over the Grateful Dead graph.
+
+[gremlin-groovy]
+----
+graph = TinkerGraph.open()
+g = graph.traversal()
+graph.io(graphml()).readGraph('data/grateful-dead.xml')
+clock(1000) {g.V().has('name','Garcia').iterate()} <1>
+graph = TinkerGraph.open()
+g = graph.traversal()
+graph.createIndex('name',Vertex.class)
+graph.io(graphml()).readGraph('data/grateful-dead.xml')
+clock(1000){g.V().has('name','Garcia').iterate()} <2>
+----
+
+<1> Determine the average runtime of 1000 vertex lookups when no `name`-index is defined.
+<2> Determine the average runtime of 1000 vertex lookups when a `name`-index is defined.
+
+IMPORTANT: Each graph system will have a different mechanism by which indices and schemas are defined. TinkerPop3
+does not require any conformance in this area. In TinkerGraph, the only definitions are around indices. With other
+graph systems, property value types, indices, edge labels, etc. may be required to be defined _a priori_ to adding
+data to the graph.
+
+NOTE: TinkerGraph is distributed with Gremlin Server and is therefore automatically available to it for configuration.
+
+Configuration
+~~~~~~~~~~~~~
+
+TinkerGraph has several settings that can be provided on creation via a `Configuration` object:
+
+[width="100%",cols="2,10",options="header"]
+|=========================================================
+|Property |Description
+|gremlin.graph |`org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph`
+|gremlin.tinkergraph.vertexIdManager |The `IdManager` implementation to use for vertices.
+|gremlin.tinkergraph.edgeIdManager |The `IdManager` implementation to use for edges.
+|gremlin.tinkergraph.vertexPropertyIdManager |The `IdManager` implementation to use for vertex properties.
+|gremlin.tinkergraph.defaultVertexPropertyCardinality |The default `VertexProperty.Cardinality` to use when `Vertex.property(k,v)` is called.
+|gremlin.tinkergraph.graphLocation |The path and file name for where TinkerGraph should persist the graph data. If a
+value is specified here, the `gremlin.tinkergraph.graphFormat` should also be specified.  If this value is not
+included (default), then the graph will stay in-memory and not be loaded/persisted to disk.
+|gremlin.tinkergraph.graphFormat |The format to use to serialize the graph which may be one of the following:
+`graphml`, `graphson`, `gryo`, or a fully qualified class name that implements the `Io.Builder` interface (which allows for
+external third party graph reader/writer formats to be used for persistence).
+If a value is specified here, then the `gremlin.tinkergraph.graphLocation` should
+also be specified.  If this value is not included (default), then the graph will stay in-memory and not be
+loaded/persisted to disk.
+|=========================================================
+
+The `IdManager` settings above refer to how TinkerGraph will control identifiers for vertices, edges and vertex
+properties.  There are several options for each of these settings: `ANY`, `LONG`, `INTEGER`, `UUID`, or the fully
+qualified class name of an `IdManager` implementation on the classpath.  When not specified, the default value
+for all settings is `ANY`, meaning that the graph will work with any object on the JVM as the identifier and will
+generate new identifiers from `Long` when the identifier is not user supplied.  TinkerGraph will also expect the
+user to understand the types used for identifiers when querying, meaning that `g.V(1)` and `g.V(1L)` could return
+two different vertices.  `LONG`, `INTEGER` and `UUID` settings will try to coerce identifier values to the expected
+type as well as generate new identifiers with that specified type.
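+
+The reason `g.V(1)` and `g.V(1L)` can differ under `ANY` is that an `Integer` and a `Long` with the same numeric value
+are distinct objects on the JVM. The plain-Java sketch below illustrates this and shows the kind of coercion a
+`LONG`-style setting implies. It is a sketch under stated assumptions, not TinkerGraph's `IdManager` code: the class
+`IdCoercionSketch` and its `toLong` helper are hypothetical.
+
+[source,java]
+----
+import java.util.HashMap;
+import java.util.Map;
+
+// Hypothetical sketch: why identifier types matter when ids are arbitrary objects.
+public class IdCoercionSketch {
+
+    // Coerce a numeric or string identifier to Long, as a LONG-style setting would be expected to.
+    static Long toLong(Object id) {
+        if (id instanceof Long) {
+            return (Long) id;
+        }
+        if (id instanceof Number) {
+            return ((Number) id).longValue();
+        }
+        if (id instanceof String) {
+            return Long.parseLong((String) id);
+        }
+        throw new IllegalArgumentException("Cannot coerce to Long: " + id);
+    }
+
+    // Without coercion, Integer 1 and Long 1L are two different keys.
+    static int distinctUncoerced() {
+        Map<Object, String> ids = new HashMap<>();
+        ids.put(1, "stored under Integer 1");
+        ids.put(1L, "stored under Long 1");
+        return ids.size();
+    }
+
+    // With coercion, both identifiers collapse to the same Long key.
+    static int distinctCoerced() {
+        Map<Object, String> ids = new HashMap<>();
+        ids.put(toLong(1), "first write");
+        ids.put(toLong(1L), "second write");
+        return ids.size();
+    }
+
+    public static void main(String[] args) {
+        System.out.println(distinctUncoerced());
+        System.out.println(distinctCoerced());
+    }
+}
+----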
+
+If the TinkerGraph is configured for persistence with `gremlin.tinkergraph.graphLocation` and
+`gremlin.tinkergraph.graphFormat`, then the graph will be written to the specified location with the specified
+format when `Graph.close()` is called.  In addition, if these settings are present, TinkerGraph will attempt to
+load the graph from the specified location.
+
+IMPORTANT: If choosing `graphson` as the `gremlin.tinkergraph.graphFormat`, be sure to also set the various
+`IdManager` settings so that identifiers are coerced to the appropriate types, as GraphSON can lose the identifier's
+type during serialization (i.e. it will assume `Integer` when the default for TinkerGraph is `Long`, which could lead
+to load errors that result in a message like "Vertex with id already exists").
+
+It is important to consider the data being imported to TinkerGraph with respect to the
+`defaultVertexPropertyCardinality` setting.  For example, if a `.gryo` file is known to contain multi-property data,
+be sure to set the default cardinality to `list` or else the data will import as `single`.  Consider the following:
+
+[gremlin-groovy]
+----
+graph = TinkerGraph.open()
+graph.io(gryo()).readGraph("data/tinkerpop-crew.kryo")
+g = graph.traversal()
+g.V().properties()
+conf = new BaseConfiguration()
+conf.setProperty("gremlin.tinkergraph.defaultVertexPropertyCardinality","list")
+graph = TinkerGraph.open(conf)
+graph.io(gryo()).readGraph("data/tinkerpop-crew.kryo")
+g = graph.traversal()
+g.V().properties()
+----
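+
+The effect of the default cardinality on an import can be sketched without TinkerGraph. The plain-Java example below
+is illustrative only: the class `CardinalitySketch`, its `Cardinality` enum and the sample `location` values are
+hypothetical and do not reproduce TinkerGraph's property handling. It assumes that `single` keeps only the most
+recent value for a key while `list` keeps all of them.
+
+[source,java]
+----
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+// Hypothetical sketch: how single versus list cardinality treats repeated keys.
+public class CardinalitySketch {
+
+    enum Cardinality { SINGLE, LIST }
+
+    // Add one value for a key; SINGLE replaces any earlier value, LIST appends.
+    static void property(Map<String, List<Object>> props, Cardinality c, String key, Object value) {
+        List<Object> values = props.computeIfAbsent(key, k -> new ArrayList<>());
+        if (c == Cardinality.SINGLE) {
+            values.clear();
+        }
+        values.add(value);
+    }
+
+    // Replay a sequence of key/value pairs, as an importer would.
+    static Map<String, List<Object>> load(Cardinality c, Object... keyValues) {
+        Map<String, List<Object>> props = new LinkedHashMap<>();
+        for (int i = 0; i + 1 < keyValues.length; i += 2) {
+            property(props, c, (String) keyValues[i], keyValues[i + 1]);
+        }
+        return props;
+    }
+
+    public static void main(String[] args) {
+        Object[] data = {"location", "santa cruz", "location", "santa fe"};
+        System.out.println(load(Cardinality.SINGLE, data));
+        System.out.println(load(Cardinality.LIST, data));
+    }
+}
+----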
+
+[[neo4j-gremlin]]
+Neo4j-Gremlin
+-------------
+
+[source,xml]
+----
+<dependency>
+   <groupId>org.apache.tinkerpop</groupId>
+   <artifactId>neo4j-gremlin</artifactId>
+   <version>x.y.z</version>
+</dependency>
+<!-- neo4j-tinkerpop-api-impl is NOT Apache 2 licensed - more information below -->
+<dependency>
+  <groupId>org.neo4j</groupId>
+  <artifactId>neo4j-tinkerpop-api-impl</artifactId>
+  <version>0.1-2.2</version>
+</dependency>
+----
+
+link:http://neotechnology.com[Neo Technology] are the developers of the OLTP-based link:http://neo4j.org[Neo4j graph database].
+
+CAUTION: Unless under a commercial agreement with Neo Technology, Neo4j is licensed under the
+link:http://en.wikipedia.org/wiki/Affero_General_Public_License[AGPL]. The `neo4j-gremlin` module is licensed Apache2
+because it only references the Apache2-licensed Neo4j API (not its implementation). Note that neither the
+<<gremlin-console,Gremlin Console>> nor <<gremlin-server,Gremlin Server>> distribute with the Neo4j implementation
+binaries. To access the binaries, use the `:install` command to download binaries from
+link:http://search.maven.org/[Maven Central Repository].
+
+[source,groovy]
+----
+gremlin> :install org.apache.tinkerpop neo4j-gremlin x.y.z
+==>Loaded: [org.apache.tinkerpop, neo4j-gremlin, x.y.z] - restart the console to use [tinkerpop.neo4j]
+gremlin> :q
+...
+gremlin> :plugin use tinkerpop.neo4j
+==>tinkerpop.neo4j activated
+gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
+==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
+----
+
+NOTE: Neo4j link:http://docs.neo4j.org/chunked/stable/ha.html[High Availability] is currently not supported by
+Neo4j-Gremlin.
+
+TIP: To host Neo4j in <<gremlin-server,Gremlin Server>>, the dependencies must first be "installed" or otherwise
+copied to the Gremlin Server path. The automated method for doing this would be to execute
+`bin/gremlin-server.sh -i org.apache.tinkerpop neo4j-gremlin x.y.z`.
+
+Indices
+~~~~~~~
+
+Neo4j 2.x indices leverage vertex labels to partition the index space. TinkerPop3 does not provide method interfaces
+for defining schemas/indices for the underlying graph system. Thus, in order to create indices, it is necessary to
+call the Neo4j API directly.
+
+NOTE: `Neo4jGraphStep` will attempt to discern which indices to use when executing a traversal of the form `g.V().has()`.
+
+The Gremlin-Console session below demonstrates Neo4j indices. For more information, please refer to the Neo4j documentation:
+
+* Manipulating indices with link:http://docs.neo4j.org/chunked/stable/query-schema-index.html[Cypher].
+* Manipulating indices with the Neo4j link:http://docs.neo4j.org/chunked/stable/tutorials-java-embedded-new-index.html[Java API].
+
+[gremlin-groovy]
+----
+graph = Neo4jGraph.open('/tmp/neo4j')
+graph.cypher("CREATE INDEX ON :person(name)")
+graph.tx().commit()  <1>
+graph.addVertex(label,'person','name','marko')
+graph.addVertex(label,'dog','name','puppy')
+g = graph.traversal()
+g.V().hasLabel('person').has('name','marko').values('name')
+graph.close()
+----
+
+<1> Schema mutations must happen in a different transaction than graph mutations.
+
+The session below demonstrates the runtime benefits of indices. It also shows that, when no index is defined (only
+vertex labels), a linear scan of the vertex-label partition is still faster than a linear scan of all vertices.
+
+[gremlin-groovy]
+----
+graph = Neo4jGraph.open('/tmp/neo4j')
+graph.io(graphml()).readGraph('data/grateful-dead.xml')
+g = graph.traversal()
+g.tx().commit()
+clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()}  <1>
+graph.cypher("CREATE INDEX ON :artist(name)") <2>
+g.tx().commit()
+Thread.sleep(5000) <3>
+clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()} <4>
+clock(1000) {g.V().has('name','Garcia').iterate()} <5>
+graph.cypher("DROP INDEX ON :artist(name)") <6>
+g.tx().commit()
+graph.close()
+----
+
+<1> Find all artists whose name is Garcia, which does a linear scan of the artist vertex-label partition.
+<2> Create an index for all artist vertices on their name property.
+<3> Neo4j indices are eventually consistent so this stalls to give the index time to populate itself.
+<4> Find all artists whose name is Garcia, which uses the pre-defined schema index.
+<5> Find all vertices whose name is Garcia, which requires a linear scan of all the data in the graph.
+<6> Drop the created index.
+
+Multi/Meta-Properties
+~~~~~~~~~~~~~~~~~~~~~
+
+`Neo4jGraph` supports both multi- and meta-properties (see <<_vertex_properties,vertex properties>>). These features
+are not native to Neo4j and are implemented using "hidden" Neo4j nodes. For example, when a vertex has multiple
+"name" properties, each property is a new node (multi-properties) which can have properties attached to it
+(meta-properties). As such, the native, underlying representation may become difficult to query directly using
+another graph language such as <<_cypher,Cypher>>. The default setting is to disable multi- and meta-properties.
+However, if this feature is desired, then it can be activated via `gremlin.neo4j.metaProperties` and
+`gremlin.neo4j.multiProperties` configurations being set to `true`. Once the configuration is set, it cannot be
+changed for the lifetime of the graph.
+
+[gremlin-groovy]
+----
+conf = new BaseConfiguration()
+conf.setProperty('gremlin.neo4j.directory','/tmp/neo4j')
+conf.setProperty('gremlin.neo4j.multiProperties',true)
+conf.setProperty('gremlin.neo4j.metaProperties',true)
+graph = Neo4jGraph.open(conf)
+g = graph.traversal()
+g.addV('name','michael','name','michael hunger','name','mhunger')
+g.V().properties('name').property('acl', 'public')
+g.V(0).valueMap()
+g.V(0).properties()
+g.V(0).properties().valueMap()
+graph.close()
+----
+
+WARNING: `Neo4jGraph` without multi- and meta-properties is in 1-to-1 correspondence with the native, underlying Neo4j
+representation. It is recommended that if the user does not require multi/meta-properties, then they should not
+enable them. Without multi- and meta-properties enabled, Neo4j can be used with other tools and technologies
+that do not leverage TinkerPop.
+
+IMPORTANT: When using a multi-property enabled `Neo4jGraph`, vertices may represent their properties on "hidden
+nodes" adjacent to the vertex. If a vertex property key/value is required for indexing, then two indices are
+required -- e.g. `CREATE INDEX ON :person(name)` and `CREATE INDEX ON :vertexProperty(name)`
+(see <<_indices,Neo4j indices>>).
+
+Cypher
+~~~~~~
+
+image::gremlin-loves-cypher.png[width=400]
+
+Neo Technology are the creators of the graph pattern-match query language link:http://www.neo4j.org/learn/cypher[Cypher].
+It is possible to leverage Cypher from within Gremlin by using the `Neo4jGraph.cypher()` graph traversal method.
+
+[gremlin-groovy]
+----
+graph = Neo4jGraph.open('/tmp/neo4j')
+graph.io(gryo()).readGraph('data/tinkerpop-modern.kryo')
+graph.cypher('MATCH (a {name:"marko"}) RETURN a')
+graph.cypher('MATCH (a {name:"marko"}) RETURN a').select('a').out('knows').values('name')
+graph.close()
+----
+
+Thus, like <<match-step,`match()`>>-step in Gremlin, it is possible to do a declarative pattern match and then move
+back into imperative Gremlin.
+
+TIP: For those developers using <<gremlin-server,Gremlin Server>> against Neo4j, it is possible to do Cypher queries
+by simply placing the Cypher string in `graph.cypher(...)` before submission to the server.
+
+Multi-Label
+~~~~~~~~~~~
+
+TinkerPop3 requires every `Element` (i.e. every `Vertex`, `Edge`, and `VertexProperty`) to have a single, immutable
+string label. In Neo4j, a `Node` (vertex) can have an
+link:http://neo4j.com/docs/stable/graphdb-neo4j-labels.html[arbitrary number of labels] while a `Relationship`
+(edge) can have one and only one. Furthermore, in Neo4j, `Node` labels are mutable while `Relationship` labels are
+not. In order to handle this mismatch, three `Neo4jVertex` specific methods exist in Neo4j-Gremlin.
+
+[source,java]
+public Set<String> labels() // get all the labels of the vertex
+public void addLabel(String label) // add a label to the vertex
+public void removeLabel(String label) // remove a label from the vertex
+
+An example use case is presented below.
+
+[gremlin-groovy]
+----
+graph = Neo4jGraph.open('/tmp/neo4j')
+vertex = (Neo4jVertex) graph.addVertex('human::animal') <1>
+vertex.label() <2>
+vertex.labels() <3>
+vertex.addLabel('organism') <4>
+vertex.label()
+vertex.removeLabel('human') <5>
+vertex.labels()
+vertex.addLabel('organism') <6>
+vertex.labels()
+vertex.removeLabel('human') <7>
+vertex.label()
+g = graph.traversal()
+g.V().has(label,'organism') <8>
+g.V().has(label,of('organism')) <9>
+g.V().has(label,of('organism')).has(label,of('animal'))
+g.V().has(label,of('organism').and(of('animal')))
+graph.close()
+----
+
+<1> Typecasting to a `Neo4jVertex` is only required in Java.
+<2> The standard `Vertex.label()` method returns all the labels in alphabetical order concatenated using `::`.
+<3> `Neo4jVertex.labels()` method returns the individual labels as a set.
+<4> `Neo4jVertex.addLabel()` method adds a single label.
+<5> `Neo4jVertex.removeLabel()` method removes a single label.
+<6> Labels are unique and thus duplicate labels don't exist.
+<7> If a label that does not exist is removed, nothing happens.
+<8> `P.eq()` does a full string match and should only be used if multi-labels are not leveraged.
+<9> `LabelP.of()` is specific to `Neo4jGraph` and used for multi-label matching.
+
+IMPORTANT: `LabelP.of()` is only required if multi-labels are leveraged. `LabelP.of()` is used when
+filtering/looking-up vertices by their label(s) as the standard `P.eq()` does a direct match on the `::`-representation
+of `vertex.label()`.
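+
+The difference between a full-string match and a per-label match on the `::`-representation can be sketched in plain
+Java. This is an illustrative sketch and not the Neo4j-Gremlin implementation: the class `MultiLabelSketch` and its
+`label`, `eq` and `of` helpers are hypothetical stand-ins for `Vertex.label()`, `P.eq()` and `LabelP.of()`.
+
+[source,java]
+----
+import java.util.Arrays;
+import java.util.Set;
+import java.util.TreeSet;
+
+// Hypothetical sketch: matching against labels concatenated with "::".
+public class MultiLabelSketch {
+
+    // Concatenate the labels in alphabetical order using "::".
+    static String label(Set<String> labels) {
+        return String.join("::", new TreeSet<>(labels));
+    }
+
+    // Full-string match: only succeeds if the whole concatenated label is equal.
+    static boolean eq(String label, String wanted) {
+        return label.equals(wanted);
+    }
+
+    // Per-label match: succeeds if any individual label is equal.
+    static boolean of(String label, String wanted) {
+        return Arrays.asList(label.split("::")).contains(wanted);
+    }
+
+    public static void main(String[] args) {
+        Set<String> labels = new TreeSet<>(Arrays.asList("human", "animal"));
+        labels.add("organism");
+        labels.remove("human");
+        String label = label(labels);
+        System.out.println(label);
+        System.out.println(eq(label, "organism"));
+        System.out.println(of(label, "organism"));
+    }
+}
+----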
+
+Loading with BulkLoaderVertexProgram
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The <<bulkloadervertexprogram, BulkLoaderVertexProgram>> is a generalized bulk loader that can be used to load
+large amounts of data to and from Neo4j. The following code demonstrates how to load the modern graph from TinkerGraph
+into Neo4j:
+
+[gremlin-groovy]
+----
+wgConf = 'conf/neo4j-standalone.properties'
+modern = TinkerFactory.createModern()
+blvp = BulkLoaderVertexProgram.build().
+           keepOriginalIds(false).
+           writeGraph(wgConf).create(modern)
+modern.compute().workers(1).program(blvp).submit().get()
+graph = GraphFactory.open(wgConf)
+g = graph.traversal()
+g.V().valueMap()
+graph.close()
+----
+
+[source,properties]
+----
+# neo4j-standalone.properties
+
+gremlin.graph=org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph
+gremlin.neo4j.directory=/tmp/neo4j
+gremlin.neo4j.conf.node_auto_indexing=true
+gremlin.neo4j.conf.relationship_auto_indexing=true
+----


