http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/gelly/index.md

diff git a/docs/apis/batch/libs/gelly/index.md b/docs/apis/batch/libs/gelly/index.md
deleted file mode 100644
index 643a318..0000000
 a/docs/apis/batch/libs/gelly/index.md
+++ /dev/null
@@ 1,74 +0,0 @@

title: "Gelly: Flink Graph API"
# Top navigation
topnavgroup: libs
topnavpos: 1
topnavtitle: "Graphs: Gelly"
# Sub navigation
subnavgroup: batch
subnavid: gelly
subnavpos: 1
subnavparent: libs
subnavtitle: Gelly

<!
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
>

Gelly is a Graph API for Flink. It contains a set of methods and utilities which aim to simplify the development of graph analysis applications in Flink. In Gelly, graphs can be transformed and modified using highlevel functions similar to the ones provided by the batch processing API. Gelly provides methods to create, transform and modify graphs, as well as a library of graph algorithms.

{:#markdowntoc}
* [Graph API](graph_api.html)
* [Iterative Graph Processing](iterative_graph_processing.html)
* [Library Methods](library_methods.html)
* [Graph Algorithms](graph_algorithms.html)
* [Graph Generators](graph_generators.html)

Using Gelly


Gelly is currently part of the *libraries* Maven project. All relevant classes are located in the *org.apache.flink.graph* package.

Add the following dependency to your `pom.xml` to use Gelly.

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight xml %}
<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flinkgelly{{ site.scala_version_suffix }}</artifactId>
 <version>{{site.version}}</version>
</dependency>
{% endhighlight %}
</div>
<div datalang="scala" markdown="1">
{% highlight xml %}
<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flinkgellyscala{{ site.scala_version_suffix }}</artifactId>
 <version>{{site.version}}</version>
</dependency>
{% endhighlight %}
</div>
</div>

Note that Gelly is currently not part of the binary distribution. See linking with it for cluster execution [here]({{ site.baseurl }}/apis/cluster_execution.html#linkingwithmodulesnotcontainedinthebinarydistribution).

The remaining sections provide a description of available methods and present several examples of how to use Gelly and how to mix it with the Flink DataSet API. After reading this guide, you might also want to check the {% gh_link /flinklibraries/flinkgellyexamples/ "Gelly examples" %}.

{% top %}
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/gelly/iterative_graph_processing.md

diff git a/docs/apis/batch/libs/gelly/iterative_graph_processing.md b/docs/apis/batch/libs/gelly/iterative_graph_processing.md
deleted file mode 100644
index d7a096e..0000000
 a/docs/apis/batch/libs/gelly/iterative_graph_processing.md
+++ /dev/null
@@ 1,971 +0,0 @@

title: Iterative Graph Processing

# Sub navigation
subnavgroup: batch
subnavparent: gelly
subnavtitle: Iterative Graph Processing

<!
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
>

Gelly exploits Flink's efficient iteration operators to support largescale iterative graph processing. Currently, we provide implementations of the vertexcentric, scattergather, and gathersumapply models. In the following sections, we describe these abstractions and show how you can use them in Gelly.

* This will be replaced by the TOC
{:toc}

## VertexCentric Iterations
The vertexcentric model, also known as "think like a vertex" or "Pregel", expresses computation from the perspective of a vertex in the graph.
The computation proceeds in synchronized iteration steps, called supersteps. In each superstep, each vertex executes one userdefined function.
Vertices communicate with other vertices through messages. A vertex can send a message to any other vertex in the graph, as long as it knows its unique ID.

The computational model is shown in the figure below. The dotted boxes correspond to parallelization units.
In each superstep, all active vertices execute the
same userdefined computation in parallel. Supersteps are executed synchronously, so that messages sent during one superstep are guaranteed to be delivered in the beginning of the next superstep.

<p class="textcenter">
 <img alt="VertexCentric Computational Model" width="70%" src="fig/vertexcentric supersteps.png"/>
</p>

To use vertexcentric iterations in Gelly, the user only needs to define the vertex compute function, `ComputeFunction`.
This function and the maximum number of iterations to run are given as parameters to Gelly's `runVertexCentricIteration`. This method will execute the vertexcentric iteration on the input Graph and return a new Graph, with updated vertex values. An optional message combiner, `MessageCombiner`, can be defined to reduce communication costs.

Let us consider computing SingleSourceShortestPaths with vertexcentric iterations. Initially, each vertex has a value of infinite distance, except from the source vertex, which has a value of zero. During the first superstep, the source propagates distances to its neighbors. During the following supersteps, each vertex checks its received messages and chooses the minimum distance among them. If this distance is smaller than its current value, it updates its state and produces messages for its neighbors. If a vertex does not change its value during a superstep, then it does not produce any messages for its neighbors for the next superstep. The algorithm converges when there are no value updates or the maximum number of supersteps has been reached. In this algorithm, a message combiner can be used to reduce the number of messages sent to a target vertex.


<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}
// read the input graph
Graph<Long, Double, Double> graph = ...

// define the maximum number of iterations
int maxIterations = 10;

// Execute the vertexcentric iteration
Graph<Long, Double, Double> result = graph.runVertexCentricIteration(
 new SSSPComputeFunction(), new SSSPCombiner(), maxIterations);

// Extract the vertices as the result
DataSet<Vertex<Long, Double>> singleSourceShortestPaths = result.getVertices();


//    UDFs    //

public static final class SSSPComputeFunction extends ComputeFunction<Long, Double, Double, Double> {

public void compute(Vertex<Long, Double> vertex, MessageIterator<Double> messages) {

 double minDistance = (vertex.getId().equals(srcId)) ? 0d : Double.POSITIVE_INFINITY;

 for (Double msg : messages) {
 minDistance = Math.min(minDistance, msg);
 }

 if (minDistance < vertex.getValue()) {
 setNewVertexValue(minDistance);
 for (Edge<Long, Double> e: getEdges()) {
 sendMessageTo(e.getTarget(), minDistance + e.getValue());
 }
 }
}

// message combiner
public static final class SSSPCombiner extends MessageCombiner<Long, Double> {

 public void combineMessages(MessageIterator<Double> messages) {

 double minMessage = Double.POSITIVE_INFINITY;
 for (Double msg: messages) {
 minMessage = Math.min(minMessage, msg);
 }
 sendCombinedMessage(minMessage);
 }
}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}
// read the input graph
val graph: Graph[Long, Double, Double] = ...

// define the maximum number of iterations
val maxIterations = 10

// Execute the vertexcentric iteration
val result = graph.runVertexCentricIteration(new SSSPComputeFunction, new SSSPCombiner, maxIterations)

// Extract the vertices as the result
val singleSourceShortestPaths = result.getVertices


//    UDFs    //

final class SSSPComputeFunction extends ComputeFunction[Long, Double, Double, Double] {

 override def compute(vertex: Vertex[Long, Double], messages: MessageIterator[Double]) = {

 var minDistance = if (vertex.getId.equals(srcId)) 0 else Double.MaxValue

 while (messages.hasNext) {
 val msg = messages.next
 if (msg < minDistance) {
 minDistance = msg
 }
 }

 if (vertex.getValue > minDistance) {
 setNewVertexValue(minDistance)
 for (edge: Edge[Long, Double] < getEdges) {
 sendMessageTo(edge.getTarget, vertex.getValue + edge.getValue)
 }
 }
}

// message combiner
final class SSSPCombiner extends MessageCombiner[Long, Double] {

 override def combineMessages(messages: MessageIterator[Double]) {

 var minDistance = Double.MaxValue

 while (messages.hasNext) {
 val msg = inMessages.next
 if (msg < minDistance) {
 minDistance = msg
 }
 }
 sendCombinedMessage(minMessage)
 }
}
{% endhighlight %}
</div>
</div>

{% top %}

## Configuring a VertexCentric Iteration
A vertexcentric iteration can be configured using a `VertexCentricConfiguration` object.
Currently, the following parameters can be specified:

* <strong>Name</strong>: The name for the vertexcentric iteration. The name is displayed in logs and messages
and can be specified using the `setName()` method.

* <strong>Parallelism</strong>: The parallelism for the iteration. It can be set using the `setParallelism()` method.

* <strong>Solution set in unmanaged memory</strong>: Defines whether the solution set is kept in managed memory (Flink's internal way of keeping objects in serialized form) or as a simple object map. By default, the solution set runs in managed memory. This property can be set using the `setSolutionSetUnmanagedMemory()` method.

* <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines
all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the userdefined `ComputeFunction`.

* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/batch/index.html#broadcastvariables) to the `ComputeFunction`, using the `addBroadcastSet()` method.

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}

Graph<Long, Double, Double> graph = ...

// configure the iteration
VertexCentricConfiguration parameters = new VertexCentricConfiguration();

// set the iteration name
parameters.setName("Gelly Iteration");

// set the parallelism
parameters.setParallelism(16);

// register an aggregator
parameters.registerAggregator("sumAggregator", new LongSumAggregator());

// run the vertexcentric iteration, also passing the configuration parameters
Graph<Long, Long, Double> result =
 graph.runVertexCentricIteration(
 new Compute(), null, maxIterations, parameters);

// userdefined function
public static final class Compute extends ComputeFunction {

 LongSumAggregator aggregator = new LongSumAggregator();

 public void preSuperstep() {

 // retrieve the Aggregator
 aggregator = getIterationAggregator("sumAggregator");
 }


 public void compute(Vertex<Long, Long> vertex, MessageIterator inMessages) {

 //do some computation
 Long partialValue = ...

 // aggregate the partial value
 aggregator.aggregate(partialValue);

 // update the vertex value
 setNewVertexValue(...);
 }
}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}

val graph: Graph[Long, Long, Double] = ...

val parameters = new VertexCentricConfiguration

// set the iteration name
parameters.setName("Gelly Iteration")

// set the parallelism
parameters.setParallelism(16)

// register an aggregator
parameters.registerAggregator("sumAggregator", new LongSumAggregator)

// run the vertexcentric iteration, also passing the configuration parameters
val result = graph.runVertexCentricIteration(new Compute, new Combiner, maxIterations, parameters)

// userdefined function
final class Compute extends ComputeFunction {

 var aggregator = new LongSumAggregator

 override def preSuperstep {

 // retrieve the Aggregator
 aggregator = getIterationAggregator("sumAggregator")
 }


 override def compute(vertex: Vertex[Long, Long], inMessages: MessageIterator[Long]) {

 //do some computation
 val partialValue = ...

 // aggregate the partial value
 aggregator.aggregate(partialValue)

 // update the vertex value
 setNewVertexValue(...)
 }
}

{% endhighlight %}
</div>
</div>

{% top %}

## ScatterGather Iterations
The scattergather model, also known as "signal/collect" model, expresses computation from the perspective of a vertex in the graph. The computation proceeds in synchronized iteration steps, called supersteps. In each superstep, a vertex produces messages for other vertices and updates its value based on the messages it receives. To use scattergather iterations in Gelly, the user only needs to define how a vertex behaves in each superstep:

* <strong>Scatter</strong>: produces the messages that a vertex will send to other vertices.
* <strong>Gather</strong>: updates the vertex value using received messages.

Gelly provides methods for scattergather iterations. The user only needs to implement two functions, corresponding to the scatter and gather phases. The first function is a `ScatterFunction`, which allows a vertex to send out messages to other vertices. Messages are received during the same superstep as they are sent. The second function is `GatherFunction`, which defines how a vertex will update its value based on the received messages.
These functions and the maximum number of iterations to run are given as parameters to Gelly's `runScatterGatherIteration`. This method will execute the scattergather iteration on the input Graph and return a new Graph, with updated vertex values.

A scattergather iteration can be extended with information such as the total number of vertices, the in degree and out degree.
Additionally, the neighborhood type (in/out/all) over which to run the scattergather iteration can be specified. By default, the updates from the inneighbors are used to modify the current vertex's state and messages are sent to outneighbors.

Let us consider computing SingleSourceShortestPaths with scattergather iterations on the following graph and let vertex 1 be the source. In each superstep, each vertex sends a candidate distance message to all its neighbors. The message value is the sum of the current value of the vertex and the edge weight connecting this vertex with its neighbor. Upon receiving candidate distance messages, each vertex calculates the minimum distance and, if a shorter path has been discovered, it updates its value. If a vertex does not change its value during a superstep, then it does not produce messages for its neighbors for the next superstep. The algorithm converges when there are no value updates.

<p class="textcenter">
 <img alt="Scattergather SSSP superstep 1" width="70%" src="fig/gellyvcsssp1.png"/>
</p>

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}
// read the input graph
Graph<Long, Double, Double> graph = ...

// define the maximum number of iterations
int maxIterations = 10;

// Execute the scattergather iteration
Graph<Long, Double, Double> result = graph.runScatterGatherIteration(
 new MinDistanceMessenger(), new VertexDistanceUpdater(), maxIterations);

// Extract the vertices as the result
DataSet<Vertex<Long, Double>> singleSourceShortestPaths = result.getVertices();


//    UDFs    //

// scatter: messaging
public static final class MinDistanceMessenger extends ScatterFunction<Long, Double, Double, Double> {

 public void sendMessages(Vertex<Long, Double> vertex) {
 for (Edge<Long, Double> edge : getEdges()) {
 sendMessageTo(edge.getTarget(), vertex.getValue() + edge.getValue());
 }
 }
}

// gather: vertex update
public static final class VertexDistanceUpdater extends GatherFunction<Long, Double, Double> {

 public void updateVertex(Vertex<Long, Double> vertex, MessageIterator<Double> inMessages) {
 Double minDistance = Double.MAX_VALUE;

 for (double msg : inMessages) {
 if (msg < minDistance) {
 minDistance = msg;
 }
 }

 if (vertex.getValue() > minDistance) {
 setNewVertexValue(minDistance);
 }
 }
}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}
// read the input graph
val graph: Graph[Long, Double, Double] = ...

// define the maximum number of iterations
val maxIterations = 10

// Execute the scattergather iteration
val result = graph.runScatterGatherIteration(new MinDistanceMessenger, new VertexDistanceUpdater, maxIterations)

// Extract the vertices as the result
val singleSourceShortestPaths = result.getVertices


//    UDFs    //

// messaging
final class MinDistanceMessenger extends ScatterFunction[Long, Double, Double, Double] {

 override def sendMessages(vertex: Vertex[Long, Double]) = {
 for (edge: Edge[Long, Double] < getEdges) {
 sendMessageTo(edge.getTarget, vertex.getValue + edge.getValue)
 }
 }
}

// vertex update
final class VertexDistanceUpdater extends GatherFunction[Long, Double, Double] {

 override def updateVertex(vertex: Vertex[Long, Double], inMessages: MessageIterator[Double]) = {
 var minDistance = Double.MaxValue

 while (inMessages.hasNext) {
 val msg = inMessages.next
 if (msg < minDistance) {
 minDistance = msg
 }
 }

 if (vertex.getValue > minDistance) {
 setNewVertexValue(minDistance)
 }
 }
}
{% endhighlight %}
</div>
</div>

{% top %}

## Configuring a ScatterGather Iteration
A scattergather iteration can be configured using a `ScatterGatherConfiguration` object.
Currently, the following parameters can be specified:

* <strong>Name</strong>: The name for the scattergather iteration. The name is displayed in logs and messages
and can be specified using the `setName()` method.

* <strong>Parallelism</strong>: The parallelism for the iteration. It can be set using the `setParallelism()` method.

* <strong>Solution set in unmanaged memory</strong>: Defines whether the solution set is kept in managed memory (Flink's internal way of keeping objects in serialized form) or as a simple object map. By default, the solution set runs in managed memory. This property can be set using the `setSolutionSetUnmanagedMemory()` method.

* <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines
all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the userdefined `ScatterFunction` and `GatherFunction`.

* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/batch/index.html#broadcastvariables) to the `ScatterFunction` and `GatherFunction`, using the `addBroadcastSetForUpdateFunction()` and `addBroadcastSetForMessagingFunction()` methods, respectively.

* <strong>Number of Vertices</strong>: Accessing the total number of vertices within the iteration. This property can be set using the `setOptNumVertices()` method.
The number of vertices can then be accessed in the vertex update function and in the messaging function using the `getNumberOfVertices()` method. If the option is not set in the configuration, this method will return 1.

* <strong>Degrees</strong>: Accessing the in/out degree for a vertex within an iteration. This property can be set using the `setOptDegrees()` method.
The in/out degrees can then be accessed in the vertex update function and in the messaging function, per vertex using the `getInDegree()` and `getOutDegree()` methods.
If the degrees option is not set in the configuration, these methods will return 1.

* <strong>Messaging Direction</strong>: By default, a vertex sends messages to its outneighbors and updates its value based on messages received from its inneighbors. This configuration option allows users to change the messaging direction to either `EdgeDirection.IN`, `EdgeDirection.OUT`, `EdgeDirection.ALL`. The messaging direction also dictates the update direction which would be `EdgeDirection.OUT`, `EdgeDirection.IN` and `EdgeDirection.ALL`, respectively. This property can be set using the `setDirection()` method.

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}

Graph<Long, Double, Double> graph = ...

// configure the iteration
ScatterGatherConfiguration parameters = new ScatterGatherConfiguration();

// set the iteration name
parameters.setName("Gelly Iteration");

// set the parallelism
parameters.setParallelism(16);

// register an aggregator
parameters.registerAggregator("sumAggregator", new LongSumAggregator());

// run the scattergather iteration, also passing the configuration parameters
Graph<Long, Double, Double> result =
 graph.runScatterGatherIteration(
 new Messenger(), new VertexUpdater(), maxIterations, parameters);

// userdefined functions
public static final class Messenger extends ScatterFunction {...}

public static final class VertexUpdater extends GatherFunction {

 LongSumAggregator aggregator = new LongSumAggregator();

 public void preSuperstep() {

 // retrieve the Aggregator
 aggregator = getIterationAggregator("sumAggregator");
 }


 public void updateVertex(Vertex<Long, Long> vertex, MessageIterator inMessages) {

 //do some computation
 Long partialValue = ...

 // aggregate the partial value
 aggregator.aggregate(partialValue);

 // update the vertex value
 setNewVertexValue(...);
 }
}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}

val graph: Graph[Long, Double, Double] = ...

val parameters = new ScatterGatherConfiguration

// set the iteration name
parameters.setName("Gelly Iteration")

// set the parallelism
parameters.setParallelism(16)

// register an aggregator
parameters.registerAggregator("sumAggregator", new LongSumAggregator)

// run the scattergather iteration, also passing the configuration parameters
val result = graph.runScatterGatherIteration(new Messenger, new VertexUpdater, maxIterations, parameters)

// userdefined functions
final class Messenger extends ScatterFunction {...}

final class VertexUpdater extends GatherFunction {

 var aggregator = new LongSumAggregator

 override def preSuperstep {

 // retrieve the Aggregator
 aggregator = getIterationAggregator("sumAggregator")
 }


 override def updateVertex(vertex: Vertex[Long, Long], inMessages: MessageIterator[Long]) {

 //do some computation
 val partialValue = ...

 // aggregate the partial value
 aggregator.aggregate(partialValue)

 // update the vertex value
 setNewVertexValue(...)
 }
}

{% endhighlight %}
</div>
</div>

The following example illustrates the usage of the degree as well as the number of vertices options.

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}

Graph<Long, Double, Double> graph = ...

// configure the iteration
ScatterGatherConfiguration parameters = new ScatterGatherConfiguration();

// set the number of vertices option to true
parameters.setOptNumVertices(true);

// set the degree option to true
parameters.setOptDegrees(true);

// run the scattergather iteration, also passing the configuration parameters
Graph<Long, Double, Double> result =
 graph.runScatterGatherIteration(
 new Messenger(), new VertexUpdater(), maxIterations, parameters);

// userdefined functions
public static final class Messenger extends ScatterFunction {
 ...
 // retrieve the vertex outdegree
 outDegree = getOutDegree();
 ...
}

public static final class VertexUpdater extends GatherFunction {
 ...
 // get the number of vertices
 long numVertices = getNumberOfVertices();
 ...
}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}

val graph: Graph[Long, Double, Double] = ...

// configure the iteration
val parameters = new ScatterGatherConfiguration

// set the number of vertices option to true
parameters.setOptNumVertices(true)

// set the degree option to true
parameters.setOptDegrees(true)

// run the scattergather iteration, also passing the configuration parameters
val result = graph.runScatterGatherIteration(new Messenger, new VertexUpdater, maxIterations, parameters)

// userdefined functions
final class Messenger extends ScatterFunction {
 ...
 // retrieve the vertex outdegree
 val outDegree = getOutDegree
 ...
}

final class VertexUpdater extends GatherFunction {
 ...
 // get the number of vertices
 val numVertices = getNumberOfVertices
 ...
}

{% endhighlight %}
</div>
</div>

The following example illustrates the usage of the edge direction option. Vertices update their values to contain a list of all their inneighbors.

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}
Graph<Long, HashSet<Long>, Double> graph = ...

// configure the iteration
ScatterGatherConfiguration parameters = new ScatterGatherConfiguration();

// set the messaging direction
parameters.setDirection(EdgeDirection.IN);

// run the scattergather iteration, also passing the configuration parameters
DataSet<Vertex<Long, HashSet<Long>>> result =
 graph.runScatterGatherIteration(
 new Messenger(), new VertexUpdater(), maxIterations, parameters)
 .getVertices();

// userdefined functions
public static final class Messenger extends GatherFunction {...}

public static final class VertexUpdater extends ScatterFunction {...}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}
val graph: Graph[Long, HashSet[Long], Double] = ...

// configure the iteration
val parameters = new ScatterGatherConfiguration

// set the messaging direction
parameters.setDirection(EdgeDirection.IN)

// run the scattergather iteration, also passing the configuration parameters
val result = graph.runScatterGatherIteration(new Messenger, new VertexUpdater, maxIterations, parameters)
 .getVertices

// userdefined functions
final class Messenger extends ScatterFunction {...}

final class VertexUpdater extends GatherFunction {...}

{% endhighlight %}
</div>
</div>

{% top %}

## GatherSumApply Iterations
Like in the scattergather model, GatherSumApply also proceeds in synchronized iterative steps, called supersteps. Each superstep consists of the following three phases:

* <strong>Gather</strong>: a userdefined function is invoked in parallel on the edges and neighbors of each vertex, producing a partial value.
* <strong>Sum</strong>: the partial values produced in the Gather phase are aggregated to a single value, using a userdefined reducer.
* <strong>Apply</strong>: each vertex value is updated by applying a function on the current value and the aggregated value produced by the Sum phase.

Let us consider computing SingleSourceShortestPaths with GSA on the following graph and let vertex 1 be the source. During the `Gather` phase, we calculate the new candidate distances, by adding each vertex value with the edge weight. In `Sum`, the candidate distances are grouped by vertex ID and the minimum distance is chosen. In `Apply`, the newly calculated distance is compared to the current vertex value and the minimum of the two is assigned as the new value of the vertex.

<p class="textcenter">
 <img alt="GSA SSSP superstep 1" width="70%" src="fig/gellygsasssp1.png"/>
</p>

Notice that, if a vertex does not change its value during a superstep, it will not calculate candidate distance during the next superstep. The algorithm converges when no vertex changes value.

To implement this example in Gelly GSA, the user only needs to call the `runGatherSumApplyIteration` method on the input graph and provide the `GatherFunction`, `SumFunction` and `ApplyFunction` UDFs. Iteration synchronization, grouping, value updates and convergence are handled by the system:

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}
// read the input graph
Graph<Long, Double, Double> graph = ...

// define the maximum number of iterations
int maxIterations = 10;

// Execute the GSA iteration
Graph<Long, Double, Double> result = graph.runGatherSumApplyIteration(
 new CalculateDistances(), new ChooseMinDistance(), new UpdateDistance(), maxIterations);

// Extract the vertices as the result
DataSet<Vertex<Long, Double>> singleSourceShortestPaths = result.getVertices();


//    UDFs    //

// Gather
private static final class CalculateDistances extends GatherFunction<Double, Double, Double> {

 public Double gather(Neighbor<Double, Double> neighbor) {
 return neighbor.getNeighborValue() + neighbor.getEdgeValue();
 }
}

// Sum
private static final class ChooseMinDistance extends SumFunction<Double, Double, Double> {

 public Double sum(Double newValue, Double currentValue) {
 return Math.min(newValue, currentValue);
 }
}

// Apply
private static final class UpdateDistance extends ApplyFunction<Long, Double, Double> {

 public void apply(Double newDistance, Double oldDistance) {
 if (newDistance < oldDistance) {
 setResult(newDistance);
 }
 }
}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}
// read the input graph
val graph: Graph[Long, Double, Double] = ...

// define the maximum number of iterations
val maxIterations = 10

// Execute the GSA iteration
val result = graph.runGatherSumApplyIteration(new CalculateDistances, new ChooseMinDistance, new UpdateDistance, maxIterations)

// Extract the vertices as the result
val singleSourceShortestPaths = result.getVertices


//    UDFs    //

// Gather
final class CalculateDistances extends GatherFunction[Double, Double, Double] {

 override def gather(neighbor: Neighbor[Double, Double]): Double = {
 neighbor.getNeighborValue + neighbor.getEdgeValue
 }
}

// Sum
final class ChooseMinDistance extends SumFunction[Double, Double, Double] {

 override def sum(newValue: Double, currentValue: Double): Double = {
 Math.min(newValue, currentValue)
 }
}

// Apply
final class UpdateDistance extends ApplyFunction[Long, Double, Double] {

 override def apply(newDistance: Double, oldDistance: Double) = {
 if (newDistance < oldDistance) {
 setResult(newDistance)
 }
 }
}

{% endhighlight %}
</div>
</div>

Note that `gather` takes a `Neighbor` type as an argument. This is a convenience type which simply wraps a vertex with its neighboring edge.

For more examples of how to implement algorithms with the GatherSumApply model, check the {% gh_link /flinklibraries/flinkgelly/src/main/java/org/apache/flink/graph/library/GSAPageRank.java "GSAPageRank" %} and {% gh_link /flinklibraries/flinkgelly/src/main/java/org/apache/flink/graph/library/GSAConnectedComponents.java "GSAConnectedComponents" %} library methods of Gelly.

{% top %}

## Configuring a GatherSumApply Iteration
A GSA iteration can be configured using a `GSAConfiguration` object.
Currently, the following parameters can be specified:

* <strong>Name</strong>: The name for the GSA iteration. The name is displayed in logs and messages and can be specified using the `setName()` method.

* <strong>Parallelism</strong>: The parallelism for the iteration. It can be set using the `setParallelism()` method.

* <strong>Solution set in unmanaged memory</strong>: Defines whether the solution set is kept in managed memory (Flink's internal way of keeping objects in serialized form) or as a simple object map. By default, the solution set runs in managed memory. This property can be set using the `setSolutionSetUnmanagedMemory()` method.

* <strong>Aggregators</strong>: Iteration aggregators can be registered using the `registerAggregator()` method. An iteration aggregator combines all aggregates globally once per superstep and makes them available in the next superstep. Registered aggregators can be accessed inside the userdefined `GatherFunction`, `SumFunction` and `ApplyFunction`.

* <strong>Broadcast Variables</strong>: DataSets can be added as [Broadcast Variables]({{site.baseurl}}/apis/index.html#broadcastvariables) to the `GatherFunction`, `SumFunction` and `ApplyFunction`, using the methods `addBroadcastSetForGatherFunction()`, `addBroadcastSetForSumFunction()` and `addBroadcastSetForApplyFunction` methods, respectively.

* <strong>Number of Vertices</strong>: Accessing the total number of vertices within the iteration. This property can be set using the `setOptNumVertices()` method.
The number of vertices can then be accessed in the gather, sum and/or apply functions by using the `getNumberOfVertices()` method. If the option is not set in the configuration, this method will return 1.

* <strong>Neighbor Direction</strong>: By default values are gathered from the out neighbors of the Vertex. This can be modified
using the `setDirection()` method.

The following example illustrates the usage of the number of vertices option.

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}

Graph<Long, Double, Double> graph = ...

// configure the iteration
GSAConfiguration parameters = new GSAConfiguration();

// set the number of vertices option to true
parameters.setOptNumVertices(true);

// run the gathersumapply iteration, also passing the configuration parameters
Graph<Long, Long, Long> result = graph.runGatherSumApplyIteration(
 new Gather(), new Sum(), new Apply(),
 maxIterations, parameters);

// userdefined functions
public static final class Gather {
 ...
 // get the number of vertices
 long numVertices = getNumberOfVertices();
 ...
}

public static final class Sum {
 ...
 // get the number of vertices
 long numVertices = getNumberOfVertices();
 ...
}

public static final class Apply {
 ...
 // get the number of vertices
 long numVertices = getNumberOfVertices();
 ...
}

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}

val graph: Graph[Long, Double, Double] = ...

// configure the iteration
val parameters = new GSAConfiguration

// set the number of vertices option to true
parameters.setOptNumVertices(true)

// run the gathersumapply iteration, also passing the configuration parameters
val result = graph.runGatherSumApplyIteration(new Gather, new Sum, new Apply, maxIterations, parameters)

// userdefined functions
final class Gather {
 ...
 // get the number of vertices
 val numVertices = getNumberOfVertices
 ...
}

final class Sum {
 ...
 // get the number of vertices
 val numVertices = getNumberOfVertices
 ...
}

final class Apply {
 ...
 // get the number of vertices
 val numVertices = getNumberOfVertices
 ...
}

{% endhighlight %}
</div>
</div>

The following example illustrates the usage of the edge direction option.
<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}

Graph<Long, HashSet<Long>, Double> graph = ...

// configure the iteration
GSAConfiguration parameters = new GSAConfiguration();

// set the messaging direction
parameters.setDirection(EdgeDirection.IN);

// run the gathersumapply iteration, also passing the configuration parameters
DataSet<Vertex<Long, HashSet<Long>>> result =
 graph.runGatherSumApplyIteration(
 new Gather(), new Sum(), new Apply(), maxIterations, parameters)
 .getVertices();
{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}

val graph: Graph[Long, HashSet[Long], Double] = ...

// configure the iteration
val parameters = new GSAConfiguration

// set the messaging direction
parameters.setDirection(EdgeDirection.IN)

// run the gathersumapply iteration, also passing the configuration parameters
val result = graph.runGatherSumApplyIteration(new Gather, new Sum, new Apply, maxIterations, parameters)
 .getVertices()
{% endhighlight %}
</div>
</div>
{% top %}

## Iteration Abstractions Comparison
Although the three iteration abstractions in Gelly seem quite similar, understanding their differences can lead to more performant and maintainable programs.
Among the three, the vertexcentric model is the most general model and supports arbitrary computation and messaging for each vertex. In the scattergather model, the logic of producing messages is decoupled from the logic of updating vertex values. Thus, programs written using scattergather are sometimes easier to follow and maintain.
Separating the messaging phase from the vertex value update logic not only makes some programs easier to follow but might also have a positive impact on performance. Scattergather implementations typically have lower memory requirements, because concurrent access to the inbox (messages received) and outbox (messages to send) data structures is not required. However, this characteristic also limits expressiveness and makes some computation patterns nonintuitive. Naturally, if an algorithm requires a vertex to concurrently access its inbox and outbox, then the expression of this algorithm in scattergather might be problematic. Strongly Connected Components and Approximate Maximum
Weight Matching are examples of such graph algorithms. A direct consequence of this restriction is that vertices cannot generate messages and update their states in the same phase. Thus, deciding whether to propagate a message based on its content would require storing it in the vertex value, so that the gather phase has access to it, in the following iteration step. Similarly, if the vertex update logic includes computation over the values of the neighboring edges, these have to be included inside a special message passed from the scatter to the gather phase. Such workarounds often lead to higher memory requirements and nonelegant, hard to understand algorithm implementations.

Gathersumapply iterations are also quite similar to scattergather iterations. In fact, any algorithm which can be expressed as a GSA iteration can also be written in the scattergather model. The messaging phase of the scattergather model is equivalent to the Gather and Sum steps of GSA: Gather can be seen as the phase where the messages are produced and Sum as the phase where they are routed to the target vertex. Similarly, the value update phase corresponds to the Apply step.

The main difference between the two implementations is that the Gather phase of GSA parallelizes the computation over the edges, while the messaging phase distributes the computation over the vertices. Using the SSSP examples above, we see that in the first superstep of the scattergather case, vertices 1, 2 and 3 produce messages in parallel. Vertex 1 produces 3 messages, while vertices 2 and 3 produce one message each. In the GSA case on the other hand, the computation is parallelized over the edges: the three candidate distance values of vertex 1 are produced in parallel. Thus, if the Gather step contains "heavy" computation, it might be a better idea to use GSA and spread out the computation, instead of burdening a single vertex. Another case when parallelizing over the edges might prove to be more efficient is when the input graph is skewed (some vertices have a lot more neighbors than others).

Another difference between the two implementations is that the scattergather implementation uses a `coGroup` operator internally, while GSA uses a `reduce`. Therefore, if the function that combines neighbor values (messages) requires the whole group of values for the computation, scattergather should be used. If the update function is associative and commutative, then the GSA's reducer is expected to give a more efficient implementation, as it can make use of a combiner.

Another thing to note is that GSA works strictly on neighborhoods, while in the vertexcentric and scattergather models, a vertex can send a message to any vertex, given that it knows its vertex ID, regardless of whether it is a neighbor. Finally, in Gelly's scattergather implementation, one can choose the messaging direction, i.e. the direction in which updates propagate. GSA does not support this yet, so each vertex will be updated based on the values of its inneighbors only.

The main differences among the Gelly iteration models are shown in the table below.


<table class="table tablebordered">
 <thead>
 <tr>
 <th class="textleft" style="width: 25%">Iteration Model</th>
 <th class="textcenter">Update Function</th>
 <th class="textcenter">Update Logic</th>
 <th class="textcenter">Communication Scope</th>
 <th class="textcenter">Communication Logic</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td>VertexCentric</td>
 <td>arbitrary</td>
 <td>arbitrary</td>
 <td>any vertex</td>
 <td>arbitrary</td>
</tr>
<tr>
 <td>ScatterGather</td>
 <td>arbitrary</td>
 <td>based on received messages</td>
 <td>any vertex</td>
 <td>based on vertex state</td>
</tr>
<tr>
 <td>GatherSumApply</td>
 <td>associative and commutative</td>
 <td>based on neighbors' values</td>
 <td>neighborhood</td>
 <td>based on vertex state</td>
</tr>
</tbody>
</table>

{% top %}
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/gelly/library_methods.md

diff git a/docs/apis/batch/libs/gelly/library_methods.md b/docs/apis/batch/libs/gelly/library_methods.md
deleted file mode 100644
index e49a793..0000000
 a/docs/apis/batch/libs/gelly/library_methods.md
+++ /dev/null
@@ 1,350 +0,0 @@

title: Library Methods

# Sub navigation
subnavgroup: batch
subnavparent: gelly
subnavtitle: Library Methods

<!
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
>

Gelly has a growing collection of graph algorithms for easily analyzing largescale Graphs.

* This will be replaced by the TOC
{:toc}

Gelly's library methods can be used by simply calling the `run()` method on the input graph:

<div class="codetabs" markdown="1">
<div datalang="java" markdown="1">
{% highlight java %}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

Graph<Long, Long, NullValue> graph = ...

// run Label Propagation for 30 iterations to detect communities on the input graph
DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run(new LabelPropagation<Long>(30));

// print the result
verticesWithCommunity.print();

{% endhighlight %}
</div>

<div datalang="scala" markdown="1">
{% highlight scala %}
val env = ExecutionEnvironment.getExecutionEnvironment

val graph: Graph[java.lang.Long, java.lang.Long, NullValue] = ...

// run Label Propagation for 30 iterations to detect communities on the input graph
val verticesWithCommunity = graph.run(new LabelPropagation[java.lang.Long, java.lang.Long, NullValue](30))

// print the result
verticesWithCommunity.print

{% endhighlight %}
</div>
</div>

## Community Detection

#### Overview
In graph theory, communities refer to groups of nodes that are well connected internally, but sparsely connected to other groups.
This library method is an implementation of the community detection algorithm described in the paper [Towards realtime community detection in large networks](http://arxiv.org/pdf/0808.2633.pdf%22%3Earticle%20explaining%20the%20algorithm%20in%20detail).

#### Details
The algorithm is implemented using [scattergather iterations](#scattergatheriterations).
Initially, each vertex is assigned a `Tuple2` containing its initial value along with a score equal to 1.0.
In each iteration, vertices send their labels and scores to their neighbors. Upon receiving messages from its neighbors,
a vertex chooses the label with the highest score and subsequently rescores it using the edge values,
a userdefined hop attenuation parameter, `delta`, and the superstep number.
The algorithm converges when vertices no longer update their value or when the maximum number of iterations
is reached.

#### Usage
The algorithm takes as input a `Graph` with any vertex type, `Long` vertex values, and `Double` edge values. It returns a `Graph` of the same type as the input,
where the vertex values correspond to the community labels, i.e. two vertices belong to the same community if they have the same vertex value.
The constructor takes two parameters:

* `maxIterations`: the maximum number of iterations to run.
* `delta`: the hop attenuation parameter, with default value 0.5.

## Label Propagation

#### Overview
This is an implementation of the wellknown Label Propagation algorithm described in [this paper](http://journals.aps.org/pre/abstract/10.1103/PhysRevE.76.036106). The algorithm discovers communities in a graph, by iteratively propagating labels between neighbors. Unlike the [Community Detection library method](#communitydetection), this implementation does not use scores associated with the labels.

#### Details
The algorithm is implemented using [scattergather iterations](#scattergatheriterations).
Labels are expected to be of type `Comparable` and are initialized using the vertex values of the input `Graph`.
The algorithm iteratively refines discovered communities by propagating labels. In each iteration, a vertex adopts
the label that is most frequent among its neighbors' labels. In case of a tie (i.e. two or more labels appear with the
same frequency), the algorithm picks the greater label. The algorithm converges when no vertex changes its value or
the maximum number of iterations has been reached. Note that different initializations might lead to different results.

#### Usage
The algorithm takes as input a `Graph` with a `Comparable` vertex type, a `Comparable` vertex value type and an arbitrary edge value type.
It returns a `DataSet` of vertices, where the vertex value corresponds to the community in which this vertex belongs after convergence.
The constructor takes one parameter:

* `maxIterations`: the maximum number of iterations to run.

## Connected Components

#### Overview
This is an implementation of the Weakly Connected Components algorithm. Upon convergence, two vertices belong to the
same component, if there is a path from one to the other, without taking edge direction into account.

#### Details
The algorithm is implemented using [scattergather iterations](#scattergatheriterations).
This implementation uses a comparable vertex value as initial component identifier (ID). Vertices propagate their
current value in each iteration. Upon receiving component IDs from its neighbors, a vertex adopts a new component ID if
its value is lower than its current component ID. The algorithm converges when vertices no longer update their component
ID value or when the maximum number of iterations has been reached.

#### Usage
The result is a `DataSet` of vertices, where the vertex value corresponds to the assigned component.
The constructor takes one parameter:

* `maxIterations`: the maximum number of iterations to run.

## GSA Connected Components

#### Overview
This is an implementation of the Weakly Connected Components algorithm. Upon convergence, two vertices belong to the
same component, if there is a path from one to the other, without taking edge direction into account.

#### Details
The algorithm is implemented using [gathersumapply iterations](#gathersumapplyiterations).
This implementation uses a comparable vertex value as initial component identifier (ID). In the gather phase, each
vertex collects the vertex value of their adjacent vertices. In the sum phase, the minimum among those values is
selected. In the apply phase, the algorithm sets the minimum value as the new vertex value if it is smaller than
the current value. The algorithm converges when vertices no longer update their component ID value or when the
maximum number of iterations has been reached.

#### Usage
The result is a `DataSet` of vertices, where the vertex value corresponds to the assigned component.
The constructor takes one parameter:

* `maxIterations`: the maximum number of iterations to run.

## PageRank

#### Overview
An implementation of a simple [PageRank algorithm](https://en.wikipedia.org/wiki/PageRank), using [scattergather iterations](#scattergatheriterations).
PageRank is an algorithm that was first used to rank web search engine results. Today, the algorithm and many variations, are used in various graph application domains. The idea of PageRank is that important or relevant pages tend to link to other important pages.

#### Details
The algorithm operates in iterations, where pages distribute their scores to their neighbors (pages they have links to) and subsequently update their scores based on the partial values they receive. The implementation assumes that each page has at least one incoming and one outgoing link.
In order to consider the importance of a link from one page to another, scores are divided by the total number of outlinks of the source page. Thus, a page with 10 links will distribute 1/10 of its score to each neighbor, while a page with 100 links, will distribute 1/100 of its score to each neighboring page. This process computes what is often called the transition probablities, i.e. the probability that some page will lead to other page while surfing the web. To correctly compute the transition probabilities, this implementation expects the edge values to be initialised to 1.0.

#### Usage
The algorithm takes as input a `Graph` with any vertex type, `Double` vertex values, and `Double` edge values. Edges values should be initialized to 1.0, in order to correctly compute the transition probabilities. Otherwise, the transition probability for an Edge `(u, v)` will be set to the edge value divided by `u`'s outdegree. The algorithm returns a `DataSet` of vertices, where the vertex value corresponds to assigned rank after convergence (or maximum iterations).
The constructors take the following parameters:

* `beta`: the damping factor.
* `maxIterations`: the maximum number of iterations to run.

## GSA PageRank

The algorithm is implemented using [gathersumapply iterations](#gathersumapplyiterations).

See the [PageRank](#pagerank) library method for implementation details and usage information.

## Single Source Shortest Paths

#### Overview
An implementation of the SingleSourceShortestPaths algorithm for weighted graphs. Given a source vertex, the algorithm computes the shortest paths from this source to all other nodes in the graph.

#### Details
The algorithm is implemented using [scattergather iterations](#scattergatheriterations).
In each iteration, a vertex sends to its neighbors a message containing the sum its current distance and the edge weight connecting this vertex with the neighbor. Upon receiving candidate distance messages, a vertex calculates the minimum distance and, if a shorter path has been discovered, it updates its value. If a vertex does not change its value during a superstep, then it does not produce messages for its neighbors for the next superstep. The computation terminates after the specified maximum number of supersteps or when there are no value updates.

#### Usage
The algorithm takes as input a `Graph` with any vertex type, `Double` vertex values, and `Double` edge values. The output is a `DataSet` of vertices where the vertex values
correspond to the minimum distances from the given source vertex.
The constructor takes two parameters:

* `srcVertexId` The vertex ID of the source vertex.
* `maxIterations`: the maximum number of iterations to run.

## GSA Single Source Shortest Paths

The algorithm is implemented using [gathersumapply iterations](#gathersumapplyiterations).

See the [Single Source Shortest Paths](#singlesourceshortestpaths) library method for implementation details and usage information.

## Triangle Count

#### Overview
An analytic for counting the number of unique triangles in a graph.

#### Details
Counts the triangles generated by [Triangle Listing](#trianglelisting).

#### Usage
The analytic takes an undirected graph as input and returns as a result a `Long` corresponding to the number of triangles
in the graph. The graph ID type must be `Comparable` and `Copyable`.

## Triangle Listing

This algorithm supports object reuse. The graph ID type must be `Comparable` and `Copyable`.

See the [Triangle Enumerator](#triangleenumerator) library method for implementation details.

## Triangle Enumerator

#### Overview
This library method enumerates unique triangles present in the input graph. A triangle consists of three edges that connect three vertices with each other.
This implementation ignores edge directions.

#### Details
The basic triangle enumeration algorithm groups all edges that share a common vertex and builds triads, i.e., triples of vertices
that are connected by two edges. Then, all triads are filtered for which no third edge exists that closes the triangle.
For a group of <i>n</i> edges that share a common vertex, the number of built triads is quadratic <i>((n*(n1))/2)</i>.
Therefore, an optimization of the algorithm is to group edges on the vertex with the smaller output degree to reduce the number of triads.
This implementation extends the basic algorithm by computing output degrees of edge vertices and grouping on edges on the vertex with the smaller degree.

#### Usage
The algorithm takes a directed graph as input and outputs a `DataSet` of `Tuple3`. The Vertex ID type has to be `Comparable`.
Each `Tuple3` corresponds to a triangle, with the fields containing the IDs of the vertices forming the triangle.

## HyperlinkInduced Topic Search

#### Overview
[HyperlinkInduced Topic Search](http://www.cs.cornell.edu/home/kleinber/auth.pdf) (HITS, or "Hubs and Authorities")
computes two interdependent scores for every vertex in a directed graph. Good hubs are those which point to many
good authorities and good authorities are those pointed to by many good hubs.

#### Details
Every vertex is assigned the same initial hub and authority scores. The algorithm then iteratively updates the scores
until termination. During each iteration new hub scores are computed from the authority scores, then new authority
scores are computed from the new hub scores. The scores are then normalized and optionally tested for convergence.

#### Usage
The algorithm takes a directed graph as input and outputs a `DataSet` of `Tuple3` containing the vertex ID, hub score,
and authority score.

## Summarization

#### Overview
The summarization algorithm computes a condensed version of the input graph by grouping vertices and edges based on
their values. In doing so, the algorithm helps to uncover insights about patterns and distributions in the graph.
One possible use case is the visualization of communities where the whole graph is too large and needs to be summarized
based on the community identifier stored at a vertex.

#### Details
In the resulting graph, each vertex represents a group of vertices that share the same value. An edge, that connects a
vertex with itself, represents all edges with the same edge value that connect vertices from the same vertex group. An
edge between different vertices in the output graph represents all edges with the same edge value between members of
different vertex groups in the input graph.

The algorithm is implemented using Flink data operators. First, vertices are grouped by their value and a representative
is chosen from each group. For any edge, the source and target vertex identifiers are replaced with the corresponding
representative and grouped by source, target and edge value. Output vertices and edges are created from their
corresponding groupings.

#### Usage
The algorithm takes a directed, vertex (and possibly edge) attributed graph as input and outputs a new graph where each
vertex represents a group of vertices and each edge represents a group of edges from the input graph. Furthermore, each
vertex and edge in the output graph stores the common group value and the number of represented elements.

## AdamicAdar

#### Overview
AdamicAdar measures the similarity between pairs of vertices as the sum of the inverse logarithm of degree over shared
neighbors. Scores are nonnegative and unbounded. A vertex with higher degree has greater overall influence but is less
influential to each pair of neighbors.

#### Details
The algorithm first annotates each vertex with the inverse of the logarithm of the vertex degree then joins this score
onto edges by source vertex. Grouping on the source vertex, each pair of neighbors is emitted with the vertex score.
Grouping on twopaths, the AdamicAdar score is summed.

See the [Jaccard Index](#jaccardindex) library method for a similar algorithm.

#### Usage
The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs and
the AdamicAdair similarity score. The graph ID type must be `Comparable` and `Copyable`.

* `setLittleParallelism`: override the parallelism of operators processing small amounts of data
* `setMinimumRatio`: filter out AdamicAdar scores less than the given ratio times the average score
* `setMinimumScore`: filter out AdamicAdar scores less than the given minimum

## Jaccard Index

#### Overview
The Jaccard Index measures the similarity between vertex neighborhoods and is computed as the number of shared neighbors
divided by the number of distinct neighbors. Scores range from 0.0 (no shared neighbors) to 1.0 (all neighbors are
shared).

#### Details
Counting shared neighbors for pairs of vertices is equivalent to counting connecting paths of length two. The number of
distinct neighbors is computed by storing the sum of degrees of the vertex pair and subtracting the count of shared
neighbors, which are doublecounted in the sum of degrees.

The algorithm first annotates each edge with the target vertex's degree. Grouping on the source vertex, each pair of
neighbors is emitted with the degree sum. Grouping on twopaths, the shared neighbors are counted.

#### Usage
The algorithm takes a simple, undirected graph as input and outputs a `DataSet` of tuples containing two vertex IDs,
the number of shared neighbors, and the number of distinct neighbors. The result class provides a method to compute the
Jaccard Index score. The graph ID type must be `Comparable` and `Copyable`.

* `setLittleParallelism`: override the parallelism of operators processing small amounts of data
* `setMaximumScore`: filter out Jaccard Index scores greater than or equal to the given maximum fraction
* `setMinimumScore`: filter out Jaccard Index scores less than the given minimum fraction

## Local Clustering Coefficient

#### Overview
The local clustering coefficient measures the connectedness of each vertex's neighborhood. Scores range from 0.0 (no
edges between neighbors) to 1.0 (neighborhood is a clique).

#### Details
An edge between a vertex's neighbors is a triangle. Counting edges between neighbors is equivalent to counting the
number of triangles which include the vertex. The clustering coefficient score is the number of edges between neighbors
divided by the number of potential edges between neighbors.

See the [Triangle Enumeration](#triangleenumeration) library method for a detailed explanation of triangle enumeration.

#### Usage
Directed and undirected variants are provided. The algorithms take a simple graph as input and output a `DataSet` of
tuples containing the vertex ID, vertex degree, and number of triangles containing the vertex. The graph ID type must be
`Comparable` and `Copyable`.

## Global Clustering Coefficient

#### Overview
The global clustering coefficient measures the connectedness of a graph. Scores range from 0.0 (no edges between
neighbors) to 1.0 (complete graph).

#### Details
See the [Local Clustering Coefficient](#localclusteringcoefficient) library method for a detailed explanation of
clustering coefficient.

#### Usage
Directed and undirected variants are provided. The algorithm takes a simple graph as input and outputs a result
containing the total number of triplets and triangles in the graph. The graph ID type must be `Comparable` and
`Copyable`.


{% top %}
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/index.md

diff git a/docs/apis/batch/libs/index.md b/docs/apis/batch/libs/index.md
deleted file mode 100644
index cb32108..0000000
 a/docs/apis/batch/libs/index.md
+++ /dev/null
@@ 1,29 +0,0 @@

title: "Libraries"
subnavgroup: batch
subnavid: libs
subnavpos: 6
subnavtitle: Libraries

<!
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
>

 Graph processing: [Gelly](gelly/index.html)
 Machine Learning: [FlinkML](ml/index.html)
 Relational Queries: [Table and SQL](table.html)
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/als.md

diff git a/docs/apis/batch/libs/ml/als.md b/docs/apis/batch/libs/ml/als.md
deleted file mode 100644
index 5cfa159..0000000
 a/docs/apis/batch/libs/ml/als.md
+++ /dev/null
@@ 1,178 +0,0 @@

mathjax: include
title: Alternating Least Squares

# Sub navigation
subnavgroup: batch
subnavparent: flinkml
subnavtitle: ALS

<!
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
>

* This will be replaced by the TOC
{:toc}

## Description

The alternating least squares (ALS) algorithm factorizes a given matrix $R$ into two factors $U$ and $V$ such that $R \approx U^TV$.
The unknown row dimension is given as a parameter to the algorithm and is called latent factors.
Since matrix factorization can be used in the context of recommendation, the matrices $U$ and $V$ can be called user and item matrix, respectively.
The $i$th column of the user matrix is denoted by $u_i$ and the $i$th column of the item matrix is $v_i$.
The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$.

In order to find the user and item matrix, the following problem is solved:

$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j}  u_{i}^Tv_{j}\right)^2 +
\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$

with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated.
This regularization scheme to avoid overfitting is called weighted$\lambda$regularization.
Details can be found in the work of [Zhou et al.](http://dx.doi.org/10.1007/9783540688808_32).

By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which can be solved directly.
The solution of the modified problem is guaranteed to monotonically decrease the overall cost function.
By applying this step alternately to the matrices $U$ and $V$, we can iteratively improve the matrix factorization.

The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.

## Operations

`ALS` is a `Predictor`.
As such, it supports the `fit` and `predict` operation.

### Fit

ALS is trained on the sparse representation of the rating matrix:

* `fit: DataSet[(Int, Int, Double)] => Unit`

### Predict

ALS predicts for each tuple of row and column index the rating:

* `predict: DataSet[(Int, Int)] => DataSet[(Int, Int, Double)]`

## Parameters

The alternating least squares implementation can be controlled by the following parameters:

 <table class="table tablebordered">
 <thead>
 <tr>
 <th class="textleft" style="width: 20%">Parameters</th>
 <th class="textcenter">Description</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><strong>NumFactors</strong></td>
 <td>
 <p>
 The number of latent factors to use for the underlying model.
 It is equivalent to the dimension of the calculated user and item vectors.
 (Default value: <strong>10</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Lambda</strong></td>
 <td>
 <p>
 Regularization factor. Tune this value in order to avoid overfitting or poor performance due to strong generalization.
 (Default value: <strong>1</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Iterations</strong></td>
 <td>
 <p>
 The maximum number of iterations.
 (Default value: <strong>10</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Blocks</strong></td>
 <td>
 <p>
 The number of blocks into which the user and item matrix are grouped.
 The fewer blocks one uses, the less data is sent redundantly.
 However, bigger blocks entail bigger update messages which have to be stored on the heap.
 If the algorithm fails because of an OutOfMemoryException, then try to increase the number of blocks.
 (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>Seed</strong></td>
 <td>
 <p>
 Random seed used to generate the initial item matrix for the algorithm.
 (Default value: <strong>0</strong>)
 </p>
 </td>
 </tr>
 <tr>
 <td><strong>TemporaryPath</strong></td>
 <td>
 <p>
 Path to a temporary directory into which intermediate results are stored.
 If this value is set, then the algorithm is split into two preprocessing steps, the ALS iteration and a postprocessing step which calculates a last ALS halfstep.
 The preprocessing steps calculate the <code>OutBlockInformation</code> and <code>InBlockInformation</code> for the given rating matrix.
 The results of the individual steps are stored in the specified directory.
 By splitting the algorithm into multiple smaller steps, Flink does not have to split the available memory amongst too many operators.
 This allows the system to process bigger individual messages and improves the overall performance.
 (Default value: <strong>None</strong>)
 </p>
 </td>
 </tr>
 </tbody>
 </table>

## Examples

{% highlight scala %}
// Read input data set from a csv file
val inputDS: DataSet[(Int, Int, Double)] = env.readCsvFile[(Int, Int, Double)](
 pathToTrainingFile)

// Setup the ALS learner
val als = ALS()
.setIterations(10)
.setNumFactors(10)
.setBlocks(100)
.setTemporaryPath("hdfs://tempPath")

// Set the other parameters via a parameter map
val parameters = ParameterMap()
.add(ALS.Lambda, 0.9)
.add(ALS.Seed, 42L)

// Calculate the factorization
als.fit(inputDS, parameters)

// Read the testing data set from a csv file
val testingDS: DataSet[(Int, Int)] = env.readCsvFile[(Int, Int)](pathToData)

// Calculate the ratings according to the matrix factorization
val predictedRatings = als.predict(testingDS)
{% endhighlight %}
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/contribution_guide.md

diff git a/docs/apis/batch/libs/ml/contribution_guide.md b/docs/apis/batch/libs/ml/contribution_guide.md
deleted file mode 100644
index 30df530..0000000
 a/docs/apis/batch/libs/ml/contribution_guide.md
+++ /dev/null
@@ 1,110 +0,0 @@

mathjax: include
title: How to Contribute

# Sub navigation
subnavgroup: batch
subnavparent: flinkml
subnavtitle: How To Contribute

<!
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
>

The Flink community highly appreciates all sorts of contributions to FlinkML.
FlinkML offers people interested in machine learning to work on a highly active open source project which makes scalable ML reality.
The following document describes how to contribute to FlinkML.

* This will be replaced by the TOC
{:toc}

## Getting Started

In order to get started first read Flink's [contribution guide](http://flink.apache.org/howtocontribute.html).
Everything from this guide also applies to FlinkML.

## Pick a Topic

If you are looking for some new ideas you should first look into our [roadmap](https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap), then you should check out the list of [unresolved issues on JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC).
Once you decide to contribute to one of these issues, you should take ownership of it and track your progress with this issue.
That way, the other contributors know the state of the different issues and redundant work is avoided.

If you already know what you want to contribute to FlinkML all the better.
It is still advisable to create a JIRA issue for your idea to tell the Flink community what you want to do, though.

## Testing

New contributions should come with tests to verify the correct behavior of the algorithm.
The tests help to maintain the algorithm's correctness throughout code changes, e.g. refactorings.

We distinguish between unit tests, which are executed during Maven's test phase, and integration tests, which are executed during maven's verify phase.
Maven automatically makes this distinction by using the following naming rules:
All test cases whose class name ends with a suffix fulfilling the regular expression `(ITIntegration)(TestSuiteCase)`, are considered integration tests.
The rest are considered unit tests and should only test behavior which is local to the component under test.

An integration test is a test which requires the full Flink system to be started.
In order to do that properly, all integration test cases have to mix in the trait `FlinkTestBase`.
This trait will set the right `ExecutionEnvironment` so that the test will be executed on a special `FlinkMiniCluster` designated for testing purposes.
Thus, an integration test could look the following:

{% highlight scala %}
class ExampleITSuite extends FlatSpec with FlinkTestBase {
 behavior of "An example algorithm"

 it should "do something" in {
 ...
 }
}
{% endhighlight %}

The test style does not have to be `FlatSpec` but can be any other scalatest `Suite` subclass.
See [ScalaTest testing styles](http://scalatest.org/user_guide/selecting_a_style) for more information.

## Documentation

When contributing new algorithms, it is required to add code comments describing the way the algorithm works and its parameters with which the user can control its behavior.
Additionally, we would like to encourage contributors to add this information to the online documentation.
The online documentation for FlinkML's components can be found in the directory `docs/libs/ml`.

Every new algorithm is described by a single markdown file.
This file should contain at least the following points:

1. What does the algorithm do
2. How does the algorithm work (or reference to description)
3. Parameter description with default values
4. Code snippet showing how the algorithm is used

In order to use latex syntax in the markdown file, you have to include `mathjax: include` in the YAML front matter.

{% highlight java %}

mathjax: include
htmlTitle: FlinkML  Example title
title: <a href="../ml">FlinkML</a>  Example title

{% endhighlight %}

In order to use displayed mathematics, you have to put your latex code in `$$ ... $$`.
For inline mathematics, use `$ ... $`.
Additionally some predefined latex commands are included into the scope of your markdown file.
See `docs/_include/latex_commands.html` for the complete list of predefined latex commands.

## Contributing

Once you have implemented the algorithm with adequate test coverage and added documentation, you are ready to open a pull request.
Details of how to open a pull request can be found [here](http://flink.apache.org/howtocontribute.html#contributingcodedocumentation).
http://gitwipus.apache.org/repos/asf/flink/blob/844c874b/docs/apis/batch/libs/ml/cross_validation.md

diff git a/docs/apis/batch/libs/ml/cross_validation.md b/docs/apis/batch/libs/ml/cross_validation.md
deleted file mode 100644
index 2473317..0000000
 a/docs/apis/batch/libs/ml/cross_validation.md
+++ /dev/null
@@ 1,175 +0,0 @@

mathjax: include
title: Cross Validation

# Sub navigation
subnavgroup: batch
subnavparent: flinkml
subnavtitle: Cross Validation

<!
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
>

* This will be replaced by the TOC
{:toc}

## Description

 A prevalent problem when utilizing machine learning algorithms is *overfitting*, or when an algorithm "memorizes" the training data but does a poor job extrapolating to out of sample cases. A common method for dealing with the overfitting problem is to hold back some subset of data from the original training algorithm and then measure the fit algorithm's performance on this holdout set. This is commonly known as *cross validation*. A model is trained on one subset of data and then *validated* on another set of data.

## Cross Validation Strategies

There are several strategies for holding out data. FlinkML has convenience methods for
 TrainTest Splits
 TrainTestHoldout Splits
 KFold Splits
 MultiRandom Splits

### TrainTest Splits

The simplest method of splitting is the `trainTestSplit`. This split takes a DataSet and a parameter *fraction*. The *fraction* indicates the portion of the DataSet that should be allocated to the training set. This split also takes two additional optional parameters, *precise* and *seed*.

By default, the Split is done by randomly deciding whether or not an observation is assigned to the training DataSet with probability = *fraction*. When *precise* is `true` however, additional steps are taken to ensure the training set is as close as possible to the length of the DataSet $\cdot$ *fraction*.

The method returns a new `TrainTestDataSet` object which has a `.training` attribute containing the training DataSet and a `.testing` attribute containing the testing DataSet.


### TrainTestHoldout Splits

In some cases, algorithms have been known to 'learn' the testing set. To combat this issue, a traintesthold out strategy introduces a secondary holdout set, aptly called the *holdout* set.

Traditionally, training and testing would be done to train an algorithms as normal and then a final test of the algorithm on the holdout set would be done. Ideally, prediction errors/model scores in the holdout set would not be significantly different than those observed in the testing set.

In a traintestholdout strategy we sacrifice the sample size of the initial fitting algorithm for increased confidence that our model is not overfit.

When using `trainTestHoldout` splitter, the *fraction* `Double` is replaced by a *fraction* array of length three. The first element coresponds to the portion to be used for training, second for testing, and third for holdout. The weights of this array are *relative*, e.g. an array `Array(3.0, 2.0, 1.0)` would results in approximately 50% of the observations being in the training set, 33% of the observations in the testing set, and 17% of the observations in holdout set.

### KFold Splits

In a *kfold* strategy, the DataSet is split into *k* equal subsets. Then for each of the *k* subsets, a `TrainTestDataSet` is created where the subset is the `.training` DataSet, and the remaining subsets are the `.testing` set.

For each training set, an algorithm is trained and then is evaluated based on the predictions based on the associated testing set. When an algorithm that has consistent grades (e.g. prediction errors) across held out datasets we can have some confidence that our approach (e.g. choice of algorithm / algorithm parameters / number of iterations) is robust against overfitting.

<a href="https://en.wikipedia.org/wiki/Crossvalidation_(statistics)#kfold_crossvalidation">KFold Cross Validatation</a>

### MultiRandom Splits

The *multirandom* strategy can be thought of as a more general form of the *traintestholdout* strategy. In fact, `.trainTestHoldoutSplit` is a simple wrapper for `multiRandomSplit` which also packages the datasets into a `trainTestHoldoutDataSet` object.

The first major difference, is that `multiRandomSplit` takes an array of fractions of any length. E.g. one can create multiple holdout sets. Alternatively, one could think of `kFoldSplit` as a wrapper for `multiRandomSplit` (which it is), the difference being `kFoldSplit` creates subsets of approximately equal size, where `multiRandomSplit` will create subsets of any size.

The second major difference is that `multiRandomSplit` returns an array of DataSets, equal in size and proportion to the *fraction array* that it was passed as an argument.

## Parameters

The various `Splitter` methods share many parameters.

 <table class="table tablebordered">
 <thead>
 <tr>
 <th class="textleft" style="width: 20%">Parameter</th>
 <th class="textcenter">Type</th>
 <th class="textcenter">Description</th>
 <th class="textright">Used by Method</th>
 </tr>
 </thead>

 <tbody>
 <tr>
 <td><code>input</code></td>
 <td><code>DataSet[Any]</code></td>
 <td>DataSet to be split.</td>
 <td>
 <code>randomSplit</code><br>
 <code>multiRandomSplit</code><br>
 <code>kFoldSplit</code><br>
 <code>trainTestSplit</code><br>
 <code>trainTestHoldoutSplit</code>
 </td>
 </tr>
 <tr>
 <td><code>seed</code></td>
 <td><code>Long</code></td>
 <td>
 <p>
 Used for seeding the random number generator which sorts DataSets into other DataSets.
 </p>
 </td>
 <td>
 <code>randomSplit</code><br>
 <code>multiRandomSplit</code><br>
 <code>kFoldSplit</code><br>
 <code>trainTestSplit</code><br>
 <code>trainTestHoldoutSplit</code>
 </td>
 </tr>
 <tr>
 <td><code>precise</code></td>
 <td><code>Boolean</code></td>
 <td>When true, make additional effort to make DataSets as close to the prescribed proportions as possible.</td>
 <td>
 <code>randomSplit</code><br>
 <code>trainTestSplit</code>
 </td>
 </tr>
 <tr>
 <td><code>fraction</code></td>
 <td><code>Double</code></td>
 <td>The portion of the `input` to assign to the first or <code>.training</code> DataSet. Must be in the range (0,1)</td>
 <td><code>randomSplit</code><br>
 <code>trainTestSplit</code>
 </td>
 </tr>
 <tr>
 <td><code>fracArray</code></td>
 <td><code>Array[Double]</code></td>
 <td>An array that prescribes the proportions of the output datasets (proportions need not sum to 1 or be within the range (0,1))</td>
 <td>
 <code>multiRandomSplit</code><br>
 <code>trainTestHoldoutSplit</code>
 </td>
 </tr>
 <tr>
 <td><code>kFolds</code></td>
 <td><code>Int</code></td>
 <td>The number of subsets to break the <code>input</code> DataSet into.</td>
 <td><code>kFoldSplit</code></td>
 </tr>

 </tbody>
</table>

## Examples

{% highlight scala %}
// An input dataset does not have to be of type LabeledVector
val data: DataSet[LabeledVector] = ...

// A Simple TrainTestSplit
val dataTrainTest: TrainTestDataSet = Splitter.trainTestSplit(data, 0.6, true)

// Create a train test holdout DataSet
val dataTrainTestHO: trainTestHoldoutDataSet = Splitter.trainTestHoldoutSplit(data, Array(6.0, 3.0, 1.0))

// Create an Array of K TrainTestDataSets
val dataKFolded: Array[TrainTestDataSet] = Splitter.kFoldSplit(data, 10)

// create an array of 5 datasets
val dataMultiRandom: Array[DataSet[T]] = Splitter.multiRandomSplit(data, Array(0.5, 0.1, 0.1, 0.1, 0.1))
{% endhighlight %}
\ No newline at end of file
