flink-commits mailing list archives

From rmetz...@apache.org
Subject [23/53] [abbrv] [FLINK-962] Initial import of documentation from website into source code (closes #34)
Date Thu, 26 Jun 2014 09:46:48 GMT
http://git-wip-us.apache.org/repos/asf/incubator-flink/blob/40b94f73/docs/spargel_guide.md
----------------------------------------------------------------------
diff --git a/docs/spargel_guide.md b/docs/spargel_guide.md
new file mode 100644
index 0000000..5766f8b
--- /dev/null
+++ b/docs/spargel_guide.md
@@ -0,0 +1,112 @@
+---
+title: "Spargel Graph Processing API"
+---
+
+Spargel
+=======
+
+Spargel is our [Giraph](http://giraph.apache.org)-like **graph processing** Java API. It supports basic graph computations, which are run as a sequence of [supersteps]({{ site.baseurl }}/docs/0.4/programming_guides/iterations.html#supersteps). Spargel and Giraph both implement the [Bulk Synchronous Parallel (BSP)](https://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel) programming model, proposed by Google's [Pregel](http://googleresearch.blogspot.de/2009/06/large-scale-graph-computing-at-google.html).
+
+The API provides a **vertex-centric** view of graph processing with two basic operations per superstep:
+
+  1. **Send messages** to other vertices, and
+  2. **Receive messages** from other vertices and **update own vertex state**.
+
+This vertex-centric view makes it easy to express a large class of graph problems efficiently. Below, we list the *relevant interfaces* of the **Spargel API** to implement and walk through an **example Spargel program**.
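The two operations above can be caricatured as a tiny, framework-free Java interface (our own illustrative sketch; `VertexProgram` and `Message` are made-up names, not actual Spargel classes):

```java
import java.util.Iterator;
import java.util.List;

// A minimal caricature of the vertex-centric model: in every superstep a
// vertex (1) sends messages to other vertices and (2) consumes the messages
// it received to update its own state.
interface VertexProgram<K, V, M> {

    // Phase 1: emit messages, each addressed to a target vertex ID.
    List<Message<K, M>> sendMessages(K id, V state);

    // Phase 2: fold the received messages into a new vertex state.
    V updateState(K id, V state, Iterator<M> messages);

    // A message is just a (target, payload) pair.
    record Message<K, M>(K target, M payload) {}
}
```

Spargel splits these two phases across two abstract classes, as described next.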
+
+Spargel API
+-----------
+
+The Spargel API is part of the *addons* Maven project. All relevant classes are located in
the *eu.stratosphere.spargel.java* package.
+
+Add the following dependency to your `pom.xml` to use Spargel:
+
+```xml
+<dependency>
+	<groupId>eu.stratosphere</groupId>
+	<artifactId>spargel</artifactId>
+	<version>{{site.current_stable}}</version>
+</dependency>
+```
+
+Extend **VertexUpdateFunction&lt;***VertexKeyType*, *VertexValueType*, *MessageType***&gt;**
to implement your *custom vertex update logic*.
+
+Extend **MessagingFunction&lt;***VertexKeyType*, *VertexValueType*, *MessageType*, *EdgeValueType***&gt;**
to implement your *custom message logic*.
+
+Create a **SpargelIteration** operator to include Spargel in your data flow.
+
+Example: Propagate Minimum Vertex ID in Graph
+---------------------------------------------
+
+The Spargel operator **SpargelIteration** embeds Spargel graph processing in your data flow. As usual, it can be combined with other operators such as *map*, *reduce*, *join*, etc.
+
+{% highlight java %}
+FileDataSource vertices = new FileDataSource(...);
+FileDataSource edges = new FileDataSource(...);
+
+SpargelIteration iteration = new SpargelIteration(new MinMessager(), new MinNeighborUpdater());
+iteration.setVertexInput(vertices);
+iteration.setEdgesInput(edges);
+iteration.setNumberOfIterations(maxIterations);
+
+FileDataSink result = new FileDataSink(...);
+result.setInput(iteration.getOutput());
+
+Plan plan = new Plan(result);
+{% endhighlight %}
+
+Besides the **program logic** of the vertex updates in *MinNeighborUpdater* and the messages in *MinMessager*, you have to specify the initial **vertex** and **edge** input. Every vertex has a **key** and a **value**. In each superstep, it **receives messages** from other vertices and updates its state:
+
+  - **Vertex** input: **(id**: *VertexKeyType*, **value**: *VertexValueType***)**
+  - **Edge** input: **(source**: *VertexKeyType*, **target**: *VertexKeyType*[, **value**:
*EdgeValueType*])
+
+For our example, we set the vertex ID as both *id and value* (initial minimum) and *leave
out the edge values* as we don't need them:
+
+<p class="text-center">
+    <img alt="Spargel Example Input" width="75%" src="{{ site.baseurl }}/docs/0.4/img/spargel_example_input.png"
/>
+</p>
+
+In order to **propagate the minimum vertex ID**, we iterate over all received messages (which contain the neighboring IDs) and update our value if we find a new minimum:
+
+{% highlight java %}
+public class MinNeighborUpdater extends VertexUpdateFunction<IntValue, IntValue, IntValue>
{
+	
+	@Override
+	public void updateVertex(IntValue id, IntValue currentMin, Iterator<IntValue> messages)
{
+		int min = Integer.MAX_VALUE;
+
+		// iterate over all received messages
+		while (messages.hasNext()) {
+			int next = messages.next().getValue();
+			min = next < min ? next : min;
+		}
+
+		// update vertex value, if new minimum
+		if (min < currentMin.getValue()) {
+			setNewVertexValue(new IntValue(min));
+		}
+	}
+}
+{% endhighlight %}
+
+The **messages in each superstep** consist of the **current minimum ID** seen by the vertex:
+
+{% highlight java %}
+public class MinMessager extends MessagingFunction<IntValue, IntValue, IntValue, NullValue>
{
+	
+	@Override
+	public void sendMessages(IntValue id, IntValue currentMin) {
+		// send current minimum to neighbors
+		sendMessageToAllNeighbors(currentMin);
+	}
+}
+{% endhighlight %}
+
+The **API-provided method** `sendMessageToAllNeighbors(MessageType)` sends the message to
all neighboring vertices. It is also possible to address specific vertices with `sendMessageTo(VertexKeyType,
MessageType)`.
+
+If the value of a vertex does not change during a superstep, it will **not send** any messages in that superstep. This allows for incremental updates to the **hot (changing) parts** of the graph, while leaving the **cold (steady) parts** untouched.
+
+The computation **terminates** after a specified *maximum number of supersteps* **or** when the *vertex states stop changing*.
+
+<p class="text-center">
+    <img alt="Spargel Example" width="75%" src="{{ site.baseurl }}/docs/0.4/img/spargel_example.png"
/>
+</p>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-flink/blob/40b94f73/docs/web_client.md
----------------------------------------------------------------------
diff --git a/docs/web_client.md b/docs/web_client.md
new file mode 100644
index 0000000..98cfd69
--- /dev/null
+++ b/docs/web_client.md
@@ -0,0 +1,53 @@
+---
+title:  "Web Client"
+---
+
+Stratosphere provides a web interface to upload jobs, inspect their execution plans, and
execute them. The interface is a great tool to showcase programs, debug execution plans, or
demonstrate the system as a whole.
+
+# Start, Stop, and Configure the Web Interface
+
+Start the web interface by executing:
+
+    ./bin/start-webclient.sh
+
+and stop it by calling:
+
+    ./bin/stop-webclient.sh
+
+The web interface runs on port 8080 by default. To specify a custom port, set the `webclient.port` property in the *./conf/stratosphere.yaml* configuration file. Jobs are submitted to the JobManager specified by `jobmanager.rpc.address` and `jobmanager.rpc.port`. Please consult the [configuration](../setup/config.html#web_frontend "Configuration") page for details and further configuration options.
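For example, a minimal *./conf/stratosphere.yaml* fragment could look like this (the port and host values below are placeholders; adjust them to your setup):

```yaml
# port of the web interface (default: 8080)
webclient.port: 8081

# JobManager that receives the submitted jobs
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
```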
+
+# Use the Web Interface
+
+The web interface provides two views:
+
+1.  The **job view** to upload, preview, and submit Stratosphere programs.
+2.  The **plan view** to analyze the optimized execution plans of Stratosphere programs.
+
+## Job View
+
+The interface initially shows the job view.
+
+You can **upload** a Stratosphere program as a jar file. To **execute** an uploaded program:
+
+* select it from the job list on the left, 
+* enter the program arguments in the *"Arguments"* field (bottom left), and 
+* click on the *"Run Job"* button (bottom right).
+
+If the *"Show optimizer plan"* option is enabled (the default), the *plan view* is displayed next; otherwise the job is directly submitted to the JobManager for execution.
+
+If the jar's manifest file does not specify the program class, you can specify it before the argument list as:
+
+```
+assembler <assemblerClass> <programArgs...>
+```
+
+## Plan View
+
+The plan view shows the optimized execution plan of the submitted program in the upper half
of the page. The bottom part of the page displays detailed information about the currently
selected plan operator including:
+
+* the chosen shipping strategies (local forward, hash partition, range partition, broadcast,
...),
+* the chosen local strategy (sort, hash join, merge join, ...),
+* inferred data properties (partitioning, grouping, sorting), and 
+* used optimizer estimates (data size, I/O and network costs, ...).
+
+To submit the job for execution, click again on the *"Run Job"* button in the bottom right.

http://git-wip-us.apache.org/repos/asf/incubator-flink/blob/40b94f73/docs/yarn_setup.md
----------------------------------------------------------------------
diff --git a/docs/yarn_setup.md b/docs/yarn_setup.md
new file mode 100644
index 0000000..c317e06
--- /dev/null
+++ b/docs/yarn_setup.md
@@ -0,0 +1,188 @@
+---
+title:  "YARN Setup"
+---
+
+# In a Nutshell
+
+Start a YARN session with 4 TaskManagers (each with 4 GB of heap space):
+
+```bash
+wget {{ site.docs_05_yarn_archive }}
+tar xvzf stratosphere-dist-{{ site.docs_05_stable }}-yarn.tar.gz
+cd stratosphere-yarn-{{ site.docs_05_stable }}/
+./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096
+```
+
+# Introducing YARN
+
+Apache [Hadoop YARN](http://hadoop.apache.org/) is a cluster resource management framework. It allows running various distributed applications on top of a cluster. Stratosphere runs on YARN next to other applications. Users do not have to set up or install anything if there already is a YARN setup.
+
+**Requirements**
+
+- Apache Hadoop 2.2
+- HDFS
+
+If you have trouble using the Stratosphere YARN client, have a look at the [FAQ section]({{site.baseurl}}/docs/0.5/general/faq.html).
+
+## Start Stratosphere Session
+
+Follow these instructions to learn how to launch a Stratosphere session within your YARN cluster.
+
+A session will start all required Stratosphere services (JobManager and TaskManagers) so
that you can submit programs to the cluster. Note that you can run multiple programs per session.
+
+### Download Stratosphere for YARN
+
+Download the YARN tgz package on the [download page]({{site.baseurl}}/downloads/#nightly).
It contains the required files.
+
+
+If you want to build the YARN .tgz file from sources, follow the build instructions. Make sure to use the `-Dhadoop.profile=2` profile. You can find the file in `stratosphere-dist/target/stratosphere-dist-{{site.docs_05_stable}}-yarn.tar.gz` (*note: the version might differ for you*).
+
+Extract the package using:
+
+```bash
+tar xvzf stratosphere-dist-{{site.docs_05_stable}}-yarn.tar.gz
+cd stratosphere-yarn-{{site.docs_05_stable}}/
+```
+
+### Start a Session
+
+Use the following command to start a session:
+
+```bash
+./bin/yarn-session.sh
+```
+
+This command will show you the following overview:
+
+```bash
+Usage:
+   Required
+     -n,--container <arg>   Number of Yarn container to allocate (=Number of TaskTrackers)
+   Optional
+     -jm,--jobManagerMemory <arg>    Memory for JobManager Container [in MB]
+     -q,--query                      Display available YARN resources (memory, cores)
+     -qu,--queue <arg>               Specify YARN queue.
+     -tm,--taskManagerMemory <arg>   Memory per TaskManager Container [in MB]
+     -tmc,--taskManagerCores <arg>   Virtual CPU cores per TaskManager
+     -v,--verbose                    Verbose debug mode
+```
+
+Please note that the Client requires the `HADOOP_HOME` (or `YARN_CONF_DIR` or `HADOOP_CONF_DIR`)
environment variable to be set to read the YARN and HDFS configuration.
+
+**Example:** Issue the following command to allocate 10 TaskTrackers, with 8 GB of memory
each:
+
+```bash
+./bin/yarn-session.sh -n 10 -tm 8192
+```
+
+The system will use the configuration in `conf/stratosphere-config.yaml`. Please follow our [configuration guide]({{site.baseurl}}/docs/0.5/setup/config.html) if you want to change something. Stratosphere on YARN will overwrite the following configuration parameters: `jobmanager.rpc.address` (because the JobManager may be allocated to different machines) and `taskmanager.tmp.dirs` (the temporary directories given by YARN are used).
+
+The example invocation starts 11 containers, since there is one additional container for the ApplicationMaster, which also runs the JobManager.
+
+Once Stratosphere is deployed in your YARN cluster, it will show you the connection details of the JobManager.
+
+The client has to remain open to keep the deployment running. We suggest using `screen`, which will start a detachable shell:
+
+1. Open `screen`,
+2. Start Stratosphere on YARN,
+3. Use `CTRL+a`, then press `d` to detach the screen session,
+4. Use `screen -r` to resume the session.
+
+# Submit Job to Stratosphere
+
+Use the following command to submit a Stratosphere program to the YARN cluster:
+
+```bash
+./bin/stratosphere
+```
+
+Please refer to the documentation of the [commandline client]({{site.baseurl}}/docs/0.5/program_execution/cli_client.html).
+
+The command will show you a help menu like this:
+
+```bash
+[...]
+Action "run" compiles and submits a Stratosphere program.
+  "run" action arguments:
+     -a,--arguments <programArgs>   Program arguments
+     -c,--class <classname>         Program class
+     -j,--jarfile <jarfile>         Stratosphere program JAR file
+     -m,--jobmanager <host:port>    Jobmanager to which the program is submitted
+     -w,--wait                      Wait for program to finish
+[...]
+```
+
+Use the *run* action to submit a job to YARN. The client is able to determine the address
of the JobManager. In the rare event of a problem, you can also pass the JobManager address
using the `-m` argument. The JobManager address is visible in the YARN console.
+
+**Example**
+
+```bash
+wget -O apache-license-v2.txt http://www.apache.org/licenses/LICENSE-2.0.txt
+
+./bin/stratosphere run -j ./examples/stratosphere-java-examples-{{site.docs_05_stable}}-WordCount.jar \
+                       -a 1 file://`pwd`/apache-license-v2.txt file://`pwd`/wordcount-result.txt
+```
+
+If you see the following error, make sure that all TaskManagers have started:
+
+```bash
+Exception in thread "main" eu.stratosphere.compiler.CompilerException:
+    Available instances could not be determined from job manager: Connection timed out.
+```
+
+You can check the number of TaskManagers in the JobManager web interface. The address of
this interface is printed in the YARN session console.
+
+If the TaskManagers do not show up after a minute, you should investigate the issue using
the log files.
+
+# Build Stratosphere for a specific Hadoop Version
+
+This section covers building Stratosphere for a specific Hadoop version. Most users do not need to do this manually. The problem is that Stratosphere uses HDFS and YARN, which are both distributed with Apache Hadoop. There exist many different builds of Hadoop (from both the upstream project and the different Hadoop distributions). Typically, errors arise with the RPC services. An error could look like this:
+
+```
+ERROR: The job was not successfully submitted to the nephele job manager:
+    eu.stratosphere.nephele.executiongraph.GraphConversionException: Cannot compute input
splits for TSV:
+    java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException:
+    Protocol message contained an invalid tag (zero).; Host Details :
+```
+
+**Example** (building against a Cloudera Hadoop version):
+
+```
+mvn -Dhadoop.profile=2 -Pcdh-repo -Dhadoop.version=2.2.0-cdh5.0.0-beta-2 -DskipTests package
+```
+
+The commands in detail:
+
+*  `-Dhadoop.profile=2` activates the Hadoop YARN profile of Stratosphere. This will enable all components of Stratosphere that are compatible with Hadoop 2.2.
+*  `-Pcdh-repo` activates the Cloudera Hadoop dependencies. If you want another vendor's Hadoop dependencies (not in Maven central), add the repository to your local Maven configuration in `~/.m2/`.
+* `-Dhadoop.version=2.2.0-cdh5.0.0-beta-2` sets a special version of the Hadoop dependencies.
Make sure that the specified Hadoop version is compatible with the profile you activated.
+
+If you want to build HDFS for Hadoop 2 without YARN, use the following parameter:
+
+```
+-P!include-yarn
+```
+
+Some Cloudera versions (such as `2.0.0-cdh4.2.0`) require this, since they have a new HDFS
version with the old YARN API.
+
+Please post to the [Stratosphere mailing list](https://groups.google.com/d/forum/stratosphere-dev) or create an issue on [GitHub](https://github.com/stratosphere/stratosphere/issues) if you have issues with your YARN setup and Stratosphere.
+
+# Background
+
+This section briefly describes how Stratosphere and YARN interact. 
+
+<img src="{{site.baseurl}}/img/StratosphereOnYarn.svg" class="img-responsive">
+
+The YARN client needs to access the Hadoop configuration to connect to the YARN resource
manager and to HDFS. It determines the Hadoop configuration using the following strategy:
+
+* Test whether `YARN_CONF_DIR`, `HADOOP_CONF_DIR`, or `HADOOP_CONF_PATH` is set (in that order). If one of these variables is set, it is used to read the configuration.
+* If the above strategy fails (this should not be the case in a correct YARN setup), the client uses the `HADOOP_HOME` environment variable. If it is set, the client tries to access `$HADOOP_HOME/etc/hadoop` (Hadoop 2) and `$HADOOP_HOME/conf` (Hadoop 1).
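As a hedged illustration, this lookup order can be sketched as a small Java helper (our own sketch; `HadoopConfLocator` is a made-up name, the real client implements this differently):

```java
import java.io.File;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Mirrors the client's lookup order for the Hadoop configuration directory:
// explicit *_CONF_* variables first, then well-known paths under HADOOP_HOME.
public class HadoopConfLocator {

    public static Optional<String> findConfDir(Map<String, String> env) {
        for (String var : List.of("YARN_CONF_DIR", "HADOOP_CONF_DIR", "HADOOP_CONF_PATH")) {
            String dir = env.get(var);
            if (dir != null && !dir.isEmpty()) {
                return Optional.of(dir); // first set variable wins
            }
        }
        String home = env.get("HADOOP_HOME");
        if (home != null && !home.isEmpty()) {
            File hadoop2 = new File(home, "etc/hadoop"); // Hadoop 2 layout
            if (hadoop2.isDirectory()) {
                return Optional.of(hadoop2.getPath());
            }
            File hadoop1 = new File(home, "conf");       // Hadoop 1 layout
            if (hadoop1.isDirectory()) {
                return Optional.of(hadoop1.getPath());
            }
        }
        return Optional.empty(); // no usable Hadoop configuration found
    }
}
```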
+
+When starting a new Stratosphere YARN session, the client first checks if the requested resources
(containers and memory) are available. After that, it uploads a jar that contains Stratosphere
and the configuration to HDFS (step 1).
+
+The next step of the client is to request (step 2) a YARN container to start the *ApplicationMaster*
(step 3). Since the client registered the configuration and jar-file as a resource for the
container, the NodeManager of YARN running on that particular machine will take care of preparing
the container (e.g. downloading the files). Once that has finished, the *ApplicationMaster*
(AM) is started.
+
+The *JobManager* and AM run in the same container. Once they have started successfully, the AM knows the address of the JobManager (its own host). It generates a new Stratosphere configuration file for the TaskManagers (so that they can connect to the JobManager). This file is also uploaded to HDFS. Additionally, the *AM* container serves Stratosphere's web interface.
+
+After that, the AM starts allocating the containers for Stratosphere's TaskManagers, which download the jar file and the modified configuration from HDFS. Once these steps are completed, Stratosphere is set up and ready to accept jobs.

http://git-wip-us.apache.org/repos/asf/incubator-flink/blob/40b94f73/pom.xml
----------------------------------------------------------------------
diff --git a/pom.xml b/pom.xml
index 96acd1d..4c29132 100644
--- a/pom.xml
+++ b/pom.xml
@@ -368,6 +368,8 @@
 						<exclude>**/*.iml</exclude>
 						<!-- Generated content -->
 						<exclude>**/target/**</exclude>
+						<!-- Documentation -->
+						<exclude>**/docs/**</exclude>
 					</excludes>
 				</configuration>
 			</plugin>

