flink-commits mailing list archives

From u..@apache.org
Subject [04/30] flink git commit: [docs] Change doc layout
Date Wed, 22 Apr 2015 14:17:01 GMT
http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/python_programming_guide.md
----------------------------------------------------------------------
diff --git a/docs/python_programming_guide.md b/docs/python_programming_guide.md
deleted file mode 100644
index 660086b..0000000
--- a/docs/python_programming_guide.md
+++ /dev/null
@@ -1,610 +0,0 @@
----
-title: "Python Programming Guide"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-
-<a href="#top"></a>
-
-Introduction
-------------
-
-Analysis programs in Flink are regular programs that implement transformations on data sets
-(e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain
-sources (e.g., by reading files, or from collections). Results are returned via sinks, which may for
-example write the data to (distributed) files, or to standard output (for example the command line
-terminal). Flink programs run in a variety of contexts: standalone, or embedded in other programs.
-The execution can happen in a local JVM or on clusters of many machines.
-
-In order to create your own Flink program, we encourage you to start with the
-[program skeleton](#program-skeleton) and gradually add your own
-[transformations](#transformations). The remaining sections act as references for additional
-operations and advanced features.
-
-
-Example Program
----------------
-
-The following program is a complete, working example of WordCount. You can copy & paste the code
-to run it locally.
-
-{% highlight python %}
-from flink.plan.Environment import get_environment
-from flink.plan.Constants import INT, STRING
-from flink.functions.GroupReduceFunction import GroupReduceFunction
-
-class Adder(GroupReduceFunction):
-  def reduce(self, iterator, collector):
-    count, word = iterator.next()
-    count += sum([x[0] for x in iterator])
-    collector.collect((count, word))
-
-if __name__ == "__main__":
-  env = get_environment()
-  data = env.from_elements("Who's there?",
-   "I think I hear them. Stand, ho! Who's there?")
-  
-  data \
-    .flat_map(lambda x, c: [(1, word) for word in x.lower().split()], (INT, STRING)) \
-    .group_by(1) \
-    .reduce_group(Adder(), (INT, STRING), combinable=True) \
-    .output()
-  
-  env.execute(local=True)
-{% endhighlight %}
-
-[Back to top](#top)
-
-Program Skeleton
-----------------
-
-As we already saw in the example, Flink programs look like regular python
-programs with a `if __name__ == "__main__":` block. Each program consists of the same basic parts:
-
-1. Obtain an `Environment`,
-2. Load/create the initial data,
-3. Specify transformations on this data,
-4. Specify where to put the results of your computations, and
-5. Execute your program.
-
-We will now give an overview of each of those steps, but please refer to the respective sections
-for more details.
-
-
-The `Environment` is the basis for all Flink programs. You can
-obtain one using these static methods on class `Environment`:
-
-{% highlight python %}
-get_environment()
-{% endhighlight %}
-
-For specifying data sources the execution environment has several methods
-to read from files. To just read a text file as a sequence of lines, you can use:
-
-{% highlight python %}
-env = get_environment()
-text = env.read_text("file:///path/to/file")
-{% endhighlight %}
-
-This will give you a DataSet on which you can then apply transformations. For
-more information on data sources and input formats, please refer to
-[Data Sources](#data-sources).
-
-Once you have a DataSet you can apply transformations to create a new
-DataSet which you can then write to a file, transform again, or
-combine with other DataSets. You apply transformations by calling
-methods on DataSet with your own custom transformation function. For example,
-a map transformation looks like this:
-
-{% highlight python %}
-data.map(lambda x: x*2, INT)
-{% endhighlight %}
-
-This will create a new DataSet by doubling every value in the original DataSet. 
-For more information and a list of all the transformations,
-please refer to [Transformations](#transformations).
-
-Once you have a DataSet that needs to be written to disk you can call one
-of these methods on DataSet:
-
-{% highlight python %}
-data.write_text("<file-path>", WriteMode=Constants.NO_OVERWRITE)
-data.write_csv("<file-path>", WriteMode=Constants.NO_OVERWRITE, line_delimiter='\n', field_delimiter=',')
-data.output()
-{% endhighlight %}
-
-The last method is only useful for developing/debugging on a local machine;
-it will print the contents of the DataSet to standard output. (Note that on
-a cluster, the result goes to the standard out stream of the cluster nodes and ends
-up in the *.out* files of the workers.)
-The first two methods do as their names suggest.
-Please refer to [Data Sinks](#data-sinks) for more information on writing to files.
-
-Once you specified the complete program you need to call `execute` on
-the `Environment`. This will either execute on your local machine or submit your program 
-for execution on a cluster, depending on how Flink was started. You can force
-a local execution by using `execute(local=True)`.
-
-[Back to top](#top)
-
-Project setup
----------------
-
-Apart from setting up Flink, no additional work is required. The Python package can be found in the /resource folder of your Flink distribution. The Flink package, along with the plan and optional packages, is automatically distributed across the cluster via HDFS when running a job.
-
-The Python API was tested on Linux systems that have Python 2.7 or 3.4 installed.
-
-[Back to top](#top)
-
-Lazy Evaluation
----------------
-
-All Flink programs are executed lazily: When the program's main method is executed, the data loading
-and transformations do not happen directly. Rather, each operation is created and added to the
-program's plan. The operations are actually executed when one of the `execute()` methods is invoked
-on the Environment object. Whether the program is executed locally or on a cluster depends
-on the environment of the program.
-
-The lazy evaluation lets you construct sophisticated programs that Flink executes as one
-holistically planned unit.
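-
-For example, the following sketch only builds a plan; nothing is created or computed before the final `execute()` call:
-
-{% highlight python %}
-env = get_environment()
-data = env.from_elements(1, 2, 3)          # nothing is created yet
-doubled = data.map(lambda x: x * 2, INT)   # only added to the plan
-doubled.output()                           # the sink is part of the plan, too
-env.execute(local=True)                    # now the whole plan is executed
-{% endhighlight %}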
-
-[Back to top](#top)
-
-
-Transformations
----------------
-
-Data transformations transform one or more DataSets into a new DataSet. Programs can combine
-multiple transformations into sophisticated assemblies.
-
-This section gives a brief overview of the available transformations. The [transformations
-documentation](dataset_transformations.html) has a full description of all transformations with
-examples.
-
-<br />
-
-<table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Transformation</th>
-      <th class="text-center">Description</th>
-    </tr>
-  </thead>
-
-  <tbody>
-    <tr>
-      <td><strong>Map</strong></td>
-      <td>
-        <p>Takes one element and produces one element.</p>
-{% highlight python %}
-data.map(lambda x: x * 2, INT)
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>FlatMap</strong></td>
-      <td>
-        <p>Takes one element and produces zero, one, or more elements. </p>
-{% highlight python %}
-data.flat_map(
-  lambda x,c: [(1, word) for word in x.lower().split()],
-  (INT, STRING))
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>MapPartition</strong></td>
-      <td>
-        <p>Transforms a parallel partition in a single function call. The function gets the partition
-        as an `Iterator` and can produce an arbitrary number of result values. The number of
-        elements in each partition depends on the degree-of-parallelism and previous operations.</p>
-{% highlight python %}
-data.map_partition(lambda x,c: [value * 2 for value in x], INT)
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>Filter</strong></td>
-      <td>
-        <p>Evaluates a boolean function for each element and retains those for which the function
-        returns true.</p>
-{% highlight python %}
-data.filter(lambda x: x > 1000)
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>Reduce</strong></td>
-      <td>
-        <p>Combines a group of elements into a single element by repeatedly combining two elements
-        into one. Reduce may be applied on a full data set, or on a grouped data set.</p>
-{% highlight python %}
-data.reduce(lambda x, y: x + y)
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>ReduceGroup</strong></td>
-      <td>
-        <p>Combines a group of elements into one or more elements. ReduceGroup may be applied on a
-        full data set, or on a grouped data set.</p>
-{% highlight python %}
-class Adder(GroupReduceFunction):
-  def reduce(self, iterator, collector):
-    count, word = iterator.next()
-    count += sum([x[0] for x in iterator])
-    collector.collect((count, word))
-
-data.reduce_group(Adder(), (INT, STRING))
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>Join</strong></td>
-      <td>
-        <p>Joins two data sets by creating all pairs of elements that are equal on their keys.
-        Optionally uses a JoinFunction to turn the pair of elements into a single element.
-        See <a href="#specifying-keys">keys</a> on how to define join keys.</p>
-{% highlight python %}
-# In this case tuple fields are used as keys. 
-# "0" is the join field on the first tuple
-# "1" is the join field on the second tuple.
-result = input1.join(input2).where(0).equal_to(1)
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>CoGroup</strong></td>
-      <td>
-        <p>The two-dimensional variant of the reduce operation. Groups each input on one or more
-        fields and then joins the groups. The transformation function is called per pair of groups.
-        See <a href="#specifying-keys">keys</a> on how to define coGroup keys.</p>
-{% highlight python %}
-data1.co_group(data2).where(0).equal_to(1)
-{% endhighlight %}
-      </td>
-    </tr>
-
-    <tr>
-      <td><strong>Cross</strong></td>
-      <td>
-        <p>Builds the Cartesian product (cross product) of two inputs, creating all pairs of
-        elements. Optionally uses a CrossFunction to turn the pair of elements into a single
-        element.</p>
-{% highlight python %}
-result = data1.cross(data2)
-{% endhighlight %}
-      </td>
-    </tr>
-    <tr>
-      <td><strong>Union</strong></td>
-      <td>
-        <p>Produces the union of two data sets.</p>
-{% highlight python %}
-data.union(data2)
-{% endhighlight %}
-      </td>
-    </tr>
-  </tbody>
-</table>
-
-[Back to Top](#top)
-
-
-Specifying Keys
--------------
-
-Some transformations (like Join or CoGroup) require that a key is defined on
-their argument DataSets, while other transformations (Reduce, GroupReduce) allow the DataSet to be
-grouped on a key before they are applied.
-
-A DataSet is grouped as follows:
-{% highlight python %}
-reduced = data \
-  .group_by(<define key here>) \
-  .reduce_group(<do something>)
-{% endhighlight %}
-
-The data model of Flink is not based on key-value pairs. Therefore,
-you do not need to physically pack the data set types into keys and
-values. Keys are "virtual": they are defined as functions over the
-actual data to guide the grouping operator.
-
-### Define keys for Tuples
-{:.no_toc}
-
-The simplest case is grouping a data set of Tuples on one or more
-fields of the Tuple:
-{% highlight python %}
-reduced = data \
-  .group_by(0) \
-  .reduce_group(<do something>)
-{% endhighlight %}
-
-The data set is grouped on the first field of the tuples. 
-The group-reduce function will thus receive groups of tuples with
-the same value in the first field.
-
-{% highlight python %}
-grouped = data \
-  .group_by(0,1) \
-  .reduce(<do something>)
-{% endhighlight %}
-
-The data set is grouped on the composite key consisting of the first and the
-second fields; the reduce function will therefore receive groups
-with the same value for both fields.
-
-A note on nested Tuples: If you have a DataSet with a nested tuple,
-specifying `group_by(<index of tuple>)` will cause the system to use the full tuple as a key.
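-
-For example, in the following (illustrative) sketch the key is the entire nested tuple, not its first field:
-
-{% highlight python %}
-data = env.from_elements(((1, 2), "a"), ((1, 2), "b"), ((3, 4), "c"))
-
-# groups on the whole nested tuple: (1, 2) and (3, 4) are the keys
-grouped = data \
-  .group_by(0) \
-  .reduce_group(<do something>)
-{% endhighlight %}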
-
-[Back to top](#top)
-
-
-Passing Functions to Flink
---------------------------
-
-Certain operations require user-defined functions; all of them accept both lambda functions and rich functions as arguments.
-
-{% highlight python %}
-data.filter(lambda x: x > 5)
-{% endhighlight %}
-
-{% highlight python %}
-class Filter(FilterFunction):
-    def filter(self, value):
-        return value > 5
-
-data.filter(Filter())
-{% endhighlight %}
-
-Rich functions allow the use of imported functions, provide access to broadcast variables,
-can be parameterized using `__init__()`, and are the go-to option for complex functions.
-They are also the only way to define an optional `combine` function for a reduce operation.
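-
-As a sketch, a rich function might be parameterized like this (the `AtLeast` class and its `limit` argument are made up for illustration):
-
-{% highlight python %}
-class AtLeast(FilterFunction):
-    def __init__(self, limit):
-        super(AtLeast, self).__init__()
-        self.limit = limit  # illustrative parameter passed at construction time
-
-    def filter(self, value):
-        return value >= self.limit
-
-data.filter(AtLeast(5))
-{% endhighlight %}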
-
-Lambda functions allow the easy insertion of one-liners. Note that a lambda function has to return
-an iterable if the operation can return multiple values (that is, all functions receiving a collector argument).
-
-Flink requires type information at the time when it prepares the program for execution
-(when the main method of the program is called). This is done by passing an exemplary
-object that has the desired type. This also holds for tuples:
-
-{% highlight python %}
-(INT, STRING)
-{% endhighlight %}
-
-This would denote a tuple containing an int and a string. Note that for operations that work strictly on tuples (like cross), no parentheses are required.
-
-There are a few constants defined in flink.plan.Constants that allow expressing these types in a more readable fashion.
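-
-For example, using the constants already imported in the WordCount example above:
-
-{% highlight python %}
-from flink.plan.Constants import INT, STRING
-
-# declares that the map produces (int, string) tuples
-data.map(lambda x: (len(x), x), (INT, STRING))
-{% endhighlight %}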
-
-[Back to top](#top)
-
-Data Types
-----------
-
-Flink's Python API currently only supports primitive python types (int, float, bool, string) and byte arrays.
-
-#### Tuples/Lists
-
-You can use tuples (or lists) for composite types. Python tuples are mapped to the Flink Tuple type, which contains
-a fixed number of fields of various types (up to 25). Every field of a tuple can be a primitive type, or a further tuple, resulting in nested tuples.
-
-{% highlight python %}
-word_counts = env.from_elements(("hello", 1), ("world",2))
-
-counts = word_counts.map(lambda x: x[1], INT)
-{% endhighlight %}
-
-When working with operators that require a Key for grouping or matching records,
-Tuples let you simply specify the positions of the fields to be used as key. You can specify more
-than one position to use composite keys (see [Section Data Transformations](#transformations)).
-
-{% highlight python %}
-wordCounts \
-    .group_by(0) \
-    .reduce(MyReduceFunction())
-{% endhighlight %}
-
-[Back to top](#top)
-
-Data Sources
-------------
-
-Data sources create the initial data sets, such as from files or from collections.
-
-File-based:
-
-- `read_text(path)` - Reads files line-wise and returns them as Strings.
-- `read_csv(path, type)` - Parses files of comma- (or other char) delimited fields.
-  Returns a DataSet of tuples. Supports the basic Java types and their Value counterparts as field
-  types.
-
-Collection-based:
-
-- `from_elements(*args)` - Creates a data set from a sequence of elements. All elements must be of the same type.
-
-**Examples**
-
-{% highlight python %}
-env = get_environment()
-
-# read text file from local file system
-localLines = env.read_text("file:///path/to/my/textfile")
-
-# read text file from an HDFS running at nnHost:nnPort
-hdfsLines = env.read_text("hdfs://nnHost:nnPort/path/to/my/textfile")
-
-# read a CSV file with three fields
-csvInput = env.read_csv("hdfs:///the/CSV/file", (INT, STRING, DOUBLE))
-
-# create a set from some given elements
-values = env.from_elements("Foo", "bar", "foobar", "fubar")
-{% endhighlight %}
-
-[Back to top](#top)
-
-Data Sinks
-----------
-
-Data sinks consume DataSets and are used to store or return them:
-
-- `write_text()` - Writes elements line-wise as Strings. The Strings are
-  obtained by calling the *str()* method of each element.
-- `write_csv(...)` - Writes tuples as comma-separated value files. Row and field
-  delimiters are configurable. The value for each field comes from the *str()* method of the objects.
-- `output()` - Prints the *str()* value of each element on the
-  standard out.
-
-A DataSet can be input to multiple operations. Programs can write or print a data set and at the
-same time run additional transformations on it.
-
-**Examples**
-
-Standard data sink methods:
-
-{% highlight python %}
-# write DataSet to a file on the local file system
-textData.write_text("file:///my/result/on/localFS")
-
-# write DataSet to a file on an HDFS with a namenode running at nnHost:nnPort
-textData.write_text("hdfs://nnHost:nnPort/my/result/on/localFS")
-
-# write DataSet to a file and overwrite the file if it exists
-textData.write_text("file:///my/result/on/localFS", WriteMode.OVERWRITE)
-
-# write tuples as lines with pipe as the separator "a|b|c"
-values.write_csv("file:///path/to/the/result/file", line_delimiter="\n", field_delimiter="|")
-
-# this writes tuples in the text formatting "(a, b, c)", rather than as CSV lines
-values.write_text("file:///path/to/the/result/file")
-{% endhighlight %}
-
-[Back to top](#top)
-
-Broadcast Variables
--------------------
-
-Broadcast variables allow you to make a data set available to all parallel instances of an
-operation, in addition to the regular input of the operation. This is useful for auxiliary data
-sets, or data-dependent parameterization. The data set will then be accessible at the operator as a
-Collection.
-
-- **Broadcast**: broadcast sets are registered by name via `with_broadcast_set(DataSet, String)`
-- **Access**: accessible via `self.context.get_broadcast_variable(String)` at the target operator
-
-{% highlight python %}
-class MapperBcv(MapFunction):
-    def map(self, value):
-        factor = self.context.get_broadcast_variable("bcv")[0][0]
-        return value * factor
-
-# 1. The DataSet to be broadcasted
-toBroadcast = env.from_elements(1, 2, 3) 
-data = env.from_elements("a", "b")
-
-# 2. Broadcast the DataSet
-data.map(MapperBcv(), INT).with_broadcast_set("bcv", toBroadcast) 
-{% endhighlight %}
-
-Make sure that the names (`bcv` in the previous example) match when registering and
-accessing broadcasted data sets.
-
-**Note**: As the content of broadcast variables is kept in-memory on each node, it should not become
-too large. For simpler things like scalar values you can simply parameterize the rich function.
-
-[Back to top](#top)
-
-Parallel Execution
-------------------
-
-This section describes how the parallel execution of programs can be configured in Flink. A Flink
-program consists of multiple tasks (operators, data sources, and sinks). A task is split into
-several parallel instances for execution and each parallel instance processes a subset of the task's
-input data. The number of parallel instances of a task is called its *parallelism* or *degree of
-parallelism (DOP)*.
-
-The degree of parallelism of a task can be specified in Flink on different levels.
-
-### Execution Environment Level
-
-Flink programs are executed in the context of an [execution environment](#program-skeleton). An
-execution environment defines a default parallelism for all operators, data sources, and data sinks
-it executes. The execution environment parallelism can be overridden by explicitly configuring the
-parallelism of an operator.
-
-The default parallelism of an execution environment can be specified by calling the
-`set_degree_of_parallelism()` method. To execute all operators, data sources, and data sinks of the
-[WordCount](#example-program) example program with a parallelism of `3`, set the default parallelism of the
-execution environment as follows:
-
-{% highlight python %}
-env = get_environment()
-env.set_degree_of_parallelism(3)
-
-text.flat_map(lambda x,c: [(1, word) for word in x.lower().split()], (INT, STRING)) \
-    .group_by(1) \
-    .reduce_group(Adder(), (INT, STRING), combinable=True) \
-    .output()
-
-env.execute()
-{% endhighlight %}
-
-### System Level
-
-A system-wide default parallelism for all execution environments can be defined by setting the
-`parallelization.degree.default` property in `./conf/flink-conf.yaml`. See the
-[Configuration](config.html) documentation for details.
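-
-For example, the following line in `conf/flink-conf.yaml` would set a system-wide default parallelism of 10 (the value is illustrative):
-
-{% highlight yaml %}
-parallelization.degree.default: 10
-{% endhighlight %}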
-
-[Back to top](#top)
-
-Executing Plans
----------------
-
-To run the plan with Flink, go to your Flink distribution and run the pyflink.sh script from the /bin folder:
-use pyflink2.sh for Python 2.7 and pyflink3.sh for Python 3.4. The script containing the plan has to be passed
-as the first argument, followed by a number of additional Python packages, and finally, separated by `-`, additional
-arguments that will be fed to the script.
-
-{% highlight bash %}
-./bin/pyflink<2/3>.sh <Script>[ <pathToPackage1>[ <pathToPackageX>]][ - <param1>[ <paramX>]]
-{% endhighlight %}
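-
-For example, the following (illustrative) call runs a plan with Python 2.7, ships one additional package, and passes two arguments to the script:
-
-{% highlight bash %}
-./bin/pyflink2.sh /path/to/plan.py /path/to/extra_package - arg1 arg2
-{% endhighlight %}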
-
-[Back to top](#top)
-
-Debugging
----------------
-
-If you are running Flink programs locally, you can debug your program following this guide.
-First you have to enable debugging by setting the debug switch in the `env.execute(debug=True)` call. After
-submitting your program, open the jobmanager log file and look for a line that says
-`Waiting for external Process : <taskname>. Run python /tmp/flink/executor.py <port>`. Now open `/tmp/flink` in your Python
-IDE and run `executor.py <port>`.
-
-[Back to top](#top)

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/quickstart/java_api_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/quickstart/java_api_quickstart.md b/docs/quickstart/java_api_quickstart.md
new file mode 100644
index 0000000..0485e2a
--- /dev/null
+++ b/docs/quickstart/java_api_quickstart.md
@@ -0,0 +1,151 @@
+---
+title: "Quickstart: Java API"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+Start working on your Flink Java program in a few simple steps.
+
+
+## Requirements
+
+The only requirements are working __Maven 3.0.4__ (or higher) and __Java 6.x__ (or higher) installations.
+
+## Create Project
+
+Use one of the following commands to __create a project__:
+
+<ul class="nav nav-tabs" style="border-bottom: none;">
+    <li class="active"><a href="#quickstart-script" data-toggle="tab">Run the <strong>quickstart script</strong></a></li>
+    <li><a href="#maven-archetype" data-toggle="tab">Use <strong>Maven archetypes</strong></a></li>
+</ul>
+<div class="tab-content">
+    <div class="tab-pane active" id="quickstart-script">
+    {% highlight bash %}
+    $ curl http://flink.apache.org/q/quickstart.sh | bash
+    {% endhighlight %}
+    </div>
+    <div class="tab-pane" id="maven-archetype">
+    {% highlight bash %}
+    $ mvn archetype:generate                             \
+      -DarchetypeGroupId=org.apache.flink              \
+      -DarchetypeArtifactId=flink-quickstart-java            \
+      -DarchetypeVersion={{site.version}}
+    {% endhighlight %}
+        This allows you to <strong>name your newly created project</strong>. It will interactively ask you for the groupId, artifactId, and package name.
+    </div>
+</div>
+
+## Inspect Project
+
+There will be a new directory in your working directory. If you've used the _curl_ approach, the directory is called `quickstart`. Otherwise, it has the name of your artifactId.
+
+The sample project is a __Maven project__, which contains two classes. _Job_ is a basic skeleton program and _WordCountJob_ a working example. Please note that the _main_ method of both classes allows you to start Flink in a development/testing mode.
+
+We recommend __importing this project into your IDE__ to develop and test it. If you use Eclipse, the [m2e plugin](http://www.eclipse.org/m2e/) allows you to [import Maven projects](http://books.sonatype.com/m2eclipse-book/reference/creating-sect-importing-projects.html#fig-creating-import). Some Eclipse bundles include that plugin by default, others require you to install it manually. The IntelliJ IDE also supports Maven projects out of the box.
+
+
+A note to Mac OS X users: The default JVM heap size for Java is too small for Flink. You have to manually increase it. In Eclipse, choose "Run Configurations" -> Arguments and write "-Xmx800m" into the "VM Arguments" box.
+
+## Build Project
+
+If you want to __build your project__, go to your project directory and issue the `mvn clean install -Pbuild-jar` command. You will __find a jar__ that runs on every Flink cluster in __target/your-artifact-id-1.0-SNAPSHOT.jar__. There is also a fat jar, __target/your-artifact-id-1.0-SNAPSHOT-flink-fat-jar.jar__, which
+also contains all dependencies that get added to the Maven project.
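+
+For example, assuming the default `quickstart` project name created by the curl script:
+
+~~~bash
+$ cd quickstart
+$ mvn clean install -Pbuild-jar
+$ ls target/*.jar    # lists the regular jar and the fat jar
+~~~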
+
+## Next Steps
+
+Write your application!
+
+The quickstart project contains a WordCount implementation, the "Hello World" of Big Data processing systems. The goal of WordCount is to determine the frequencies of words in a text, e.g., how often the terms "the" or "house" occur in all Wikipedia texts.
+
+__Sample Input__:
+
+~~~bash
+big data is big
+~~~
+
+__Sample Output__:
+
+~~~bash
+big 2
+data 1
+is 1
+~~~
+
+The following code shows the WordCount implementation from the Quickstart, which processes some text lines with two operators (FlatMap and Reduce) and prints the resulting words and counts to std-out.
+
+~~~java
+public class WordCount {
+  
+  public static void main(String[] args) throws Exception {
+    
+    // set up the execution environment
+    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+    
+    // get input data
+    DataSet<String> text = env.fromElements(
+        "To be, or not to be,--that is the question:--",
+        "Whether 'tis nobler in the mind to suffer",
+        "The slings and arrows of outrageous fortune",
+        "Or to take arms against a sea of troubles,"
+        );
+    
+    DataSet<Tuple2<String, Integer>> counts = 
+        // split up the lines in pairs (2-tuples) containing: (word,1)
+        text.flatMap(new LineSplitter())
+        // group by the tuple field "0" and sum up tuple field "1"
+        .groupBy(0)
+        .aggregate(Aggregations.SUM, 1);
+
+    // emit result
+    counts.print();
+    
+    // execute program
+    env.execute("WordCount Example");
+  }
+}
+~~~
+
+The operations are defined by specialized classes, here the LineSplitter class.
+
+~~~java
+public class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
+
+  @Override
+  public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
+    // normalize and split the line into words
+    String[] tokens = value.toLowerCase().split("\\W+");
+    
+    // emit the pairs
+    for (String token : tokens) {
+      if (token.length() > 0) {
+        out.collect(new Tuple2<String, Integer>(token, 1));
+      }
+    }
+  }
+}
+~~~
+
+{% gh_link /flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/wordcount/WordCount.java "Check GitHub" %} for the full example code.
+
+For a complete overview of our API, have a look at the [Programming Guide](programming_guide.html) and [further example programs](examples.html). If you have any trouble, ask on our [Mailing List](http://mail-archives.apache.org/mod_mbox/flink-dev/). We are happy to provide help.
+

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/quickstart/run_example_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/quickstart/run_example_quickstart.md b/docs/quickstart/run_example_quickstart.md
new file mode 100644
index 0000000..b75e07c
--- /dev/null
+++ b/docs/quickstart/run_example_quickstart.md
@@ -0,0 +1,155 @@
+---
+title: "Quick Start: Run K-Means Example"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+This guide walks you through the steps of executing an example program ([K-Means clustering](http://en.wikipedia.org/wiki/K-means_clustering)) on Flink. On the way, you will see a visualization of the program and the optimized execution plan, and track the progress of its execution.
+
+## Setup Flink
+Follow the [instructions](setup_quickstart.html) to set up Flink and enter the root directory of your Flink setup.
+
+## Generate Input Data
+Flink contains a data generator for K-Means.
+
+~~~bash
+# Assuming you are in the root directory of your Flink setup
+mkdir kmeans
+cd kmeans
+# Run data generator
+java -cp ../examples/flink-java-examples-*-KMeans.jar org.apache.flink.examples.java.clustering.util.KMeansDataGenerator 500 10 0.08
+cp /tmp/points .
+cp /tmp/centers .
+~~~
+
+The generator has the following arguments:
+
+~~~bash
+KMeansDataGenerator <numberOfDataPoints> <numberOfClusterCenters> [<relative stddev>] [<centroid range>] [<seed>]
+~~~
+
+The _relative standard deviation_ is an interesting tuning parameter. It determines the closeness of the points to randomly generated centers.
+
+The `kmeans/` directory should now contain two files: `centers` and `points`. The `points` file contains the points to cluster and the `centers` file contains initial cluster centers.
+
+
+## Inspect the Input Data
+Use the `plotPoints.py` tool to review the generated data points. [Download Python Script](quickstart/plotPoints.py)
+
+~~~ bash
+python plotPoints.py points ./points input
+~~~ 
+
+Note: You might have to install [matplotlib](http://matplotlib.org/) (`python-matplotlib` package on Ubuntu) to use the Python script.
+
+You can review the input data stored in the `input-plot.pdf`, for example with Evince (`evince input-plot.pdf`).
+
+The following overview presents the impact of the different standard deviations on the input data.
+
+|relative stddev = 0.03|relative stddev = 0.08|relative stddev = 0.15|
+|:--------------------:|:--------------------:|:--------------------:|
+|<img src="_img/quickstart-example/kmeans003.png" alt="example1" style="width: 275px;"/>|<img src="_img/quickstart-example/kmeans008.png" alt="example2" style="width: 275px;"/>|<img src="_img/quickstart-example/kmeans015.png" alt="example3" style="width: 275px;"/>|
+
+
+## Start Flink
+Start Flink and the web job submission client on your local machine.
+
+~~~ bash
+# return to the Flink root directory
+cd ..
+# start Flink
+./bin/start-local.sh
+# Start the web client
+./bin/start-webclient.sh
+~~~
+
+## Inspect and Run the K-Means Example Program
+The Flink web client allows you to submit Flink programs using a graphical user interface.
+
+<div class="row" style="padding-top:15px">
+	<div class="col-md-6">
+		<a data-lightbox="compiler" href="_img/quickstart-example/run-webclient.png" data-lightbox="example-1"><img class="img-responsive" src="_img/quickstart-example/run-webclient.png" /></a>
+	</div>
+	<div class="col-md-6">
+		1. Open web client on  <a href="http://localhost:8080/launch.html">localhost:8080</a> <br>
+		2. Upload the K-Means job JAR file.
+			{% highlight bash %}
+			./examples/flink-java-examples-*-KMeans.jar
+			{% endhighlight %} <br>
+		3. Select it in the left box to see how the operators in the plan are connected to each other. <br>
+		4. Enter the arguments in the lower left box:
+			{% highlight bash %}
+			file://<pathToFlink>/kmeans/points file://<pathToFlink>/kmeans/centers file://<pathToFlink>/kmeans/result 10
+			{% endhighlight %}
+			For example:
+			{% highlight bash %}
+			file:///tmp/flink/kmeans/points file:///tmp/flink/kmeans/centers file:///tmp/flink/kmeans/result 10
+			{% endhighlight %}
+	</div>
+</div>
+<hr>
+<div class="row" style="padding-top:15px">
+	<div class="col-md-6">
+		<a data-lightbox="compiler" href="_img/quickstart-example/compiler-webclient-new.png" data-lightbox="example-1"><img class="img-responsive" src="_img/quickstart-example/compiler-webclient-new.png" /></a>
+	</div>
+
+	<div class="col-md-6">
+		1. Press the <b>RunJob</b> button to see the optimizer plan. <br>
+		2. Inspect the operators and see the properties (input sizes, cost estimation) determined by the optimizer.
+	</div>
+</div>
+<hr>
+<div class="row" style="padding-top:15px">
+	<div class="col-md-6">
+		<a data-lightbox="compiler" href="_img/quickstart-example/jobmanager-running-new.png" data-lightbox="example-1"><img class="img-responsive" src="_img/quickstart-example/jobmanager-running-new.png" /></a>
+	</div>
+	<div class="col-md-6">
+		1. Press the <b>Continue</b> button to start executing the job. <br>
+		2. <a href="http://localhost:8080/launch.html">Open Flink's monitoring interface</a> to see the job's progress. (Due to the small input data, the job will finish really quickly!)<br>
+		3. Once the job has finished, you can analyze the runtime of the individual operators.
+	</div>
+</div>
+
+## Shutdown Flink
+Stop Flink when you are done.
+
+~~~ bash
+# stop Flink
+./bin/stop-local.sh
+# Stop the Flink web client
+./bin/stop-webclient.sh
+~~~
+
+## Analyze the Result
+Use the [Python Script](quickstart/plotPoints.py) again to visualize the result.
+
+~~~bash
+cd kmeans
+python plotPoints.py result ./result clusters
+~~~
+
+The following three pictures show the results for the sample input above. Play around with the parameters (number of iterations, number of clusters) to see how they affect the result.
+
+
+|relative stddev = 0.03|relative stddev = 0.08|relative stddev = 0.15|
+|:--------------------:|:--------------------:|:--------------------:|
+|<img src="_img/quickstart-example/result003.png" alt="example1" style="width: 275px;"/>|<img src="_img/quickstart-example/result008.png" alt="example2" style="width: 275px;"/>|<img src="_img/quickstart-example/result015.png" alt="example3" style="width: 275px;"/>|
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/quickstart/scala_api_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/quickstart/scala_api_quickstart.md b/docs/quickstart/scala_api_quickstart.md
new file mode 100644
index 0000000..771acfc
--- /dev/null
+++ b/docs/quickstart/scala_api_quickstart.md
@@ -0,0 +1,136 @@
+---
+title: "Quickstart: Scala API"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+Start working on your Flink Scala program in a few simple steps.
+
+## Requirements
+
+The only requirements are working __Maven 3.0.4__ (or higher) and __Java 6.x__ (or higher) installations.
+
+
+## Create Project
+
+Use one of the following commands to __create a project__:
+
+<ul class="nav nav-tabs" style="border-bottom: none;">
+    <li class="active"><a href="#quickstart-script" data-toggle="tab">Run the <strong>quickstart script</strong></a></li>
+    <li><a href="#maven-archetype" data-toggle="tab">Use <strong>Maven archetypes</strong></a></li>
+</ul>
+<div class="tab-content">
+    <div class="tab-pane active" id="quickstart-script">
+{% highlight bash %}
+$ curl http://flink.apache.org/q/quickstart-scala.sh | bash
+{% endhighlight %}
+    </div>
+    <div class="tab-pane" id="maven-archetype">
+{% highlight bash %}
+$ mvn archetype:generate                             \
+  -DarchetypeGroupId=org.apache.flink              \
+  -DarchetypeArtifactId=flink-quickstart-scala           \
+  -DarchetypeVersion={{site.version}}
+{% endhighlight %}
+    This allows you to <strong>name your newly created project</strong>. It will interactively ask you for the groupId, artifactId, and package name.
+    </div>
+</div>
+
+
+## Inspect Project
+
+There will be a new directory in your working directory. If you've used the _curl_ approach, the directory is called `quickstart`. Otherwise, it has the name of your artifactId.
+
+The sample project is a __Maven project__, which contains two classes. _Job_ is a basic skeleton program and _WordCountJob_ a working example. Please note that the _main_ method of both classes allows you to start Flink in a development/testing mode.
+
+We recommend __importing this project into your IDE__. For Eclipse, you need the following plugins, which you can install from the provided Eclipse Update Sites:
+
+* _Eclipse 4.x_
+  * [Scala IDE](http://download.scala-ide.org/sdk/e38/scala210/stable/site)
+  * [m2eclipse-scala](http://alchim31.free.fr/m2e-scala/update-site)
+  * [Build Helper Maven Plugin](https://repository.sonatype.org/content/repositories/forge-sites/m2e-extras/0.15.0/N/0.15.0.201206251206/)
+* _Eclipse 3.7_
+  * [Scala IDE](http://download.scala-ide.org/sdk/e37/scala210/stable/site)
+  * [m2eclipse-scala](http://alchim31.free.fr/m2e-scala/update-site)
+  * [Build Helper Maven Plugin](https://repository.sonatype.org/content/repositories/forge-sites/m2e-extras/0.14.0/N/0.14.0.201109282148/)
+
+The IntelliJ IDE also supports Maven and offers a plugin for Scala development.
+
+
+## Build Project
+
+If you want to __build your project__, go to your project directory and issue the `mvn clean package -Pbuild-jar` command. You will __find a jar__ that runs on every Flink cluster in __target/your-artifact-id-1.0-SNAPSHOT.jar__. There is also a fat jar, __target/your-artifact-id-1.0-SNAPSHOT-flink-fat-jar.jar__, which
+also contains all dependencies that get added to the Maven project.
+
+## Next Steps
+
+Write your application!
+
+The quickstart project contains a WordCount implementation, the "Hello World" of Big Data processing systems. The goal of WordCount is to determine the frequencies of words in a text, e.g., how often the terms "the" or "house" occur in all Wikipedia texts.
+
+__Sample Input__:
+
+~~~bash
+big data is big
+~~~
+
+__Sample Output__:
+
+~~~bash
+big 2
+data 1
+is 1
+~~~
+
+The following code shows the WordCount implementation from the Quickstart, which processes some text lines with two operators (FlatMap and Reduce) and prints the resulting words and counts to std-out.
+
+~~~scala
+object WordCountJob {
+  def main(args: Array[String]) {
+
+    // set up the execution environment
+    val env = ExecutionEnvironment.getExecutionEnvironment
+
+    // get input data
+    val text = env.fromElements("To be, or not to be,--that is the question:--",
+      "Whether 'tis nobler in the mind to suffer", "The slings and arrows of outrageous fortune",
+      "Or to take arms against a sea of troubles,")
+
+    val counts = text.flatMap { _.toLowerCase.split("\\W+") }
+      .map { (_, 1) }
+      .groupBy(0)
+      .sum(1)
+
+    // emit result
+    counts.print()
+
+    // execute program
+    env.execute("WordCount Example")
+  }
+}
+~~~
+
+{% gh_link /flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/wordcount/WordCount.scala "Check GitHub" %} for the full example code.
+
+For a complete overview of our API, have a look at the [Programming Guide](programming_guide.html) and [further example programs](examples.html). If you have any trouble, ask on our [Mailing List](http://mail-archives.apache.org/mod_mbox/flink-dev/). We are happy to provide help.
+
+

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/quickstart/setup_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/quickstart/setup_quickstart.md b/docs/quickstart/setup_quickstart.md
new file mode 100644
index 0000000..9fd60f1
--- /dev/null
+++ b/docs/quickstart/setup_quickstart.md
@@ -0,0 +1,155 @@
+---
+title: "Quickstart: Setup"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+Get Flink up and running in a few simple steps.
+
+## Requirements
+
+Flink runs on __Linux, Mac OS X, and Windows__. To be able to run Flink, the
+only requirement is to have a working __Java 6.x__ (or higher)
+installation. Windows users, please take a look at the
+[Flink on Windows](local_setup.html#flink-on-windows) guide which describes
+how to run Flink on Windows for local setups.
+
+## Download
+Download the ready-to-run binary package. Choose the Flink distribution that __matches your Hadoop version__. If you are unsure which version to choose or you just want to run locally, pick the package for Hadoop 1.2.
+
+<ul class="nav nav-tabs">
+  <li class="active"><a href="#bin-hadoop1" data-toggle="tab">Hadoop 1.2</a></li>
+  <li><a href="#bin-hadoop2" data-toggle="tab">Hadoop 2 (YARN)</a></li>
+</ul>
+<p>
+<div class="tab-content text-center">
+  <div class="tab-pane active" id="bin-hadoop1">
+    <a class="btn btn-info btn-lg" onclick="_gaq.push(['_trackEvent','Action','download-quickstart-setup-1',this.href]);" href="{{site.FLINK_DOWNLOAD_URL_HADOOP1_STABLE}}"><i class="icon-download"> </i> Download Flink for Hadoop 1.2</a>
+  </div>
+  <div class="tab-pane" id="bin-hadoop2">
+    <a class="btn btn-info btn-lg" onclick="_gaq.push(['_trackEvent','Action','download-quickstart-setup-2',this.href]);" href="{{site.FLINK_DOWNLOAD_URL_HADOOP2_STABLE}}"><i class="icon-download"> </i> Download Flink for Hadoop 2</a>
+  </div>
+</div>
+</p>
+
+
+## Start
+  
+1. Go to the download directory.
+2. Unpack the downloaded archive.
+3. Start Flink.
+
+
+~~~bash
+$ cd ~/Downloads        # Go to download directory
+$ tar xzf flink-*.tgz   # Unpack the downloaded archive
+$ cd flink-{{site.version}}
+$ bin/start-local.sh    # Start Flink
+~~~
+
+Check the __JobManager's web frontend__ at [http://localhost:8081](http://localhost:8081) and make
+sure everything is up and running.
+
+## Run Example
+
+Run the __Word Count example__ to see Flink at work.
+
+* __Download test data__:
+
+  ~~~bash
+  $ wget -O hamlet.txt http://www.gutenberg.org/cache/epub/1787/pg1787.txt
+  ~~~ 
+
+* You now have a text file called _hamlet.txt_ in your working directory.
+* __Start the example program__:
+  
+  ~~~bash
+  $ bin/flink run ./examples/flink-java-examples-{{site.version}}-WordCount.jar file://`pwd`/hamlet.txt file://`pwd`/wordcount-result.txt
+  ~~~
+
+* You will find a file called __wordcount-result.txt__ in your current directory.
+  
+
+## Cluster Setup
+  
+__Running Flink on a cluster__ is as easy as running it locally. Having __passwordless SSH__ and
+__the same directory structure__ on all your cluster nodes lets you use our scripts to control
+everything.
+
+1. Copy the unpacked __flink__ directory from the downloaded archive to the same file system path
+on each node of your setup.
+2. Choose a __master node__ (JobManager) and set the `jobmanager.rpc.address` key in
+`conf/flink-conf.yaml` to its IP or hostname. Make sure that all nodes in your cluster have the same
+`jobmanager.rpc.address` configured.
+3. Add the IPs or hostnames (one per line) of all __worker nodes__ (TaskManager) to the slaves file
+`conf/slaves`.
+
+You can now __start the cluster__ at your master node with `bin/start-cluster.sh`.
+
+
+The following __example__ illustrates the setup with three nodes (with IP addresses from _10.0.0.1_
+to _10.0.0.3_ and hostnames _master_, _worker1_, _worker2_) and shows the contents of the
+configuration files, which need to be accessible at the same path on all machines:
+
+<div class="row">
+  <div class="col-md-6 text-center">
+    <img src="_img/quickstart_cluster.png" style="width: 85%">
+  </div>
+<div class="col-md-6">
+  <div class="row">
+    <p class="lead text-center">
+      /path/to/<strong>flink/conf/<br>flink-conf.yaml</strong>
+    <pre>jobmanager.rpc.address: 10.0.0.1</pre>
+    </p>
+  </div>
+<div class="row" style="margin-top: 1em;">
+  <p class="lead text-center">
+    /path/to/<strong>flink/<br>conf/slaves</strong>
+  <pre>
+10.0.0.2
+10.0.0.3</pre>
+  </p>
+</div>
+</div>
+</div>
+
+Have a look at the [Configuration](config.html) section of the documentation to see other available configuration options.
+For Flink to run efficiently, a few configuration values need to be set.
+
+In particular, the following are very important configuration values:
+
+ * the amount of available memory per TaskManager (`taskmanager.heap.mb`),
+ * the number of available CPUs per machine (`taskmanager.numberOfTaskSlots`),
+ * the total number of CPUs in the cluster (`parallelism.default`), and
+ * the temporary directories (`taskmanager.tmp.dirs`).
+
+## Flink on YARN
+You can easily deploy Flink on your existing __YARN cluster__. 
+
+1. Download the __Flink YARN package__ with the YARN client: [Flink for YARN]({{site.FLINK_DOWNLOAD_URL_YARN_STABLE}})
+2. Make sure your __HADOOP_HOME__ (or _YARN_CONF_DIR_ or _HADOOP_CONF_DIR_) __environment variable__ is set to read your YARN and HDFS configuration.
+3. Run the __YARN client__ with: `./bin/yarn-session.sh`. You can run the client with options `-n 10 -tm 8192` to allocate 10 TaskManagers with 8GB of memory each.
+
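+For example, a session with the options mentioned above could be started like this (the `HADOOP_HOME` path is illustrative):
+
+~~~bash
+$ export HADOOP_HOME=/usr/lib/hadoop
+$ ./bin/yarn-session.sh -n 10 -tm 8192
+~~~
+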
+For __more detailed instructions__, check out the programming guides and examples.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/run_example_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/run_example_quickstart.md b/docs/run_example_quickstart.md
deleted file mode 100644
index 999a249..0000000
--- a/docs/run_example_quickstart.md
+++ /dev/null
@@ -1,155 +0,0 @@
----
-title: "Quick Start: Run K-Means Example"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-This guide walks you through the steps of executing an example program ([K-Means clustering](http://en.wikipedia.org/wiki/K-means_clustering)) on Flink. On the way, you will see a visualization of the program and the optimized execution plan, and track the progress of its execution.
-
-## Setup Flink
-Follow the [instructions](setup_quickstart.html) to set up Flink and enter the root directory of your Flink setup.
-
-## Generate Input Data
-Flink contains a data generator for K-Means.
-
-~~~bash
-# Assuming you are in the root directory of your Flink setup
-mkdir kmeans
-cd kmeans
-# Run data generator
-java -cp ../examples/flink-java-examples-*-KMeans.jar org.apache.flink.examples.java.clustering.util.KMeansDataGenerator 500 10 0.08
-cp /tmp/points .
-cp /tmp/centers .
-~~~
-
-The generator has the following arguments:
-
-~~~bash
-KMeansDataGenerator <numberOfDataPoints> <numberOfClusterCenters> [<relative stddev>] [<centroid range>] [<seed>]
-~~~
-
-The _relative standard deviation_ is an interesting tuning parameter. It determines the closeness of the points to randomly generated centers.
-
-The `kmeans/` directory should now contain two files: `centers` and `points`. The `points` file contains the points to cluster and the `centers` file contains initial cluster centers.
-
-
-## Inspect the Input Data
-Use the `plotPoints.py` tool to review the generated data points. [Download Python Script](quickstart/plotPoints.py)
-
-~~~ bash
-python plotPoints.py points ./points input
-~~~ 
-
-Note: You might have to install [matplotlib](http://matplotlib.org/) (`python-matplotlib` package on Ubuntu) to use the Python script.
-
-You can review the input data stored in the `input-plot.pdf`, for example with Evince (`evince input-plot.pdf`).
-
-The following overview presents the impact of the different standard deviations on the input data.
-
-|relative stddev = 0.03|relative stddev = 0.08|relative stddev = 0.15|
-|:--------------------:|:--------------------:|:--------------------:|
-|<img src="img/quickstart-example/kmeans003.png" alt="example1" style="width: 275px;"/>|<img src="img/quickstart-example/kmeans008.png" alt="example2" style="width: 275px;"/>|<img src="img/quickstart-example/kmeans015.png" alt="example3" style="width: 275px;"/>|
-
-
-## Start Flink
-Start Flink and the web job submission client on your local machine.
-
-~~~ bash
-# return to the Flink root directory
-cd ..
-# start Flink
-./bin/start-local.sh
-# Start the web client
-./bin/start-webclient.sh
-~~~
-
-## Inspect and Run the K-Means Example Program
-The Flink web client allows you to submit Flink programs using a graphical user interface.
-
-<div class="row" style="padding-top:15px">
-	<div class="col-md-6">
-		<a data-lightbox="compiler" href="img/quickstart-example/run-webclient.png" data-lightbox="example-1"><img class="img-responsive" src="img/quickstart-example/run-webclient.png" /></a>
-	</div>
-	<div class="col-md-6">
-		1. Open web client on  <a href="http://localhost:8080/launch.html">localhost:8080</a> <br>
-		2. Upload the K-Means job JAR file.
-			{% highlight bash %}
-			./examples/flink-java-examples-*-KMeans.jar
-			{% endhighlight %} <br>
-		3. Select it in the left box to see how the operators in the plan are connected to each other. <br>
-		4. Enter the arguments in the lower left box:
-			{% highlight bash %}
-			file://<pathToFlink>/kmeans/points file://<pathToFlink>/kmeans/centers file://<pathToFlink>/kmeans/result 10
-			{% endhighlight %}
-			For example:
-			{% highlight bash %}
-			file:///tmp/flink/kmeans/points file:///tmp/flink/kmeans/centers file:///tmp/flink/kmeans/result 10
-			{% endhighlight %}
-	</div>
-</div>
-<hr>
-<div class="row" style="padding-top:15px">
-	<div class="col-md-6">
-		<a data-lightbox="compiler" href="img/quickstart-example/compiler-webclient-new.png" data-lightbox="example-1"><img class="img-responsive" src="img/quickstart-example/compiler-webclient-new.png" /></a>
-	</div>
-
-	<div class="col-md-6">
-		1. Press the <b>RunJob</b> button to see the optimizer plan. <br>
-		2. Inspect the operators and see the properties (input sizes, cost estimation) determined by the optimizer.
-	</div>
-</div>
-<hr>
-<div class="row" style="padding-top:15px">
-	<div class="col-md-6">
-		<a data-lightbox="compiler" href="img/quickstart-example/jobmanager-running-new.png" data-lightbox="example-1"><img class="img-responsive" src="img/quickstart-example/jobmanager-running-new.png" /></a>
-	</div>
-	<div class="col-md-6">
-		1. Press the <b>Continue</b> button to start executing the job. <br>
-		2. <a href="http://localhost:8080/launch.html">Open Flink's monitoring interface</a> to see the job's progress. (Due to the small input data, the job will finish really quickly!)<br>
-		3. Once the job has finished, you can analyze the runtime of the individual operators.
-	</div>
-</div>
-
-## Shutdown Flink
-Stop Flink when you are done.
-
-~~~ bash
-# stop Flink
-./bin/stop-local.sh
-# Stop the Flink web client
-./bin/stop-webclient.sh
-~~~
-
-## Analyze the Result
-Use the [Python Script](quickstart/plotPoints.py) again to visualize the result.
-
-~~~bash
-cd kmeans
-python plotPoints.py result ./result clusters
-~~~
-
-The following three pictures show the results for the sample input above. Play around with the parameters (number of iterations, number of clusters) to see how they affect the result.
-
-
-|relative stddev = 0.03|relative stddev = 0.08|relative stddev = 0.15|
-|:--------------------:|:--------------------:|:--------------------:|
-|<img src="img/quickstart-example/result003.png" alt="example1" style="width: 275px;"/>|<img src="img/quickstart-example/result008.png" alt="example2" style="width: 275px;"/>|<img src="img/quickstart-example/result015.png" alt="example3" style="width: 275px;"/>|
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/scala_api_quickstart.md
----------------------------------------------------------------------
diff --git a/docs/scala_api_quickstart.md b/docs/scala_api_quickstart.md
deleted file mode 100644
index 48767cc..0000000
--- a/docs/scala_api_quickstart.md
+++ /dev/null
@@ -1,136 +0,0 @@
----
-title: "Quickstart: Scala API"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-Start working on your Flink Scala program in a few simple steps.
-
-## Requirements
-
-The only requirements are working __Maven 3.0.4__ (or higher) and __Java 6.x__ (or higher) installations.
-
-
-## Create Project
-
-Use one of the following commands to __create a project__:
-
-<ul class="nav nav-tabs" style="border-bottom: none;">
-    <li class="active"><a href="#quickstart-script" data-toggle="tab">Run the <strong>quickstart script</strong></a></li>
-    <li><a href="#maven-archetype" data-toggle="tab">Use <strong>Maven archetypes</strong></a></li>
-</ul>
-<div class="tab-content">
-    <div class="tab-pane active" id="quickstart-script">
-{% highlight bash %}
-$ curl http://flink.apache.org/q/quickstart-scala.sh | bash
-{% endhighlight %}
-    </div>
-    <div class="tab-pane" id="maven-archetype">
-{% highlight bash %}
-$ mvn archetype:generate                             \
-  -DarchetypeGroupId=org.apache.flink              \
-  -DarchetypeArtifactId=flink-quickstart-scala           \
-  -DarchetypeVersion={{site.FLINK_VERSION_SHORT}}
-{% endhighlight %}
-    This allows you to <strong>name your newly created project</strong>. It will interactively ask you for the groupId, artifactId, and package name.
-    </div>
-</div>
-
-
-## Inspect Project
-
-There will be a new directory in your working directory. If you've used the _curl_ approach, the directory is called `quickstart`. Otherwise, it has the name of your artifactId.
-
-The sample project is a __Maven project__, which contains two classes. _Job_ is a basic skeleton program and _WordCountJob_ a working example. Please note that the _main_ methods of both classes allow you to start Flink in a development/testing mode.
-
-We recommend __importing this project into your IDE__. For Eclipse, you need the following plugins, which you can install from the provided Eclipse Update Sites:
-
-* _Eclipse 4.x_
-  * [Scala IDE](http://download.scala-ide.org/sdk/e38/scala210/stable/site)
-  * [m2eclipse-scala](http://alchim31.free.fr/m2e-scala/update-site)
-  * [Build Helper Maven Plugin](https://repository.sonatype.org/content/repositories/forge-sites/m2e-extras/0.15.0/N/0.15.0.201206251206/)
-* _Eclipse 3.7_
-  * [Scala IDE](http://download.scala-ide.org/sdk/e37/scala210/stable/site)
-  * [m2eclipse-scala](http://alchim31.free.fr/m2e-scala/update-site)
-  * [Build Helper Maven Plugin](https://repository.sonatype.org/content/repositories/forge-sites/m2e-extras/0.14.0/N/0.14.0.201109282148/)
-
-The IntelliJ IDE also supports Maven and offers a plugin for Scala development.
-
-
-## Build Project
-
-If you want to __build your project__, go to your project directory and issue the `mvn clean package -Pbuild-jar` command. You will __find a jar__ that runs on every Flink cluster in __target/your-artifact-id-1.0-SNAPSHOT.jar__. There is also a fat jar, __target/your-artifact-id-1.0-SNAPSHOT-flink-fat-jar.jar__, which additionally contains all dependencies that were added to the Maven project.
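-
-A complete build session might look like the following; this is only a sketch, assuming the default `quickstart` directory name from the curl approach:
-
-~~~bash
-cd quickstart
-mvn clean package -Pbuild-jar
-# inspect the produced jars
-ls target/*.jar
-~~~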
-
-## Next Steps
-
-Write your application!
-
-The quickstart project contains a WordCount implementation, the "Hello World" of Big Data processing systems. The goal of WordCount is to determine the frequencies of words in a text, e.g., how often the terms "the" or "house" occur in all Wikipedia texts.
-
-__Sample Input__:
-
-~~~bash
-big data is big
-~~~
-
-__Sample Output__:
-
-~~~bash
-big 2
-data 1
-is 1
-~~~
-
-The following code shows the WordCount implementation from the quickstart, which processes some text lines with two operators (FlatMap and Reduce) and prints the resulting words and counts to std-out.
-
-~~~scala
-object WordCountJob {
-  def main(args: Array[String]) {
-
-    // set up the execution environment
-    val env = ExecutionEnvironment.getExecutionEnvironment
-
-    // get input data
-    val text = env.fromElements("To be, or not to be,--that is the question:--",
-      "Whether 'tis nobler in the mind to suffer", "The slings and arrows of outrageous fortune",
-      "Or to take arms against a sea of troubles,")
-
-    // split the lines into lower-case words, pair each word with a 1,
-    // then group by the word (field 0) and sum up the counts (field 1)
-    val counts = text.flatMap { _.toLowerCase.split("\\W+") }
-      .map { (_, 1) }
-      .groupBy(0)
-      .sum(1)
-
-    // emit result
-    counts.print()
-
-    // execute program
-    env.execute("WordCount Example")
-  }
-}
-~~~
-
-{% gh_link /flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/wordcount/WordCount.scala "Check GitHub" %} for the full example code.
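-
-If you have a local Flink instance running, you can also submit the packaged fat jar with the command-line client from your Flink installation directory. The following is only a sketch: the entry class and package name depend on what you chose during project creation (here the hypothetical `org.myorg.quickstart.WordCountJob`):
-
-~~~bash
-./bin/flink run -c org.myorg.quickstart.WordCountJob \
-    /path/to/your-artifact-id-1.0-SNAPSHOT-flink-fat-jar.jar
-~~~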
-
-For a complete overview of our API, have a look at the [Programming Guide](programming_guide.html) and [further example programs](examples.html). If you have any trouble, ask on our [Mailing List](http://mail-archives.apache.org/mod_mbox/flink-dev/). We are happy to provide help.
-
-

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/search-results.md
----------------------------------------------------------------------
diff --git a/docs/search-results.md b/docs/search-results.md
new file mode 100644
index 0000000..633c658
--- /dev/null
+++ b/docs/search-results.md
@@ -0,0 +1,35 @@
+---
+title:  "Search Results"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<script>
+  (function() {
+    var cx = '000888944958067045520:z90hn2izm0k';
+    var gcse = document.createElement('script');
+    gcse.type = 'text/javascript';
+    gcse.async = true;
+    gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
+        '//www.google.com/cse/cse.js?cx=' + cx;
+    var s = document.getElementsByTagName('script')[0];
+    s.parentNode.insertBefore(gcse, s);
+  })();
+</script>
+
+<gcse:searchresults-only></gcse:searchresults-only>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/search.md
----------------------------------------------------------------------
diff --git a/docs/search.md b/docs/search.md
deleted file mode 100644
index d7505ce..0000000
--- a/docs/search.md
+++ /dev/null
@@ -1,36 +0,0 @@
----
-title:  "Search"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-<script>
-  (function() {
-    var cx = '000888944958067045520:z90hn2izm0k';
-    var gcse = document.createElement('script');
-    gcse.type = 'text/javascript';
-    gcse.async = true;
-    gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
-        '//www.google.com/cse/cse.js?cx=' + cx;
-    var s = document.getElementsByTagName('script')[0];
-    s.parentNode.insertBefore(gcse, s);
-  })();
-</script>
-
-<gcse:searchresults-only></gcse:searchresults-only>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/setup/building.md
----------------------------------------------------------------------
diff --git a/docs/setup/building.md b/docs/setup/building.md
new file mode 100644
index 0000000..2fcf412
--- /dev/null
+++ b/docs/setup/building.md
@@ -0,0 +1,107 @@
+---
+title:  "Build Flink"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+In order to build Flink, you need the source code. Either download the source of a release or clone the git repository. In addition, you need Maven 3 and a JDK (Java Development Kit). Note that you cannot build Flink with Oracle JDK 6 due to an unresolved bug in the Oracle Java compiler. Builds work well with OpenJDK 6 and all Java 7 and 8 compilers.
+
+To clone from git, enter:
+
+~~~bash
+git clone {{ site.github_url }}
+~~~
+
+The simplest way of building Flink is by running:
+
+~~~bash
+cd flink
+mvn clean install -DskipTests
+~~~
+
+This instructs Maven (`mvn`) to first remove all existing builds (`clean`) and then create a new Flink binary (`install`). The `-DskipTests` flag prevents Maven from executing the unit tests.
+
+[Read more](http://maven.apache.org/) about Apache Maven.
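+
+After a successful build, the assembled Flink distribution can be found in the `flink-dist` module; as a rough sketch (the exact directory contents depend on the version you built):
+
+~~~bash
+ls flink-dist/target/
+~~~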
+
+
+
+## Build Flink for a specific Hadoop Version
+
+This section covers building Flink for a specific Hadoop version. Most users do not need to do this manually. The download page of Flink contains binary packages for common setups.
+
+The problem is that Flink uses HDFS and YARN, both of which are dependencies of Apache Hadoop. Many different versions of Hadoop exist (from both the upstream project and the different Hadoop distributions). With a wrong combination of versions, exceptions like the following can occur:
+
+~~~bash
+ERROR: The job was not successfully submitted to the nephele job manager:
+    org.apache.flink.nephele.executiongraph.GraphConversionException: Cannot compute input splits for TSV:
+    java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException:
+    Protocol message contained an invalid tag (zero).; Host Details :
+~~~
+
+There are two main versions of Hadoop that we need to differentiate:
+
+- Hadoop 1, with all versions starting with zero or one, like 0.20, 0.23 or 1.2.1.
+- Hadoop 2, with all versions starting with 2, like 2.2.0.
+
+The main differentiation between Hadoop 1 and Hadoop 2 is the availability of Hadoop YARN (Hadoop's cluster resource manager).
+
+By default, Flink uses the Hadoop 2 dependencies.
+
+**To build Flink for Hadoop 1**, issue the following command:
+
+~~~bash
+mvn clean install -DskipTests -Dhadoop.profile=1
+~~~
+
+The `-Dhadoop.profile=1` flag instructs Maven to build Flink for Hadoop 1. Note that the set of features included in Flink changes when using a different Hadoop profile. In particular, support for YARN and the built-in HBase support are not available in Hadoop 1 builds.
+
+
+You can also **specify a specific Hadoop version to build against**:
+
+~~~bash
+mvn clean install -DskipTests -Dhadoop.version=2.4.1
+~~~
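+
+Both flags can also be combined; for example, to pin the exact Hadoop 1 version (here 1.2.1, the hadoop1 default mentioned in the Background section below):
+
+~~~bash
+mvn clean install -DskipTests -Dhadoop.profile=1 -Dhadoop.version=1.2.1
+~~~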
+
+
+**To build Flink against a vendor specific Hadoop version**, issue the following command:
+
+~~~bash
+mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.2.0-cdh5.0.0-beta-2
+~~~
+
+The `-Pvendor-repos` flag activates a Maven [build profile](http://maven.apache.org/guides/introduction/introduction-to-profiles.html) that includes the repositories of popular Hadoop vendors such as Cloudera, Hortonworks, or MapR.
+
+**Build Flink for `hadoop2` versions before 2.2.0**
+
+Maven will automatically build Flink with its YARN client. However, there were some changes in Hadoop versions before the 2.2.0 release that are not supported by Flink's YARN client. Therefore, you can disable building the YARN client with the profile switch `-P!include-yarn`.
+
+So if you are building Flink for Hadoop `2.0.0-alpha`, use the following command (note that the `!` needs to be quoted in interactive bash shells, since it otherwise triggers history expansion):
+
+~~~bash
+mvn clean install -DskipTests '-P!include-yarn' -Dhadoop.version=2.0.0-alpha
+~~~
+
+## Background
+
+The builds with Maven are controlled by [properties](http://maven.apache.org/pom.html#Properties) and [build profiles](http://maven.apache.org/guides/introduction/introduction-to-profiles.html).
+There are two profiles, one for hadoop1 and one for hadoop2. When the hadoop2 profile is enabled (the default), the system will also build the YARN client.
+
+To enable the hadoop1 profile, set `-Dhadoop.profile=1` when building.
+Depending on the profile, there are two Hadoop versions, set via properties: for "hadoop1" the default is 1.2.1, for "hadoop2" it is 2.2.0.
+
+You can change these versions with the `hadoop-two.version` (or `hadoop-one.version`) property. For example `-Dhadoop-two.version=2.4.0`.
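+
+Spelled out as a full build command, that example reads:
+
+~~~bash
+mvn clean install -DskipTests -Dhadoop-two.version=2.4.0
+~~~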
+

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/setup/cluster_setup.md
----------------------------------------------------------------------
diff --git a/docs/setup/cluster_setup.md b/docs/setup/cluster_setup.md
new file mode 100644
index 0000000..dcfeb94
--- /dev/null
+++ b/docs/setup/cluster_setup.md
@@ -0,0 +1,346 @@
+---
+title:  "Cluster Setup"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This documentation provides instructions on how to run
+Flink in a fully distributed fashion on a static (but possibly
+heterogeneous) cluster.
+
+This involves two steps: first, installing and configuring Flink, and
+second, installing and configuring the [Hadoop Distributed
+Filesystem](http://hadoop.apache.org/) (HDFS).
+
+* This will be replaced by the TOC
+{:toc}
+
+## Preparing the Cluster
+
+### Software Requirements
+
+Flink runs on all *UNIX-like environments*, e.g. **Linux**, **Mac OS X**,
+and **Cygwin** (for Windows) and expects the cluster to consist of **one master
+node** and **one or more worker nodes**. Before you start to set up the system,
+make sure you have the following software installed **on each node**:
+
+- **Java 1.6.x** or higher,
+- **ssh** (sshd must be running to use the Flink scripts that manage
+  remote components)
+
+If your cluster does not fulfill these software requirements, you will need to
+install or upgrade the respective software.
+
+For example, on Ubuntu Linux, type in the following commands to install Java and
+ssh:
+
+~~~bash
+sudo apt-get install ssh 
+sudo apt-get install openjdk-7-jre
+~~~
+
+You can check the correct installation of Java by issuing the following command:
+
+~~~bash
+java -version
+~~~
+
+The command should output something comparable to the following on every node of
+your cluster (depending on your Java version, there may be small differences):
+
+~~~bash
+java version "1.6.0_22"
+Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
+Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
+~~~
+
+To make sure the ssh daemon is running properly, you can use the command
+
+~~~bash
+ps aux | grep sshd
+~~~
+
+Something comparable to the following line should appear in the output
+of the command on every host of your cluster:
+
+~~~bash
+root       894  0.0  0.0  49260   320 ?        Ss   Jan09   0:13 /usr/sbin/sshd
+~~~
+
+### Configuring Remote Access with ssh
+
+In order to start/stop the remote processes, the master node requires access via
+ssh to the worker nodes. It is most convenient to use ssh's public key
+authentication for this. To set up public key authentication, log on to the
+master as the user who will later execute all the Flink components. **The
+same user (i.e. a user with the same user name) must also exist on all worker
+nodes**. For the remainder of this instruction we will refer to this user as
+*flink*. Using the super user *root* is highly discouraged for security
+reasons.
+
+Once you have logged in to the master node as the desired user, you must
+generate a new public/private key pair. The following command will create a new
+public/private key pair in the *.ssh* directory inside the home directory of
+the user *flink*. See the ssh-keygen man page for more details. Note that
+the private key is not protected by a passphrase.
+
+~~~bash
+ssh-keygen -b 2048 -P '' -f ~/.ssh/id_rsa
+~~~
+
+Next, copy/append the content of the file *.ssh/id_rsa.pub* to your
+authorized_keys file. The content of the authorized_keys file defines which
+public keys are considered trustworthy during the public key authentication
+process. On most systems the appropriate command is
+
+~~~bash
+cat .ssh/id_rsa.pub >> .ssh/authorized_keys
+~~~
+
+On some Linux systems, the authorized keys file may also be expected by the ssh
+daemon under *.ssh/authorized_keys2*. In either case, you should make sure the
+file only contains those public keys which you consider trustworthy for each
+node of your cluster.
+
+Finally, the authorized keys file must be copied to every worker node of your
+cluster. You can do this by repeatedly typing in
+
+~~~bash
+scp .ssh/authorized_keys <worker>:~/.ssh/
+~~~
+
+and replacing *\<worker\>* with the host name of the respective worker node.
+After having finished the copy process, you should be able to log on to each
+worker node from your master node via ssh without a password.
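+
+A quick way to verify password-less access is to run a trivial command on every
+worker from the master; a sketch with the hypothetical host names `worker1` and
+`worker2`:
+
+~~~bash
+for host in worker1 worker2; do
+  ssh $host hostname
+done
+~~~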
+
+### Setting JAVA_HOME on each Node
+
+Flink requires the `JAVA_HOME` environment variable to be set on the
+master and all worker nodes and to point to the directory of your Java
+installation.
+
+You can set this variable in `conf/flink-conf.yaml` via the
+`env.java.home` key.
+
+Alternatively, add the following line to your shell profile. If you use the
+*bash* shell (probably the most common shell), the shell profile is located in
+*\~/.bashrc*:
+
+~~~bash
+export JAVA_HOME=/path/to/java_home/
+~~~
+
+If your ssh daemon supports user environments, you can also add `JAVA_HOME` to
+*~/.ssh/environment*. As super user *root* you can enable ssh user
+environments with the following commands:
+
+~~~bash
+echo "PermitUserEnvironment yes" >> /etc/ssh/sshd_config
+/etc/init.d/ssh restart
+~~~
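+
+To check that `JAVA_HOME` is actually visible to remotely started processes,
+you can echo it through a non-interactive ssh session (again with a
+hypothetical worker host name):
+
+~~~bash
+ssh worker1 'echo $JAVA_HOME'
+~~~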
+
+## Hadoop Distributed Filesystem (HDFS) Setup
+
+The Flink system currently uses the Hadoop Distributed Filesystem (HDFS)
+to read and write data in a distributed fashion, although it is also
+possible to use Flink without HDFS or other distributed file systems.
+
+Make sure to have a running HDFS installation. The following instructions are
+just a general overview of some required settings. Please consult one of the
+many installation guides available online for more detailed instructions.
+
+__Note that the following instructions are based on Hadoop 1.2 and might differ 
+for Hadoop 2.__
+
+### Downloading, Installing, and Configuring HDFS
+
+Similar to the Flink system, HDFS runs in a distributed fashion. HDFS
+consists of a **NameNode** which manages the distributed file system's meta
+data. The actual data is stored by one or more **DataNodes**. For the remainder
+of this instruction we assume that HDFS's NameNode component runs on the master
+node while all the worker nodes run an HDFS DataNode.
+
+To start, log on to your master node and download Hadoop (which includes HDFS)
+from the Apache [Hadoop Releases](http://hadoop.apache.org/releases.html) page.
+
+Next, extract the Hadoop archive.
+
+After having extracted the Hadoop archive, change into the Hadoop directory and
+edit the Hadoop environment configuration file:
+
+~~~bash
+cd hadoop-*
+vi conf/hadoop-env.sh
+~~~
+
+Uncomment and modify the following line in the file according to the path of
+your Java installation.
+
+~~~
+export JAVA_HOME=/path/to/java_home/
+~~~
+
+Save the changes and open the HDFS configuration file *conf/hdfs-site.xml*. HDFS
+offers multiple configuration parameters which affect the behavior of the
+distributed file system in various ways. The following excerpt shows a minimal
+configuration which is required to make HDFS work. More information on how to
+configure HDFS can be found in the [HDFS User
+Guide](http://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html).
+
+~~~xml
+<configuration>
+  <property>
+    <name>fs.default.name</name>
+    <value>hdfs://MASTER:50040/</value>
+  </property>
+  <property>
+    <name>dfs.data.dir</name>
+    <value>DATAPATH</value>
+  </property>
+</configuration>
+~~~
+
+Replace *MASTER* with the IP/host name of your master node which runs the
+*NameNode*. *DATAPATH* must be replaced with the path to the directory in which the
+actual HDFS data shall be stored on each worker node. Make sure that the
+*flink* user has sufficient permissions to read and write in that
+directory.
+
+After having saved the HDFS configuration file, open the file *conf/slaves* and
+enter the IP/host name of those worker nodes which shall act as *DataNode*s.
+Each entry must be separated by a line break.
+
+~~~
+<worker 1>
+<worker 2>
+.
+.
+.
+<worker n>
+~~~
+
+Initialize the HDFS by typing in the following command. Note that the
+command will **delete all data** which has been previously stored in the
+HDFS. However, since we have just installed a fresh HDFS, it should be
+safe to answer the confirmation with *yes*.
+
+~~~bash
+bin/hadoop namenode -format
+~~~
+
+Finally, we need to make sure that the Hadoop directory is available to
+all worker nodes which are intended to act as DataNodes and that all nodes
+**find the directory under the same path**. We recommend using a shared network
+directory (e.g. an NFS share) for that. Alternatively, one can copy the
+directory to all nodes (with the disadvantage that all configuration and
+code updates need to be synced to all nodes).
+
+### Starting HDFS
+
+To start HDFS, log on to the master and type in the following
+commands:
+
+~~~bash
+cd hadoop-*
+bin/start-dfs.sh
+~~~
+
+If your HDFS setup is correct, you should be able to open the HDFS
+status website at *http://MASTER:50070*. In a matter of seconds,
+all DataNodes should appear as live nodes. For troubleshooting we would
+like to point you to the [Hadoop Quick
+Start](http://wiki.apache.org/hadoop/QuickStart)
+guide.
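+
+Apart from the status website, you can also check the cluster state from the
+command line; with Hadoop 1.x, `dfsadmin -report` lists all live DataNodes:
+
+~~~bash
+bin/hadoop dfsadmin -report
+~~~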
+
+## Flink Setup
+
+Go to the [downloads page]({{site.baseurl}}/downloads.html) and get the
+ready-to-run package. Make sure to pick the Flink package **matching your
+Hadoop version**.
+
+After downloading the latest release, copy the archive to your master node and
+extract it:
+
+~~~bash
+tar xzf flink-*.tgz
+cd flink-*
+~~~
+
+### Configuring the Cluster
+
+After having extracted the system files, you need to configure Flink for
+the cluster by editing *conf/flink-conf.yaml*.
+
+Set the `jobmanager.rpc.address` key to point to your master node. Furthermore,
+define the maximum amount of main memory the JVM is allowed to allocate on each
+node by setting the `jobmanager.heap.mb` and `taskmanager.heap.mb` keys.
+
+These values are given in MB. If some worker nodes have more main memory that
+you want to allocate to the Flink system, you can override the default value
+by setting the environment variable `FLINK_TM_HEAP` on the respective
+node.
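+
+For illustration, the relevant entries in *conf/flink-conf.yaml* could look as
+follows (the host name and memory sizes are placeholders you need to adapt):
+
+~~~
+jobmanager.rpc.address: 10.0.0.100
+jobmanager.heap.mb: 1024
+taskmanager.heap.mb: 2048
+~~~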
+
+Finally, you must provide a list of all nodes in your cluster that shall be used
+as worker nodes. To do so, similar to the HDFS configuration, edit the file
+*conf/slaves* and enter the IP/host name of each worker node. Each worker node
+will later run a TaskManager.
+
+Each entry must be separated by a new line, as in the following example:
+
+~~~
+192.168.0.100
+192.168.0.101
+.
+.
+.
+192.168.0.150
+~~~
+
+The Flink directory must be available on every worker under the same
+path. As with HDFS, you can use a shared NFS directory, or copy the
+entire Flink directory to every worker node.
+
+Please see the [configuration page](config.html) for details and additional
+configuration options.
+
+In particular, 
+
+ * the amount of available memory per TaskManager (`taskmanager.heap.mb`), 
+ * the number of available CPUs per machine (`taskmanager.numberOfTaskSlots`),
+ * the total number of CPUs in the cluster (`parallelism.default`) and
+ * the temporary directories (`taskmanager.tmp.dirs`)
+
+are very important configuration values.
+
+
+### Starting Flink
+
+The following script starts a JobManager on the local node and connects via
+SSH to all worker nodes listed in the *slaves* file to start a
+TaskManager on each node. Once the script completes, your Flink system is up
+and running. The JobManager running on the local node will accept jobs
+at the configured RPC port.
+
+Assuming that you are on the master node and inside the Flink directory:
+
+~~~bash
+bin/start-cluster.sh
+~~~
+
+To stop Flink, there is also a `stop-cluster.sh` script.
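+
+To see which Flink processes are running on a node, the JDK's `jps` tool is
+handy; assuming a standard setup, the master should list a JobManager process
+and each worker a TaskManager process:
+
+~~~bash
+jps
+~~~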

