flink-commits mailing list archives

From u..@apache.org
Subject [24/30] flink git commit: [docs] Change doc layout
Date Wed, 22 Apr 2015 14:17:21 GMT
http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/cli.md
----------------------------------------------------------------------
diff --git a/docs/cli.md b/docs/cli.md
deleted file mode 100644
index 7e80407..0000000
--- a/docs/cli.md
+++ /dev/null
@@ -1,190 +0,0 @@
----
-title:  "Command-Line Interface"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-
-Flink provides a command-line interface to run programs that are packaged
-as JAR files, and control their execution.  The command line interface is part
-of any Flink setup, available in local single node setups and in
-distributed setups. It is located under `<flink-home>/bin/flink`
-and connects by default to the running Flink master (JobManager) that was
-started from the same installation directory.
-
-A prerequisite to using the command line interface is that the Flink
-master (JobManager) has been started (via `<flink-home>/bin/start-local.sh`
-or `<flink-home>/bin/start-cluster.sh`) or that a YARN environment is
-available.
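-
-For example, on a single machine you could start a local master and check that the
-command line client can reach it. This is only a minimal sketch; both commands are
-described in more detail below and in the setup guides:
-
-~~~bash
-# start a local Flink master (JobManager) from the installation directory
-./bin/start-local.sh
-
-# verify that the command line client can talk to the JobManager
-./bin/flink list -r
-~~~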
-
-The command line can be used to
-
-- submit jobs for execution,
-- cancel a running job,
-- provide information about a job, and
-- list running and waiting jobs.
-
-## Examples
-
--   Run example program with no arguments.
-
-        ./bin/flink run ./examples/flink-java-examples-{{ site.FLINK_VERSION_SHORT }}-WordCount.jar
-
--   Run example program with arguments for input and result files
-
-        ./bin/flink run ./examples/flink-java-examples-{{ site.FLINK_VERSION_SHORT }}-WordCount.jar \
-                               file:///home/user/hamlet.txt file:///home/user/wordcount_out
-
--   Run example program with parallelism 16 and arguments for input and result files
-
-        ./bin/flink run -p 16 ./examples/flink-java-examples-{{ site.FLINK_VERSION_SHORT }}-WordCount.jar \
-                                file:///home/user/hamlet.txt file:///home/user/wordcount_out
-
--   Run example program on a specific JobManager:
-
-        ./bin/flink run -m myJMHost:6123 \
-                               ./examples/flink-java-examples-{{ site.FLINK_VERSION_SHORT }}-WordCount.jar \
-                               file:///home/user/hamlet.txt file:///home/user/wordcount_out
-
--   Run example program with a specific class as an entry point:
-
-        ./bin/flink run -c org.apache.flink.examples.java.wordcount.WordCount \
-                               ./examples/flink-java-examples-{{ site.FLINK_VERSION_SHORT }}-WordCount.jar \
-                               file:///home/user/hamlet.txt file:///home/user/wordcount_out
-
--   Run example program using a [per-job YARN cluster](yarn_setup.html#run-a-single-flink-job-on-hadoop-yarn) with 2 TaskManagers:
-
-        ./bin/flink run -m yarn-cluster -yn 2 \
-                               ./examples/flink-java-examples-{{ site.FLINK_VERSION_STABLE }}-WordCount.jar \
-                               hdfs:///user/hamlet.txt hdfs:///user/wordcount_out
-
--   Display the optimized execution plan for the WordCount example program as JSON:
-
-        ./bin/flink info ./examples/flink-java-examples-{{ site.FLINK_VERSION_SHORT }}-WordCount.jar \
-                                file:///home/user/hamlet.txt file:///home/user/wordcount_out
-
--   List scheduled and running jobs (including their JobIDs):
-
-        ./bin/flink list
-
--   List scheduled jobs (including their JobIDs):
-
-        ./bin/flink list -s
-
--   List running jobs (including their JobIDs):
-
-        ./bin/flink list -r
-
--   Cancel a job:
-
-        ./bin/flink cancel <jobID>
-
-## Usage
-
-The command line syntax is as follows:
-
-~~~
-./flink <ACTION> [OPTIONS] [ARGUMENTS]
-
-The following actions are available:
-
-Action "run" compiles and runs a program.
-
-  Syntax: run [OPTIONS] <jar-file> <arguments>
-  "run" action options:
-     -c,--class <classname>           Class with the program entry point ("main"
-                                      method or "getPlan()" method. Only needed
-                                      if the JAR file does not specify the class
-                                      in its manifest.
-     -m,--jobmanager <host:port>      Address of the JobManager (master) to
-                                      which to connect. Specify 'yarn-cluster'
-                                      as the JobManager to deploy a YARN cluster
-                                      for the job. Use this flag to connect to a
-                                      different JobManager than the one
-                                      specified in the configuration.
-     -p,--parallelism <parallelism>   The parallelism with which to run the
-                                      program. Optional flag to override the
-                                      default value specified in the
-                                      configuration.
-  Additional arguments if -m yarn-cluster is set:
-     -yD <arg>                            Dynamic properties
-     -yd,--yarndetached                   Start detached
-     -yj,--yarnjar <arg>                  Path to Flink jar file
-     -yjm,--yarnjobManagerMemory <arg>    Memory for JobManager Container [in
-                                          MB]
-     -yn,--yarncontainer <arg>            Number of YARN container to allocate
-                                          (=Number of Task Managers)
-     -yq,--yarnquery                      Display available YARN resources
-                                          (memory, cores)
-     -yqu,--yarnqueue <arg>               Specify YARN queue.
-     -ys,--yarnslots <arg>                Number of slots per TaskManager
-     -yt,--yarnship <arg>                 Ship files in the specified directory
-                                          (t for transfer)
-     -ytm,--yarntaskManagerMemory <arg>   Memory per TaskManager Container [in
-                                          MB]
-
-
-Action "info" shows the optimized execution plan of the program (JSON).
-
-  Syntax: info [OPTIONS] <jar-file> <arguments>
-  "info" action options:
-     -c,--class <classname>           Class with the program entry point ("main"
-                                      method or "getPlan()" method. Only needed
-                                      if the JAR file does not specify the class
-                                      in its manifest.
-     -m,--jobmanager <host:port>      Address of the JobManager (master) to
-                                      which to connect. Specify 'yarn-cluster'
-                                      as the JobManager to deploy a YARN cluster
-                                      for the job. Use this flag to connect to a
-                                      different JobManager than the one
-                                      specified in the configuration.
-     -p,--parallelism <parallelism>   The parallelism with which to run the
-                                      program. Optional flag to override the
-                                      default value specified in the
-                                      configuration.
-
-
-Action "list" lists running and scheduled programs.
-
-  Syntax: list [OPTIONS]
-  "list" action options:
-     -m,--jobmanager <host:port>   Address of the JobManager (master) to which
-                                   to connect. Specify 'yarn-cluster' as the
-                                   JobManager to deploy a YARN cluster for the
-                                   job. Use this flag to connect to a different
-                                   JobManager than the one specified in the
-                                   configuration.
-     -r,--running                  Show only running programs and their JobIDs
-     -s,--scheduled                Show only scheduled programs and their JobIDs
-
-
-Action "cancel" cancels a running program.
-
-  Syntax: cancel [OPTIONS] <Job ID>
-  "cancel" action options:
-     -m,--jobmanager <host:port>   Address of the JobManager (master) to which
-                                   to connect. Specify 'yarn-cluster' as the
-                                   JobManager to deploy a YARN cluster for the
-                                   job. Use this flag to connect to a different
-                                   JobManager than the one specified in the
-                                   configuration.
-~~~
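-
-As an illustration of combining the YARN-specific options above with the `run` action,
-a submission could look like the following sketch. The container count, memory sizes,
-and slot count are placeholders rather than recommendations, and the JAR path is
-hypothetical:
-
-~~~bash
-# deploy a per-job YARN cluster with example container sizes
-./bin/flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 -ys 8 \
-                /path/to/your-job.jar <job arguments>
-~~~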

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/cluster_execution.md
----------------------------------------------------------------------
diff --git a/docs/cluster_execution.md b/docs/cluster_execution.md
deleted file mode 100644
index 34879dd..0000000
--- a/docs/cluster_execution.md
+++ /dev/null
@@ -1,146 +0,0 @@
----
-title:  "Cluster Execution"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-Flink programs can run distributed on clusters of many machines. There
-are two ways to send a program to a cluster for execution:
-
-## Command Line Interface
-
-The command line interface lets you submit packaged programs (JARs) to a cluster
-(or single machine setup).
-
-Please refer to the [Command Line Interface](cli.html) documentation for
-details.
-
-## Remote Environment
-
-The remote environment lets you execute Flink Java programs on a cluster
-directly. The remote environment points to the cluster on which you want to
-execute the program.
-
-### Maven Dependency
-
-If you are developing your program as a Maven project, you have to add the
-`flink-clients` module using this dependency:
-
-~~~xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-clients</artifactId>
-  <version>{{ site.FLINK_VERSION_SHORT }}</version>
-</dependency>
-~~~
-
-### Example
-
-The following illustrates the use of the `RemoteEnvironment`:
-
-~~~java
-public static void main(String[] args) throws Exception {
-    ExecutionEnvironment env = ExecutionEnvironment
-        .createRemoteEnvironment("strato-master", 7661, "/home/user/udfs.jar");
-
-    DataSet<String> data = env.readTextFile("hdfs://path/to/file");
-
-    data
-        .filter(new FilterFunction<String>() {
-            public boolean filter(String value) {
-                return value.startsWith("http://");
-            }
-        })
-        .writeAsText("hdfs://path/to/result");
-
-    env.execute();
-}
-~~~
-
-Note that the program contains custom user code and hence requires a JAR file with
-the classes of the code attached. The constructor of the remote environment
-takes the path(s) to the JAR file(s).
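-
-If the program is built as a Maven project (using the dependency above), a sketch of
-producing such a JAR could look like this; the exact JAR name under `target/` depends
-on your project configuration:
-
-~~~bash
-# package the program classes (including the user-defined functions) into a JAR
-mvn clean package
-
-# the resulting JAR is what you pass to createRemoteEnvironment(...)
-ls target/*.jar
-~~~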
-
-## Remote Executor
-
-Similar to the RemoteEnvironment, the RemoteExecutor lets you execute
-Flink programs on a cluster directly. The remote executor accepts a
-*Plan* object, which describes the program as a single executable unit.
-
-### Maven Dependency
-
-If you are developing your program in a Maven project, you have to add the
-`flink-clients` module using this dependency:
-
-~~~xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-clients</artifactId>
-  <version>{{ site.FLINK_VERSION_SHORT }}</version>
-</dependency>
-~~~
-
-### Example
-
-The following illustrates the use of the `RemoteExecutor` with the Scala API:
-
-~~~scala
-def main(args: Array[String]) {
-    val input = TextFile("hdfs://path/to/file")
-
-    val words = input flatMap { _.toLowerCase().split("""\W+""") filter { _ != "" } }
-    val counts = words groupBy { x => x } count()
-
-    val output = counts.write("hdfs://path/to/result", CsvOutputFormat())
-
-    val plan = new ScalaPlan(Seq(output), "Word Count")
-    val executor = new RemoteExecutor("strato-master", 7881, "/path/to/jarfile.jar")
-    executor.executePlan(plan)
-}
-~~~
-
-The following illustrates the use of the `RemoteExecutor` with the Java API (as
-an alternative to the RemoteEnvironment):
-
-~~~java
-public static void main(String[] args) throws Exception {
-    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
-
-    DataSet<String> data = env.readTextFile("hdfs://path/to/file");
-
-    data
-        .filter(new FilterFunction<String>() {
-            public boolean filter(String value) {
-                return value.startsWith("http://");
-            }
-        })
-        .writeAsText("hdfs://path/to/result");
-
-    Plan p = env.createProgramPlan();
-    RemoteExecutor e = new RemoteExecutor("strato-master", 7881, "/path/to/jarfile.jar");
-    e.executePlan(p);
-}
-~~~
-
-Note that the program contains custom UDFs and hence requires a JAR file with
-the classes of the code attached. The constructor of the remote executor takes
-the path(s) to the JAR file(s).

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/cluster_setup.md
----------------------------------------------------------------------
diff --git a/docs/cluster_setup.md b/docs/cluster_setup.md
deleted file mode 100644
index 2ee9f3c..0000000
--- a/docs/cluster_setup.md
+++ /dev/null
@@ -1,346 +0,0 @@
----
-title:  "Cluster Setup"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-This documentation is intended to provide instructions on how to run
-Flink in a fully distributed fashion on a static (but possibly
-heterogeneous) cluster.
-
-This involves two steps: first, installing and configuring Flink, and
-second, installing and configuring the [Hadoop Distributed
-Filesystem](http://hadoop.apache.org/) (HDFS).
-
-## Preparing the Cluster
-
-### Software Requirements
-
-Flink runs on all *UNIX-like environments*, e.g. **Linux**, **Mac OS X**,
-and **Cygwin** (for Windows) and expects the cluster to consist of **one master
-node** and **one or more worker nodes**. Before you start to set up the system,
-make sure you have the following software installed **on each node**:
-
-- **Java 1.6.x** or higher,
-- **ssh** (sshd must be running to use the Flink scripts that manage
-  remote components)
-
-If your cluster does not fulfill these software requirements, you will need to
-install or upgrade the missing software.
-
-For example, on Ubuntu Linux, type in the following commands to install Java and
-ssh:
-
-~~~bash
-sudo apt-get install ssh 
-sudo apt-get install openjdk-7-jre
-~~~
-
-You can check the correct installation of Java by issuing the following command:
-
-~~~bash
-java -version
-~~~
-
-The command should output something comparable to the following on every node of
-your cluster (depending on your Java version, there may be small differences):
-
-~~~bash
-java version "1.6.0_22"
-Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
-Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
-~~~
-
-To make sure the ssh daemon is running properly, you can use the command
-
-~~~bash
-ps aux | grep sshd
-~~~
-
-Something comparable to the following line should appear in the output
-of the command on every host of your cluster:
-
-~~~bash
-root       894  0.0  0.0  49260   320 ?        Ss   Jan09   0:13 /usr/sbin/sshd
-~~~
-
-### Configuring Remote Access with ssh
-
-In order to start/stop the remote processes, the master node requires access via
-ssh to the worker nodes. It is most convenient to use ssh's public key
-authentication for this. To set up public key authentication, log on to the
-master as the user who will later execute all the Flink components. **The
-same user (i.e. a user with the same user name) must also exist on all worker
-nodes**. For the remainder of these instructions we will refer to this user as
-*flink*. Using the super user *root* is highly discouraged for security
-reasons.
-
-Once you have logged in to the master node as the desired user, you must generate a
-new public/private key pair. The following command will create a new
-public/private key pair in the *.ssh* directory inside the home directory of
-the user *flink*. See the ssh-keygen man page for more details. Note that
-the private key is not protected by a passphrase.
-
-~~~bash
-ssh-keygen -b 2048 -P '' -f ~/.ssh/id_rsa
-~~~
-
-Next, copy/append the content of the file *.ssh/id_rsa.pub* to your
-authorized_keys file. The content of the authorized_keys file defines which
-public keys are considered trustworthy during the public key authentication
-process. On most systems the appropriate command is
-
-~~~bash
-cat .ssh/id_rsa.pub >> .ssh/authorized_keys
-~~~
-
-On some Linux systems, the authorized keys file may also be expected by the ssh
-daemon under *.ssh/authorized_keys2*. In either case, you should make sure the
-file only contains those public keys which you consider trustworthy for each
-node of your cluster.
-
-Finally, the authorized keys file must be copied to every worker node of your
-cluster. You can do this by repeatedly typing in
-
-~~~bash
-scp .ssh/authorized_keys <worker>:~/.ssh/
-~~~
-
-and replacing *\<worker\>* with the host name of the respective worker node.
-After having finished the copy process, you should be able to log on to each
-worker node from your master node via ssh without a password.
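-
-A quick sanity check is to run a remote command on a worker; it should complete
-without asking for a password. The host name `worker1` is a placeholder for one of
-your worker nodes:
-
-~~~bash
-# should print the worker's host name without prompting for a password
-ssh worker1 hostname
-~~~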
-
-### Setting JAVA_HOME on each Node
-
-Flink requires the `JAVA_HOME` environment variable to be set on the
-master and all worker nodes and to point to the directory of your Java
-installation.
-
-You can set this variable in `conf/flink-conf.yaml` via the
-`env.java.home` key.
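-
-For example, a minimal sketch that appends the key to the configuration; the Java
-path below is a placeholder, so use your actual installation directory:
-
-~~~bash
-echo "env.java.home: /usr/lib/jvm/java-7-openjdk-amd64" >> conf/flink-conf.yaml
-~~~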
-
-Alternatively, add the following line to your shell profile. If you use the
-*bash* shell (probably the most common shell), the shell profile is located in
-*\~/.bashrc*:
-
-~~~bash
-export JAVA_HOME=/path/to/java_home/
-~~~
-
-If your ssh daemon supports user environments, you can also add `JAVA_HOME` to
-*~/.ssh/environment*. As super user *root* you can enable ssh user
-environments with the following commands:
-
-~~~bash
-echo "PermitUserEnvironment yes" >> /etc/ssh/sshd_config
-/etc/init.d/ssh restart
-~~~
-
-## Hadoop Distributed Filesystem (HDFS) Setup
-
-The Flink system currently uses the Hadoop Distributed Filesystem (HDFS)
-to read and write data in a distributed fashion. It is possible to use
-Flink without HDFS or other distributed file systems.
-
-Make sure to have a running HDFS installation. The following instructions are
-just a general overview of some required settings. Please consult one of the
-many installation guides available online for more detailed instructions.
-
-__Note that the following instructions are based on Hadoop 1.2 and might differ 
-for Hadoop 2.__
-
-### Downloading, Installing, and Configuring HDFS
-
-Similar to the Flink system, HDFS runs in a distributed fashion. HDFS
-consists of a **NameNode** which manages the distributed file system's meta
-data. The actual data is stored by one or more **DataNodes**. For the remainder
-of these instructions we assume that the HDFS NameNode component runs on the master
-node while all the worker nodes run an HDFS DataNode.
-
-To start, log on to your master node and download Hadoop (which includes  HDFS)
-from the Apache [Hadoop Releases](http://hadoop.apache.org/releases.html) page.
-
-Next, extract the Hadoop archive.
-
-After having extracted the Hadoop archive, change into the Hadoop directory and
-edit the Hadoop environment configuration file:
-
-~~~bash
-cd hadoop-*
-vi conf/hadoop-env.sh
-~~~
-
-Uncomment and modify the following line in the file according to the path of
-your Java installation.
-
-~~~
-export JAVA_HOME=/path/to/java_home/
-~~~
-
-Save the changes and open the HDFS configuration file *conf/hdfs-site.xml*. HDFS
-offers multiple configuration parameters which affect the behavior of the
-distributed file system in various ways. The following excerpt shows a minimal
-configuration which is required to make HDFS work. More information on how to
-configure HDFS can be found in the [HDFS User
-Guide](http://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html).
-
-~~~xml
-<configuration>
-  <property>
-    <name>fs.default.name</name>
-    <value>hdfs://MASTER:50040/</value>
-  </property>
-  <property>
-    <name>dfs.data.dir</name>
-    <value>DATAPATH</value>
-  </property>
-</configuration>
-~~~
-
-Replace *MASTER* with the IP/host name of your master node which runs the
-*NameNode*. *DATAPATH* must be replaced with the path to the directory in which the
-actual HDFS data shall be stored on each worker node. Make sure that the
-*flink* user has sufficient permissions to read and write in that
-directory.
-
-After having saved the HDFS configuration file, open the file *conf/slaves* and
-enter the IP/host name of those worker nodes which shall act as *DataNode*s.
-Each entry must be separated by a line break.
-
-~~~
-<worker 1>
-<worker 2>
-.
-.
-.
-<worker n>
-~~~
-
-Initialize the HDFS by typing in the following command. Note that the
-command will **delete all data** which has been previously stored in the
-HDFS. However, since we have just installed a fresh HDFS, it should be
-safe to answer the confirmation with *yes*.
-
-~~~bash
-bin/hadoop namenode -format
-~~~
-
-Finally, we need to make sure that the Hadoop directory is available to
-all worker nodes which are intended to act as DataNodes and that all nodes
-**find the directory under the same path**. We recommend using a shared network
-directory (e.g. an NFS share) for that. Alternatively, one can copy the
-directory to all nodes (with the disadvantage that all configuration and
-code updates need to be synced to all nodes).
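-
-If you choose to copy the directory instead, a sketch using rsync might look like the
-following; the worker names and the Hadoop path are placeholders for your own setup:
-
-~~~bash
-# copy the extracted Hadoop directory to each worker under the same path
-for worker in worker1 worker2; do
-    rsync -a ~/hadoop-1.2.1/ ${worker}:~/hadoop-1.2.1/
-done
-~~~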
-
-### Starting HDFS
-
-To start HDFS, log on to the master node and type in the following
-commands:
-
-~~~bash
-cd hadoop-*
-bin/start-dfs.sh
-~~~
-
-If your HDFS setup is correct, you should be able to open the HDFS
-status website at *http://MASTER:50070*. In a matter of seconds,
-all DataNodes should appear as live nodes. For troubleshooting we would
-like to point you to the [Hadoop Quick
-Start](http://wiki.apache.org/hadoop/QuickStart)
-guide.
-
-## Flink Setup
-
-Go to the [downloads page]({{site.baseurl}}/downloads.html) and get the ready-to-run
-package. Make sure to pick the Flink package **matching your Hadoop
-version**.
-
-After downloading the latest release, copy the archive to your master node and
-extract it:
-
-~~~bash
-tar xzf flink-*.tgz
-cd flink-*
-~~~
-
-### Configuring the Cluster
-
-After having extracted the system files, you need to configure Flink for
-the cluster by editing *conf/flink-conf.yaml*.
-
-Set the `jobmanager.rpc.address` key to point to your master node. Furthermore,
-define the maximum amount of main memory the JVM is allowed to allocate on each
-node by setting the `jobmanager.heap.mb` and `taskmanager.heap.mb` keys.
-
-The value is given in MB. If some worker nodes have more main memory that you
-want to allocate to the Flink system, you can override the default value
-by setting the environment variable `FLINK_TM_HEAP` on the respective
-node.
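-
-As a sketch, the corresponding entries might look as follows; the address and the
-memory sizes are placeholders, not recommendations:
-
-~~~bash
-cat >> conf/flink-conf.yaml <<'EOF'
-jobmanager.rpc.address: 10.0.0.1
-jobmanager.heap.mb: 1024
-taskmanager.heap.mb: 2048
-EOF
-
-# on a worker with more memory, override the TaskManager heap for that node
-# (the value is assumed to be in MB, matching taskmanager.heap.mb)
-export FLINK_TM_HEAP=4096
-~~~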
-
-Finally, you must provide a list of all nodes in your cluster that shall be used
-as worker nodes. To do so, similar to the HDFS configuration, edit the file
-*conf/slaves* and enter the IP/host name of each worker node. Each worker node
-will later run a TaskManager.
-
-Each entry must be separated by a new line, as in the following example:
-
-~~~
-192.168.0.100
-192.168.0.101
-.
-.
-.
-192.168.0.150
-~~~
-
-The Flink directory must be available on every worker under the same
-path. As with HDFS, you can use a shared NFS directory, or copy the
-entire Flink directory to every worker node.
-
-Please see the [configuration page](config.html) for details and additional
-configuration options.
-
-In particular, 
-
- * the amount of available memory per TaskManager (`taskmanager.heap.mb`), 
- * the number of available CPUs per machine (`taskmanager.numberOfTaskSlots`),
- * the total number of CPUs in the cluster (`parallelism.default`) and
- * the temporary directories (`taskmanager.tmp.dirs`)
-
-are very important configuration values.
-
-
-### Starting Flink
-
-The following script starts a JobManager on the local node and connects via
-SSH to all worker nodes listed in the *slaves* file to start a
-TaskManager on each node. After this, your Flink system is up and
-running, and the JobManager on the local node accepts jobs
-at the configured RPC port.
-
-Assuming that you are on the master node and inside the Flink directory:
-
-~~~bash
-bin/start-cluster.sh
-~~~
-
-To stop Flink, there is also a `stop-cluster.sh` script.

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/coding_guidelines.md
----------------------------------------------------------------------
diff --git a/docs/coding_guidelines.md b/docs/coding_guidelines.md
deleted file mode 100644
index a126f2b..0000000
--- a/docs/coding_guidelines.md
+++ /dev/null
@@ -1,23 +0,0 @@
----
-title:  "Coding Guidelines"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-The coding guidelines are now located [on the project website](http://flink.apache.org/coding_guidelines.html).

http://git-wip-us.apache.org/repos/asf/flink/blob/f1ee90cc/docs/config.md
----------------------------------------------------------------------
diff --git a/docs/config.md b/docs/config.md
deleted file mode 100644
index 1068152..0000000
--- a/docs/config.md
+++ /dev/null
@@ -1,385 +0,0 @@
----
-title:  "Configuration"
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Overview
-
-The default configuration parameters allow Flink to run out-of-the-box
-in single node setups.
-
-This page lists the most common options that are typically needed to set
-up a well-performing (distributed) installation. In addition, a full
-list of all available configuration parameters is given here.
-
-All configuration is done in `conf/flink-conf.yaml`, which is expected to be
-a flat collection of [YAML key value pairs](http://www.yaml.org/spec/1.2/spec.html)
-with format `key: value`.
-
-The system and run scripts parse the config at startup time. Changes to the configuration
-file require restarting the Flink JobManager and TaskManagers.
-
-The configuration files for the TaskManagers can be different; Flink does not assume
-uniform machines in the cluster.
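-
-To illustrate the `key: value` format, a minimal configuration file could be written
-as follows; the values shown are simply the defaults documented below:
-
-~~~bash
-cat > conf/flink-conf.yaml <<'EOF'
-jobmanager.rpc.address: localhost
-jobmanager.rpc.port: 6123
-jobmanager.heap.mb: 256
-taskmanager.heap.mb: 512
-taskmanager.numberOfTaskSlots: 1
-parallelism.default: 1
-EOF
-~~~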
-
-
-## Common Options
-
-- `env.java.home`: The path to the Java installation to use (DEFAULT: system's
-default Java installation, if found). Needs to be specified if the startup
-scripts fail to automatically resolve the Java home directory. Can be specified
-to point to a specific java installation or version. If this option is not
-specified, the startup scripts also evaluate the `$JAVA_HOME` environment variable.
-
-- `jobmanager.rpc.address`: The IP address of the JobManager, which is the
-master/coordinator of the distributed system (DEFAULT: localhost).
-
-- `jobmanager.rpc.port`: The port number of the JobManager (DEFAULT: 6123).
-
-- `jobmanager.heap.mb`: JVM heap size (in megabytes) for the JobManager. You may have to increase the heap size for the JobManager if you are running
-very large applications (with many operators), or if you are keeping a long history of them.
-
-- `taskmanager.heap.mb`: JVM heap size (in megabytes) for the TaskManagers,
-which are the parallel workers of the system. In
-contrast to Hadoop, Flink runs operators (e.g., join, aggregate) and
-user-defined functions (e.g., Map, Reduce, CoGroup) inside the TaskManager
-(including sorting/hashing/caching), so this value should be as
-large as possible. If the cluster is exclusively running Flink,
-the total amount of available memory per machine minus some memory for the 
-operating system (maybe 1-2 GB) is a good value.
-On YARN setups, this value is automatically configured to the size of 
-the TaskManager's YARN container, minus a certain tolerance value.
-
-- `taskmanager.numberOfTaskSlots`: The number of parallel operator or
-user function instances that a single TaskManager can run (DEFAULT: 1).
-If this value is larger than 1, a single TaskManager takes multiple instances of
-a function or operator. That way, the TaskManager can utilize multiple CPU cores,
-but at the same time, the available memory is divided between the different
-operator or function instances.
-This value is typically proportional to the number of physical CPU cores that
-the TaskManager's machine has (e.g., equal to the number of cores, or half the
-number of cores). [More about task slots](config.html#configuring-taskmanager-processing-slots).
-
-- `parallelism.default`: The default parallelism to use for programs that have
-no parallelism specified. (DEFAULT: 1). For setups that have no concurrent jobs
-running, setting this value to NumTaskManagers * NumSlotsPerTaskManager will
-cause the system to use all available execution resources for the program's
-execution. **Note**: The default parallelism can be overwritten for an entire
-job by calling `setParallelism(int parallelism)` on the `ExecutionEnvironment`
-or by passing `-p <parallelism>` to the Flink Command-line frontend. It can be
-overwritten for single transformations by calling `setParallelism(int
-parallelism)` on an operator. See the [programming
-guide](programming_guide.html#parallel-execution) for more information about the
-parallelism.
-
-- `fs.hdfs.hadoopconf`: The absolute path to the Hadoop File System's (HDFS)
-configuration directory (OPTIONAL VALUE).
-Specifying this value allows programs to reference HDFS files using short URIs
-(`hdfs:///path/to/files`, without including the address and port of the NameNode
-in the file URI). Without this option, HDFS files can be accessed, but require
-fully qualified URIs like `hdfs://address:port/path/to/files`.
-This option also causes file writers to pick up the HDFS's default values for block sizes
-and replication factors. Flink will look for the "core-site.xml" and
-"hdfs-site.xml" files in teh specified directory.
-
-
-## Advanced Options
-
-- `taskmanager.tmp.dirs`: The directory for temporary files, or a list of
-directories separated by the system's directory delimiter (for example ':'
-(colon) on Linux/Unix). If multiple directories are specified, then the temporary
-files will be distributed across the directories in a round-robin fashion. The
-I/O manager component will spawn one reading and one writing thread per
-directory. A directory may be listed multiple times to have the I/O manager use
-multiple threads for it (for example if it is physically stored on a very fast
-disc or RAID) (DEFAULT: The system's tmp dir).
-
-- `jobmanager.web.port`: Port of the JobManager's web interface (DEFAULT: 8081).
-
-- `fs.overwrite-files`: Specifies whether file output writers should overwrite
-existing files by default. Set to *true* to overwrite by default, *false* otherwise.
-(DEFAULT: false)
-
-- `fs.output.always-create-directory`: File writers running with a parallelism
-larger than one create a directory for the output file path and put the different
-result files (one per parallel writer task) into that directory. If this option
-is set to *true*, writers with a parallelism of 1 will also create a directory
-and place a single result file into it. If the option is set to *false*, the
-writer will create the file directly at the output path, without
-creating a containing directory. (DEFAULT: false)
-
-- `taskmanager.network.numberOfBuffers`: The number of buffers available to the
-network stack. This number determines how many streaming data exchange channels
-a TaskManager can have at the same time and how well buffered the channels are.
-If a job is rejected or you get a warning that the system does not have enough buffers
-available, increase this value (DEFAULT: 2048).
-
-- `taskmanager.memory.size`: The amount of memory (in megabytes) that the task
-manager reserves on the JVM's heap space for sorting, hash tables, and caching
-of intermediate results. If unspecified (-1), the memory manager will take a fixed
-ratio of the heap memory available to the JVM, as specified by
-`taskmanager.memory.fraction`. (DEFAULT: -1)
-
-- `taskmanager.memory.fraction`: The relative amount of memory that the task
-manager reserves for sorting, hash tables, and caching of intermediate results.
-For example, a value of 0.8 means that TaskManagers reserve 80% of the
-JVM's heap space for internal data buffers, leaving 20% of the JVM's heap space
-free for objects created by user-defined functions. (DEFAULT: 0.7)
-This parameter is only evaluated if `taskmanager.memory.size` is not set.
-
-
-## Full Reference
-
-### HDFS
-
-These parameters configure the default HDFS used by Flink. Setups that do not
-specify an HDFS configuration have to specify the full path to
-HDFS files (`hdfs://address:port/path/to/files`). Files will also be written
-with default HDFS parameters (block size, replication factor).
-
-- `fs.hdfs.hadoopconf`: The absolute path to the Hadoop configuration directory.
-The system will look for the "core-site.xml" and "hdfs-site.xml" files in that
-directory (DEFAULT: null).
-- `fs.hdfs.hdfsdefault`: The absolute path of Hadoop's own configuration file
-"hdfs-default.xml" (DEFAULT: null).
-- `fs.hdfs.hdfssite`: The absolute path of Hadoop's own configuration file
-"hdfs-site.xml" (DEFAULT: null).
-
-### JobManager &amp; TaskManager
-
-The following parameters configure Flink's JobManager and TaskManagers.
-
-- `jobmanager.rpc.address`: The IP address of the JobManager, which is the
-master/coordinator of the distributed system (DEFAULT: localhost).
-- `jobmanager.rpc.port`: The port number of the JobManager (DEFAULT: 6123).
-- `taskmanager.rpc.port`: The task manager's IPC port (DEFAULT: 6122).
-- `taskmanager.data.port`: The task manager's port used for data exchange
-operations (DEFAULT: 6121).
-- `jobmanager.heap.mb`: JVM heap size (in megabytes) for the JobManager
-(DEFAULT: 256).
-- `taskmanager.heap.mb`: JVM heap size (in megabytes) for the TaskManagers,
-which are the parallel workers of the system. In
-contrast to Hadoop, Flink runs operators (e.g., join, aggregate) and
-user-defined functions (e.g., Map, Reduce, CoGroup) inside the TaskManager
-(including sorting/hashing/caching), so this value should be as
-large as possible (DEFAULT: 512). On YARN setups, this value is automatically
-configured to the size of the TaskManager's YARN container, minus a
-certain tolerance value.
-- `taskmanager.numberOfTaskSlots`: The number of parallel operator or
-user function instances that a single TaskManager can run (DEFAULT: 1).
-If this value is larger than 1, a single TaskManager takes multiple instances of
-a function or operator. That way, the TaskManager can utilize multiple CPU cores,
-but at the same time, the available memory is divided between the different
-operator or function instances.
-This value is typically proportional to the number of physical CPU cores that
-the TaskManager's machine has (e.g., equal to the number of cores, or half the
-number of cores).
-- `taskmanager.tmp.dirs`: The directory for temporary files, or a list of
-directories separated by the system's directory delimiter (for example ':'
-(colon) on Linux/Unix). If multiple directories are specified, then the temporary
-files will be distributed across the directories in a round robin fashion. The
-I/O manager component will spawn one reading and one writing thread per
-directory. A directory may be listed multiple times to have the I/O manager use
-multiple threads for it (for example if it is physically stored on a very fast
-disc or RAID) (DEFAULT: The system's tmp dir).
-- `taskmanager.network.numberOfBuffers`: The number of buffers available to the
-network stack. This number determines how many streaming data exchange channels
-a TaskManager can have at the same time and how well buffered the channels are.
-If a job is rejected or you get a warning that the system does not have enough buffers
-available, increase this value (DEFAULT: 2048).
-- `taskmanager.network.bufferSizeInBytes`: The size of the network buffers, in
-bytes (DEFAULT: 32768 (= 32 KiBytes)).
-- `taskmanager.memory.size`: The amount of memory (in megabytes) that the task
-manager reserves on the JVM's heap space for sorting, hash tables, and caching
-of intermediate results. If unspecified (-1), the memory manager will take a fixed
-ratio of the heap memory available to the JVM, as specified by
-`taskmanager.memory.fraction`. (DEFAULT: -1)
-- `taskmanager.memory.fraction`: The relative amount of memory that the task
-manager reserves for sorting, hash tables, and caching of intermediate results.
-For example, a value of 0.8 means that TaskManagers reserve 80% of the
-JVM's heap space for internal data buffers, leaving 20% of the JVM's heap space
-free for objects created by user-defined functions. (DEFAULT: 0.7)
-This parameter is only evaluated if `taskmanager.memory.size` is not set.
-- `jobclient.polling.interval`: The interval (in seconds) in which the client
-polls the JobManager for the status of its job (DEFAULT: 2).
-- `taskmanager.runtime.max-fan`: The maximal fan-in for external merge joins and
-fan-out for spilling hash tables. Limits the number of file handles per operator,
-but may cause intermediate merging/partitioning, if set too small (DEFAULT: 128).
-- `taskmanager.runtime.sort-spilling-threshold`: A sort operation starts spilling
-when this fraction of its memory budget is full (DEFAULT: 0.8).
-- `taskmanager.heartbeat-interval`: The interval in which the TaskManager sends
-heartbeats to the JobManager.
-- `jobmanager.max-heartbeat-delay-before-failure.msecs`: The maximum time that a
-TaskManager heartbeat may be missing before the TaskManager is considered failed.
-
-### Distributed Coordination (via Akka)
-
-- `akka.ask.timeout`: Timeout used for all futures and blocking Akka calls. If Flink fails due to timeouts then you should try to increase this value. Timeouts can be caused by slow machines or a congested network. The timeout value requires a time-unit specifier (ms/s/min/h/d) (DEFAULT: **100 s**).
-- `akka.lookup.timeout`: Timeout used for the lookup of the JobManager. The timeout value has to contain a time-unit specifier (ms/s/min/h/d) (DEFAULT: **10 s**).
-- `akka.framesize`: Maximum size of messages which are sent between the JobManager and the TaskManagers. If Flink fails because messages exceed this limit, then you should increase it. The message size requires a size-unit specifier (DEFAULT: **10485760b**).
-- `akka.watch.heartbeat.interval`: Heartbeat interval for Akka's DeathWatch mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked dead because of lost or delayed heartbeat messages, then you should increase this value. A thorough description of Akka's DeathWatch can be found [here](http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#failure-detector) (DEFAULT: **akka.ask.timeout/10**).
-- `akka.watch.heartbeat.pause`: Acceptable heartbeat pause for Akka's DeathWatch mechanism. A low value does not allow an irregular heartbeat. A thorough description of Akka's DeathWatch can be found [here](http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#failure-detector) (DEFAULT: **akka.ask.timeout**).
-- `akka.watch.threshold`: Threshold for the DeathWatch failure detector. A low value is prone to false positives whereas a high value increases the time to detect a dead TaskManager. A thorough description of Akka's DeathWatch can be found [here](http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#failure-detector) (DEFAULT: **12**).
-- `akka.transport.heartbeat.interval`: Heartbeat interval for Akka's transport failure detector. Since Flink uses TCP, the detector is not necessary. Therefore, the detector is disabled by setting the interval to a very high value. In case you should need the transport failure detector, set the interval to some reasonable value. The interval value requires a time-unit specifier (ms/s/min/h/d) (DEFAULT: **1000 s**).
-- `akka.transport.heartbeat.pause`: Acceptable heartbeat pause for Akka's transport failure detector. Since Flink uses TCP, the detector is not necessary. Therefore, the detector is disabled by setting the pause to a very high value. In case you should need the transport failure detector, set the pause to some reasonable value. The pause value requires a time-unit specifier (ms/s/min/h/d) (DEFAULT: **6000 s**).
-- `akka.transport.threshold`: Threshold for the transport failure detector. Since Flink uses TCP, the detector is not necessary and, thus, the threshold is set to a high value (DEFAULT: **300**).
-- `akka.tcp.timeout`: Timeout for all outbound connections. If you should experience problems with connecting to a TaskManager due to a slow network, you should increase this value (DEFAULT: **akka.ask.timeout**).
-- `akka.throughput`: Number of messages that are processed in a batch before returning the thread to the pool. Low values denote a fair scheduling whereas high values can increase the performance at the cost of unfairness (DEFAULT: **15**).
-- `akka.log.lifecycle.events`: Turns on Akka's remote logging of events. Set this value to 'on' when debugging (DEFAULT: **off**).
-- `akka.startup-timeout`: Timeout after which the startup of a remote component is considered failed (DEFAULT: **akka.ask.timeout**).
-
-### JobManager Web Frontend
-
-- `jobmanager.web.port`: Port of the JobManager's web interface that displays
-status of running jobs and execution time breakdowns of finished jobs
-(DEFAULT: 8081). Setting this value to `-1` disables the web frontend.
-- `jobmanager.web.history`: The number of latest jobs that the JobManager's web
-front-end retains in its history (DEFAULT: 5).
-
-### Webclient
-
-These parameters configure the web interface that can be used to submit jobs and
-review the compiler's execution plans.
-
-- `webclient.port`: The port of the webclient server (DEFAULT: 8080).
-- `webclient.tempdir`: The temp directory for the web server. Used for example
-for caching file fragments during file-uploads (DEFAULT: The system's temp
-directory).
-- `webclient.uploaddir`: The directory into which the web server will store
-uploaded programs (DEFAULT: ${webclient.tempdir}/webclient-jobs/).
-- `webclient.plandump`: The directory into which the web server will dump
-temporary JSON files describing the execution plans
-(DEFAULT: ${webclient.tempdir}/webclient-plans/).
-
-### File Systems
-
-These parameters define the behavior of tasks that create result files.
-
-- `fs.overwrite-files`: Specifies whether file output writers should overwrite
-existing files by default. Set to *true* to overwrite by default, *false* otherwise.
-(DEFAULT: false)
-- `fs.output.always-create-directory`: File writers running with a parallelism
-larger than one create a directory for the output file path and put the different
-result files (one per parallel writer task) into that directory. If this option
-is set to *true*, writers with a parallelism of 1 will also create a directory
-and place a single result file into it. If the option is set to *false*, the
-writer will create the file directly at the output path, without
-creating a containing directory. (DEFAULT: false)
-
-### Compiler/Optimizer
-
-- `compiler.delimited-informat.max-line-samples`: The maximum number of line
-samples taken by the compiler for delimited inputs. The samples are used to
-estimate the number of records. This value can be overridden for a specific
-input with the input format's parameters (DEFAULT: 10).
-- `compiler.delimited-informat.min-line-samples`: The minimum number of line
-samples taken by the compiler for delimited inputs. The samples are used to
-estimate the number of records. This value can be overridden for a specific
-input with the input format's parameters (DEFAULT: 2).
-- `compiler.delimited-informat.max-sample-len`: The maximal length of a line
-sample that the compiler takes for delimited inputs. If the length of a single
-sample exceeds this value (possible because of misconfiguration of the parser),
-the sampling aborts. This value can be overridden for a specific input with the
-input format's parameters (DEFAULT: 2097152 (= 2 MiBytes)).
-
-## YARN
-
-Please note that all ports used by Flink in a YARN session are offset by the YARN application ID
-to avoid duplicate port allocations when running multiple YARN sessions in parallel.
-
-So if `yarn.am.rpc.port` is configured to `10245` and the session's application ID is `application_1406629969999_0002`, then the actual port being used is 10245 + 2 = 10247.
-
-- `yarn.heap-cutoff-ratio`: Percentage of heap space to remove from containers started by YARN.
-
-
-## Background
-
-### Configuring the Network Buffers
-
-Network buffers are a critical resource for the communication layers. They are
-used to buffer records before transmission over a network, and to buffer
-incoming data before dissecting it into records and handing them to the
-application. A sufficient number of network buffers is critical to achieve a
-good throughput.
-
-In general, configure the task manager to have enough buffers that each logical
-network connection you expect to be open at the same time has a dedicated
-buffer. A logical network connection exists for each point-to-point exchange of
-data over the network, which typically happens at repartitioning or
-broadcasting steps. In those, each parallel task inside the TaskManager has to
-be able to talk to all other parallel tasks. Hence, the required number of
-buffers on a task manager is *total-degree-of-parallelism* (number of targets)
-\* *intra-node-parallelism* (number of sources in one task manager) \* *n*.
-Here, *n* is a constant that defines how many repartitioning-/broadcasting steps
-you expect to be active at the same time.
-
-Since the *intra-node-parallelism* is typically the number of cores, and more
-than 4 repartitioning or broadcasting channels are rarely active in parallel, it
-frequently boils down to `#cores^2 * #machines * 4`. To support, for
-example, a cluster of 20 8-core machines, you should use roughly 5000 network
-buffers for optimal throughput.
-
-Each network buffer has by default a size of 32 KiBytes. In the above example, the
-system would allocate roughly 160 MiBytes for network buffers.
-
-The number and size of network buffers can be configured with the following
-parameters:
-
-- `taskmanager.network.numberOfBuffers`, and
-- `taskmanager.network.bufferSizeInBytes`.
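-
-For instance, the sizing for the 20-machine, 8-core example above can be sketched as
-a small calculation; the constant 4 is the assumed number of concurrently active
-repartitioning/broadcasting steps, and the buffer size is the documented default:
-
-~~~bash
-cores=8; machines=20; n=4
-buffers=$((cores * cores * machines * n))   # 8^2 * 20 * 4 = 5120
-echo "taskmanager.network.numberOfBuffers: ${buffers}" >> conf/flink-conf.yaml
-echo "taskmanager.network.bufferSizeInBytes: 32768" >> conf/flink-conf.yaml
-~~~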
-
-### Configuring Temporary I/O Directories
-
-Although Flink aims to process as much data in main memory as possible,
-it is not uncommon that more data needs to be processed than memory is
-available. Flink's runtime is designed to write temporary data to disk
-to handle these situations.
-
-The `taskmanager.tmp.dirs` parameter specifies a list of directories into which
-Flink writes temporary files. The paths of the directories need to be
-separated by ':' (colon character). Flink will concurrently write (or
-read) one temporary file to (from) each configured directory. This way,
-temporary I/O can be evenly distributed over multiple independent I/O devices
-such as hard disks to improve performance. To leverage fast I/O devices (e.g.,
-SSD, RAID, NAS), it is possible to specify a directory multiple times.
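-
-For example, an entry spreading temporary files over two disks, listing the faster one
-twice to give it two I/O threads, might look like this; the paths are placeholders:
-
-~~~bash
-echo "taskmanager.tmp.dirs: /data1/tmp:/data2/tmp:/data2/tmp" >> conf/flink-conf.yaml
-~~~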
-
-If the `taskmanager.tmp.dirs` parameter is not explicitly specified,
-Flink writes temporary data to the temporary directory of the operating
-system, such as */tmp* in Linux systems.
-
-
-### Configuring TaskManager processing slots
-
-Flink executes a program in parallel by splitting it into subtasks and scheduling these subtasks to processing slots.
-
-Each Flink TaskManager provides processing slots in the cluster. The number of slots
-is typically proportional to the number of available CPU cores __of each__ TaskManager.
-As a general recommendation, the number of available CPU cores is a good default for 
-`taskmanager.numberOfTaskSlots`.
-
-When starting a Flink application, users can supply the default parallelism (the number of slots) to use for that job.
-The corresponding command line option is `-p` (for parallelism). In addition, it is possible
-to [set the parallelism in the programming APIs](programming_guide.html#parallel-execution) for
-the whole application and individual operators.
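-
-As a sketch for an 8-core machine (the values are examples only), one slot per core is
-configured and the job parallelism is chosen at submission time:
-
-~~~bash
-# one processing slot per CPU core on this TaskManager (example value)
-echo "taskmanager.numberOfTaskSlots: 8" >> conf/flink-conf.yaml
-
-# submit a job with an explicit default parallelism via -p
-./bin/flink run -p 16 /path/to/your-job.jar
-~~~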
-
-<img src="img/slots_parallelism.svg" class="img-responsive" />
\ No newline at end of file

