beam-commits mailing list archives

Subject [3/4] beam-site git commit: Regenerate website
Date Fri, 21 Apr 2017 18:11:46 GMT
Regenerate website


Branch: refs/heads/asf-site
Commit: 876a895b9d4a49afbfdfe47d459bf164c8156deb
Parents: 0da810c
Author: Davor Bonaci <>
Authored: Fri Apr 21 11:11:22 2017 -0700
Committer: Davor Bonaci <>
Committed: Fri Apr 21 11:11:22 2017 -0700

 content/documentation/runners/apex/index.html | 58 +++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)
diff --git a/content/documentation/runners/apex/index.html b/content/documentation/runners/apex/index.html
index 875f3e0..c91f8d6 100644
--- a/content/documentation/runners/apex/index.html
+++ b/content/documentation/runners/apex/index.html
@@ -153,7 +153,63 @@
       <div class="row">
         <h1 id="using-the-apache-apex-runner">Using the Apache Apex Runner</h1>
-<p>This page is under construction (<a href="">BEAM-825</a>).</p>
+<p>The Apex Runner executes Apache Beam pipelines using <a href="">Apache
Apex</a> as an underlying engine. The runner has broad support for the <a href="/documentation/runners/capability-matrix/">Beam
model and supports streaming and batch pipelines</a>.</p>
+<p><a href="">Apache Apex</a> is a stream processing
platform and framework for low-latency, high-throughput and fault-tolerant analytics applications
on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time
and batch processing.</p>
+<h2 id="apex-runner-prerequisites">Apex Runner prerequisites</h2>
+<p>You may set up your own Hadoop cluster. Beam does not require anything extra to
launch the pipelines on YARN.
+An optional Apex installation may be useful for monitoring and troubleshooting.
+The Apex CLI can be <a href="">built</a> or
+obtained as a <a href="">binary</a>.
+For more download options see <a href="">distribution
information on the Apache Apex website</a>.</p>
+<h2 id="running-wordcount-using-apex-runner">Running wordcount using Apex Runner</h2>
+<p>Put data for processing into HDFS:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>hdfs dfs -mkdir -p /tmp/input/
+hdfs dfs -put pom.xml /tmp/input/
+</code></pre>
+</div>
+<p>The output directory should not exist on HDFS:</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>hdfs dfs -rm -r -f /tmp/output/
+</code></pre>
+</div>
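
The two HDFS steps above (staging the input file, then clearing any previous output) can be wrapped in one small function. This is a minimal sketch and not part of the commit: the function name `prepare_wordcount_dirs` and the `HDFS_CMD` override are hypothetical, and only the `hdfs dfs` invocations come from the page itself.

```shell
#!/usr/bin/env bash

# Hypothetical helper: stage one input file in HDFS and remove a stale
# output directory, using the same hdfs commands shown on the page.
# HDFS_CMD is an assumed override so the function can be dry-run
# (for example with HDFS_CMD=echo) on a machine without Hadoop.
prepare_wordcount_dirs() {
  local input_file="$1" input_dir="$2" output_dir="$3"
  local hdfs_cmd="${HDFS_CMD:-hdfs}"

  # Create the input directory (no error if it already exists).
  "$hdfs_cmd" dfs -mkdir -p "$input_dir" || return 1
  # Upload the local file into the input directory.
  "$hdfs_cmd" dfs -put "$input_file" "$input_dir" || return 1
  # Remove the output directory; -f keeps this quiet when it is absent.
  "$hdfs_cmd" dfs -rm -r -f "$output_dir"
}
```

With the paths used on the page, the call would be `prepare_wordcount_dirs pom.xml /tmp/input/ /tmp/output/`.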
+<p>Run the wordcount example (<em>example project needs to be modified to include
HDFS file provider</em>)</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/pom.xml --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false"
+</code></pre>
+</div>
+<p>The application will run asynchronously. Check status with <code class="highlighter-rouge">yarn
application -list -appStates ALL</code></p>
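
Because the application runs asynchronously, it can help to pull a single application's state out of the `yarn application -list` output. The following is a sketch under assumptions: it expects the tab-separated layout that the YARN CLI typically prints (Application-Id, Application-Name, Application-Type, User, Queue, State, ...), which may differ between Hadoop versions. The function name `app_state` and the application name `WordCount` are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `yarn application -list` output on stdin and
# print the State column for the application with the given name.
# Assumes tab-separated columns with State in the sixth position.
app_state() {
  local app_name="$1"
  awk -F'\t' -v name="$app_name" '
    # Strip the space padding the CLI puts around each column.
    { for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i) }
    # Data rows start with an application id; header rows do not.
    $1 ~ /^application_/ && $2 == name { print $6 }
  '
}
```

A possible invocation is `yarn application -list -appStates ALL | app_state WordCount`; the name under which the runner registers the application should be checked in the unfiltered listing first.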
+<p>The configuration file is optional; it can be used to influence how Apex operators
are deployed into YARN containers.
+The following example reduces the number of required containers by collocating the operators
into the same container
+and lowers the heap memory per operator, which suits execution in a single-node Hadoop sandbox.</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>dt.application.*.operator.*.attr.MEMORY_MB=64*.prop.locality=CONTAINER_LOCAL
+</code></pre>
+</div>
+<h2 id="checking-output">Checking output</h2>
+<p>Check the output of the pipeline in the HDFS location.</p>
+<div class="highlighter-rouge"><pre class="highlight"><code>hdfs dfs -ls
+</code></pre>
+</div>
+<h2 id="montoring-progress-of-your-job">Monitoring progress of your job</h2>
+<p>Depending on your installation, you may be able to monitor the progress of your
job on the Hadoop cluster. Alternatively, you have the following options:</p>
+<ul>
+  <li>YARN: Use the YARN web UI, which generally runs on port 8088 on the node running the resource
manager.</li>
+  <li>Apex command-line interface: <a href="">Using
the Apex CLI to get running application information</a>.</li>
+</ul>
