flink-commits mailing list archives

From se...@apache.org
Subject flink-web git commit: [faq] Add FAQ entry about YARN container kills due to exceeding allowed memory
Date Thu, 04 Aug 2016 10:01:19 GMT
Repository: flink-web
Updated Branches:
  refs/heads/asf-site f7fd98e87 -> 75ddc959f


[faq] Add FAQ entry about YARN container kills due to exceeding allowed memory

Also improve answer about lost TaskManager.


Project: http://git-wip-us.apache.org/repos/asf/flink-web/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink-web/commit/75ddc959
Tree: http://git-wip-us.apache.org/repos/asf/flink-web/tree/75ddc959
Diff: http://git-wip-us.apache.org/repos/asf/flink-web/diff/75ddc959

Branch: refs/heads/asf-site
Commit: 75ddc959f36a52977ca4f8d6747b5192ba7e775c
Parents: f7fd98e
Author: Stephan Ewen <sewen@apache.org>
Authored: Thu Aug 4 11:59:59 2016 +0200
Committer: Stephan Ewen <sewen@apache.org>
Committed: Thu Aug 4 11:59:59 2016 +0200

----------------------------------------------------------------------
 content/blog/feed.xml         | 16 ++++++++--------
 content/blog/page2/index.html |  2 +-
 content/faq.html              | 24 ++++++++++++++++++++++--
 faq.md                        | 23 +++++++++++++++++++++--
 4 files changed, 52 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink-web/blob/75ddc959/content/blog/feed.xml
----------------------------------------------------------------------
diff --git a/content/blog/feed.xml b/content/blog/feed.xml
index 1cd168c..cfc5424 100644
--- a/content/blog/feed.xml
+++ b/content/blog/feed.xml
@@ -45,7 +45,7 @@
 &lt;img src=&quot;/img/blog/stream-sql/old-table-api.png&quot; style=&quot;width:75%;margin:15px&quot;
/&gt;
 &lt;/center&gt;
 
-&lt;p&gt;A Table is created from a DataSet or DataStream and transformed into a new
Table by applying relational transformations such as &lt;code&gt;filter&lt;/code&gt;,
&lt;code&gt;join&lt;/code&gt;, or &lt;code&gt;select&lt;/code&gt;
on them. Internally, a logical table operator tree is constructed from the applied Table transformations.
When a Table is translated back into a DataSet or DataStream, the respective translator translates
the logical operator tree into DataSet or DataStream operators. Expressions like &lt;code&gt;&#39;location.like(&quot;room%&quot;)&lt;/code&gt;
are compiled into Flink functions via code generation.&lt;/p&gt;
+&lt;p&gt;A Table is created from a DataSet or DataStream and transformed into a new
Table by applying relational transformations such as &lt;code&gt;filter&lt;/code&gt;,
&lt;code&gt;join&lt;/code&gt;, or &lt;code&gt;select&lt;/code&gt;
on them. Internally, a logical table operator tree is constructed from the applied Table transformations.
When a Table is translated back into a DataSet or DataStream, the respective translator translates
the logical operator tree into DataSet or DataStream operators. Expressions like &lt;code&gt;'location.like(&quot;room%&quot;)&lt;/code&gt;
are compiled into Flink functions via code generation.&lt;/p&gt;
 
 &lt;p&gt;However, the original Table API had a few limitations. First of all, it
could not stand alone. Table API queries had to be always embedded into a DataSet or DataStream
program. Queries against batch Tables did not support outer joins, sorting, and many scalar
functions which are commonly used in SQL queries. Queries against streaming tables only supported
filters, union, and projections and no aggregations or joins. Also, the translation process
did not leverage query optimization techniques except for the physical optimization that is
applied to all DataSet programs.&lt;/p&gt;
 
@@ -3678,7 +3678,7 @@ programs.&lt;/p&gt;
 </item>
 
 <item>
-<title>Peeking into Apache Flink&#39;s Engine Room</title>
+<title>Peeking into Apache Flink's Engine Room</title>
 <description>&lt;h3 id=&quot;join-processing-in-apache-flink&quot;&gt;Join
Processing in Apache Flink&lt;/h3&gt;
 
 &lt;p&gt;Joins are prevalent operations in many data processing applications. Most
data processing systems feature APIs that make joining data sets very easy. However, the internal
algorithms for join processing are much more involved – especially if large data sets need
to be efficiently handled. Therefore, join processing serves as a good example to discuss
the salient design points and implementation details of a data processing system.&lt;/p&gt;
@@ -4189,12 +4189,12 @@ INFO    Socket Stream(1/1) switched to DEPLOYING
 INFO    Custom Source(1/1) switched to SCHEDULED 
 INFO    Custom Source(1/1) switched to DEPLOYING
 …
-1&amp;gt; StockPrice{symbol=&#39;SPX&#39;, count=1011.3405732645239}
-2&amp;gt; StockPrice{symbol=&#39;SPX&#39;, count=1018.3381290039248}
-1&amp;gt; StockPrice{symbol=&#39;DJI&#39;, count=1036.7454894073978}
-3&amp;gt; StockPrice{symbol=&#39;DJI&#39;, count=1135.1170217478427}
-3&amp;gt; StockPrice{symbol=&#39;BUX&#39;, count=1053.667523187687}
-4&amp;gt; StockPrice{symbol=&#39;BUX&#39;, count=1036.552601487263}
+1&amp;gt; StockPrice{symbol='SPX', count=1011.3405732645239}
+2&amp;gt; StockPrice{symbol='SPX', count=1018.3381290039248}
+1&amp;gt; StockPrice{symbol='DJI', count=1036.7454894073978}
+3&amp;gt; StockPrice{symbol='DJI', count=1135.1170217478427}
+3&amp;gt; StockPrice{symbol='BUX', count=1053.667523187687}
+4&amp;gt; StockPrice{symbol='BUX', count=1036.552601487263}
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
 
 &lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;

http://git-wip-us.apache.org/repos/asf/flink-web/blob/75ddc959/content/blog/page2/index.html
----------------------------------------------------------------------
diff --git a/content/blog/page2/index.html b/content/blog/page2/index.html
index 685e3f9..41e38a3 100644
--- a/content/blog/page2/index.html
+++ b/content/blog/page2/index.html
@@ -244,7 +244,7 @@ Apache Flink started.</p>
       <h2 class="blog-title"><a href="/news/2015/08/24/introducing-flink-gelly.html">Introducing
Gelly: Graph Processing with Apache Flink</a></h2>
       <p>24 Aug 2015</p>
 
-      <p><p>This blog post introduces <strong>Gelly</strong>, Apache
Flink’s <em>graph-processing API and library</em>. Flink’s native support
+      <p><p>This blog post introduces <strong>Gelly</strong>, Apache
Flink&#8217;s <em>graph-processing API and library</em>. Flink&#8217;s
native support
 for iterations makes it a suitable platform for large-scale graph analytics.
 By leveraging delta iterations, Gelly is able to map various graph processing models such
as
 vertex-centric or gather-sum-apply to Flink dataflows.</p>

http://git-wip-us.apache.org/repos/asf/flink-web/blob/75ddc959/content/faq.html
----------------------------------------------------------------------
diff --git a/content/faq.html b/content/faq.html
index 30f34d7..5d1279f 100644
--- a/content/faq.html
+++ b/content/faq.html
@@ -210,6 +210,7 @@ under the License.
   </li>
   <li><a href="#yarn-deployment" id="markdown-toc-yarn-deployment">YARN Deployment</a>
   <ul>
       <li><a href="#the-yarn-session-runs-only-for-a-few-seconds" id="markdown-toc-the-yarn-session-runs-only-for-a-few-seconds">The
YARN session runs only for a few seconds</a></li>
+      <li><a href="#my-yarn-containers-are-killed-because-they-use-too-much-memory"
id="markdown-toc-my-yarn-containers-are-killed-because-they-use-too-much-memory">My YARN
containers are killed because they use too much memory</a></li>
       <li><a href="#the-yarn-session-crashes-with-a-hdfs-permission-exception-during-startup"
id="markdown-toc-the-yarn-session-crashes-with-a-hdfs-permission-exception-during-startup">The
YARN session crashes with a HDFS permission exception during startup</a></li>
       <li><a href="#my-job-is-not-reacting-to-a-job-cancellation" id="markdown-toc-my-job-is-not-reacting-to-a-job-cancellation">My
job is not reacting to a job cancellation?</a></li>
     </ul>
@@ -502,8 +503,10 @@ inefficient and disk space consuming if used for large input data.</p>
 
 <h3 id="the-slot-allocated-for-my-task-manager-has-been-released-what-should-i-do">The
slot allocated for my task manager has been released. What should I do?</h3>
 
-<p>A <code>java.lang.Exception: The slot in which the task was executed has been
released. Probably loss of TaskManager</code> usually occurs when there are big garbage
collection stalls.
-In this case, a quick fix would be to use the G1 garbage collector. It works incrementally
and it often leads to lower pauses. Furthermore, you can dedicate more memory to the user
code (e.g. 0.4 per system and 0.6 per user).</p>
+<p>If you see a <code>java.lang.Exception: The slot in which the task was executed has been
released. Probably loss of TaskManager</code> even though the TaskManager did not actually crash, it
+means that the TaskManager was unresponsive for a time. That can be due to network issues,
but is frequently caused by long garbage collection stalls.
+In this case, a quick fix is to use an incremental garbage collector, such as the G1 garbage
collector, which usually leads to shorter pauses. Furthermore, you can dedicate more memory to
+the user code by reducing the amount of memory Flink grabs for its internal operations (see
the configuration of TaskManager managed memory).</p>
 
 <p>If both of these approaches fail and the error persists, simply increase the TaskManager’s
heartbeat pause by setting AKKA_WATCH_HEARTBEAT_PAUSE (akka.watch.heartbeat.pause) to a greater
value (e.g. 600s).
 This will cause the JobManager to wait for a heartbeat for a longer time interval before
considering the TaskManager lost.</p>
@@ -551,6 +554,23 @@ this happened. You see messages from Linux’ <a href="http://linux-mm.org/OOM_K
   </li>
 </ul>
 
+<h3 id="my-yarn-containers-are-killed-because-they-use-too-much-memory">My YARN containers
are killed because they use too much memory</h3>
+
+<p>This is usually indicated by a log message like the following one:</p>
+
+<div class="highlight"><pre><code>Container container_e05_1467433388200_0136_01_000002
is completed with diagnostics: Container [pid=5832,containerID=container_e05_1467433388200_0136_01_000002]
is running beyond physical memory limits. Current usage: 2.3 GB of 2 GB physical memory used;
6.1 GB of 4.2 GB virtual memory used. Killing container.
+</code></pre></div>
+
+<p>In that case, the JVM process grew too large. Because the Java heap size is always
limited, the extra memory typically comes from non-heap sources:</p>
+
+<ul>
+  <li>Libraries that use off-heap memory. (Flink’s own off-heap memory is limited
and taken into account when calculating the allowed heap size.)</li>
+  <li>PermGen space (strings and classes), code caches, memory mapped jar files</li>
+  <li>Native libraries (RocksDB)</li>
+</ul>
+
+<p>You can activate the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#memory-and-performance-debugging">memory
debug logger</a> to get more insight into what memory pool is actually using up too
much memory.</p>
+
 <h3 id="the-yarn-session-crashes-with-a-hdfs-permission-exception-during-startup">The
YARN session crashes with a HDFS permission exception during startup</h3>
 
 <p>While starting the YARN session, you are receiving an exception like this:</p>

http://git-wip-us.apache.org/repos/asf/flink-web/blob/75ddc959/faq.md
----------------------------------------------------------------------
diff --git a/faq.md b/faq.md
index 238e51d..ec39254 100644
--- a/faq.md
+++ b/faq.md
@@ -292,8 +292,10 @@ inefficient and disk space consuming if used for large input data.
 
 ### The slot allocated for my task manager has been released. What should I do?
 
-A `java.lang.Exception: The slot in which the task was executed has been released. Probably
loss of TaskManager` usually occurs when there are big garbage collection stalls.
-In this case, a quick fix would be to use the G1 garbage collector. It works incrementally
and it often leads to lower pauses. Furthermore, you can dedicate more memory to the user
code (e.g. 0.4 per system and 0.6 per user).
+If you see a `java.lang.Exception: The slot in which the task was executed has been released.
Probably loss of TaskManager` even though the TaskManager did not actually crash, it
+means that the TaskManager was unresponsive for a time. That can be due to network issues,
but is frequently caused by long garbage collection stalls.
+In this case, a quick fix is to use an incremental garbage collector, such as the G1 garbage
collector, which usually leads to shorter pauses. Furthermore, you can dedicate more memory to
+the user code by reducing the amount of memory Flink grabs for its internal operations (see
the configuration of TaskManager managed memory).
 
 If both of these approaches fail and the error persists, simply increase the TaskManager's
heartbeat pause by setting AKKA_WATCH_HEARTBEAT_PAUSE (akka.watch.heartbeat.pause) to a greater
value (e.g. 600s).
 This will cause the JobManager to wait for a heartbeat for a longer time interval before
considering the TaskManager lost.
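
For reference, the GC and heartbeat advice in the answer above maps to a handful of configuration
entries. The following flink-conf.yaml sketch is illustrative only and not part of this patch:
`akka.watch.heartbeat.pause` is named in the answer itself, while `env.java.opts` and
`taskmanager.memory.fraction` are the key names from the Flink 1.0/1.1 configuration docs and
should be checked against the version in use.

~~~
# Pass the G1 collector flag to the TaskManager and JobManager JVMs
env.java.opts: -XX:+UseG1GC

# Let the JobManager tolerate longer gaps between heartbeats before
# it declares a TaskManager lost
akka.watch.heartbeat.pause: 600s

# Reserve a smaller fraction of memory for Flink's managed (internal) memory,
# leaving more heap for user code
taskmanager.memory.fraction: 0.6
~~~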
@@ -336,6 +338,23 @@ YARN configuration is wrong and more memory than physically available
is
 configured. Execute `dmesg` on the machine where the AM was running to see if
 this happened. You see messages from Linux' [OOM killer](http://linux-mm.org/OOM_Killer).
 
+### My YARN containers are killed because they use too much memory
+
+This is usually indicated by a log message like the following one:
+
+~~~
+Container container_e05_1467433388200_0136_01_000002 is completed with diagnostics: Container
[pid=5832,containerID=container_e05_1467433388200_0136_01_000002] is running beyond physical
memory limits. Current usage: 2.3 GB of 2 GB physical memory used; 6.1 GB of 4.2 GB virtual
memory used. Killing container.
+~~~
+
+In that case, the JVM process grew too large. Because the Java heap size is always limited,
the extra memory typically comes from non-heap sources:
+
+  - Libraries that use off-heap memory. (Flink's own off-heap memory is limited and taken
into account when calculating the allowed heap size.)
+  - PermGen space (strings and classes), code caches, memory mapped jar files
+  - Native libraries (RocksDB)
+
+You can activate the [memory debug logger](https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#memory-and-performance-debugging)
to get more insight into what memory pool is actually using up too much memory.
+
+
 ### The YARN session crashes with a HDFS permission exception during startup
 
 While starting the YARN session, you are receiving an exception like this:
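
To follow up on the new YARN entry: the memory debug logger mentioned there is enabled through
configuration. A minimal flink-conf.yaml sketch follows; the key names are taken from the Flink 1.0
configuration page linked in the answer (the YARN cutoff key was renamed in later releases), so
verify them against the version in use.

~~~
# Periodically log heap, non-heap, and direct memory usage of the TaskManager JVM
taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logIntervalMs: 5000

# On YARN, leave a larger safety margin between the requested container size and
# the JVM heap, so that non-heap memory (PermGen, native libraries, ...) fits
yarn.heap-cutoff-ratio: 0.3
~~~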

