impala-commits mailing list archives

Subject [1/4] incubator-impala git commit: IMPALA-4623: [DOCS] Document file handle caching
Date Fri, 06 Oct 2017 00:03:40 GMT
Repository: incubator-impala
Updated Branches:
  refs/heads/master 44ef4cb10 -> c14a09040

IMPALA-4623: [DOCS] Document file handle caching

Change-Id: I261c29eff80dc376528bba29ffb7d8e0f895e25f
Reviewed-by: Joe McDonnell <>
Tested-by: Impala Public Jenkins


Branch: refs/heads/master
Commit: d8bdea5e8206b47bf5bbcf8bda7b4edd7fec935a
Parents: 44ef4cb
Author: John Russell <>
Authored: Tue Oct 3 15:09:46 2017 -0700
Committer: Impala Public Jenkins <>
Committed: Thu Oct 5 23:49:43 2017 +0000

 docs/impala_keydefs.ditamap         |  1 +
 docs/topics/impala_known_issues.xml | 27 ++++++++++++++
 docs/topics/impala_scalability.xml  | 64 ++++++++++++++++++++++++++++++++
 3 files changed, 92 insertions(+)
diff --git a/docs/impala_keydefs.ditamap b/docs/impala_keydefs.ditamap
index 3068ca4..d92fdc5 100644
--- a/docs/impala_keydefs.ditamap
+++ b/docs/impala_keydefs.ditamap
@@ -10946,6 +10946,7 @@ under the License.
   <keydef href="topics/impala_scalability.xml#big_tables" keys="big_tables"/>
   <keydef href="topics/impala_scalability.xml#kerberos_overhead_cluster_size" keys="kerberos_overhead_cluster_size"/>
   <keydef href="topics/impala_scalability.xml#scalability_hotspots" keys="scalability_hotspots"/>
+  <keydef href="topics/impala_scalability.xml#scalability_file_handle_cache" keys="scalability_file_handle_cache"/>
   <keydef href="topics/impala_partitioning.xml" keys="partitioning"/>
   <keydef href="topics/impala_partitioning.xml#partitioning_choosing" keys="partitioning_choosing"/>
diff --git a/docs/topics/impala_known_issues.xml b/docs/topics/impala_known_issues.xml
index 51920c9..14ff4e3 100644
--- a/docs/topics/impala_known_issues.xml
+++ b/docs/topics/impala_known_issues.xml
@@ -331,6 +331,33 @@ - Don't have
+    <concept id="ki_file_handle_cache">
+      <title>Interaction of File Handle Cache with HDFS Appends and Short-Circuit Reads</title>
+      <conbody>
+        <p>
+          If a data file used by Impala is being continuously appended or overwritten in place
+          by an HDFS mechanism, such as <cmdname>hdfs dfs -appendToFile</cmdname>, interaction
+          with the file handle caching feature in <keyword keyref="impala210_full"/> and higher
+          could cause short-circuit reads to sometimes be disabled on some DataNodes. When a
+          mismatch is detected between the cached file handle and a data block that was
+          rewritten because of an append, short-circuit reads are turned off on the affected
+          host for a 10-minute period.
+        </p>
+        <p>
+          The possibility of encountering such an issue is the reason why the file handle caching
+          feature is currently turned off by default. See <xref keyref="scalability_file_handle_cache"/>
+          for information about this feature and how to enable it.
+        </p>
+        <p><b>Bug:</b> <xref href="" scope="external" format="html">HDFS-12528</xref></p>
+        <p><b>Severity:</b> High</p>
+        <!-- <p><b>Resolution:</b> </p> -->
+        <p><b>Workaround:</b> Verify whether your ETL process is susceptible to this issue
+          before enabling the file handle caching feature. You can set the
+          <cmdname>impalad</cmdname> configuration option
+          <codeph>unused_file_handle_timeout_sec</codeph> to a time period that is shorter
+          than the HDFS setting <codeph></codeph>. (Keep in mind that the HDFS setting is in
+          milliseconds while the Impala setting is in seconds.)
+        </p>
+      </conbody>
+    </concept>
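The unit comparison in the workaround above can be sketched as follows. The flag name `unused_file_handle_timeout_sec` comes from the text; the HDFS expiry value is a purely hypothetical example, since the actual HDFS setting name is elided in the source:

```python
# Illustrative sketch only: the impalad flag unused_file_handle_timeout_sec
# is expressed in seconds, while the corresponding HDFS expiry setting
# (name elided in the source text) is expressed in milliseconds.
# Both values below are hypothetical examples, not recommended settings.
hdfs_expiry_ms = 300_000       # hypothetical HDFS expiry: 5 minutes
impala_timeout_sec = 240       # candidate unused_file_handle_timeout_sec

# Convert to a common unit before comparing, per the workaround's note
# that the two settings use different units.
ok = impala_timeout_sec < hdfs_expiry_ms / 1000
print(ok)  # True: the Impala timeout is shorter than the HDFS expiry
```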
   <concept id="known_issues_usability">
diff --git a/docs/topics/impala_scalability.xml b/docs/topics/impala_scalability.xml
index 7c24fe2..2533e16 100644
--- a/docs/topics/impala_scalability.xml
+++ b/docs/topics/impala_scalability.xml
@@ -941,4 +941,68 @@ so other secure services might be affected temporarily.
+  <concept id="scalability_file_handle_cache" rev="2.10.0 IMPALA-4623">
+    <title>Scalability Considerations for NameNode Traffic with File Handle Caching</title>
+    <conbody>
+      <p>
+        One scalability aspect that affects heavily loaded clusters is the load on the HDFS
+        NameNode, from looking up the details as each HDFS file is opened. Impala queries
+        often access many different HDFS files, for example if a query does a full table scan
+        on a table with thousands of partitions, each partition containing multiple data files.
+        Accessing each column of a Parquet file also involves a separate <q>open</q> call,
+        further increasing the load on the NameNode. High NameNode overhead can add startup time
+        (that is, increase latency) to Impala queries, and reduce overall throughput for
+        workloads that also require accessing HDFS files.
+      </p>
+      <p>
+        In <keyword keyref="impala210_full"/> and higher, you can reduce NameNode overhead by
+        enabling a caching feature for HDFS file handles. Data files that are accessed by
+        different queries, or even multiple times within the same query, can be accessed
+        without a new <q>open</q> call and without fetching the file details again from the
+        NameNode.
+      </p>
+      <p>
+        Because this feature only involves HDFS data files, it does not apply to non-HDFS tables,
+        such as Kudu or HBase tables, or tables that store their data on cloud services such as
+        S3 or ADLS. Any read operations that perform remote reads also skip the cached file
+        handles.
+      </p>
+      <p>
+        This feature is turned off by default. To enable it, set the configuration option
+        <codeph>max_cached_file_handles</codeph> to a non-zero value for each
+        <cmdname>impalad</cmdname> daemon. Consider an initial starting value of 20 thousand,
+        and adjust upward if NameNode overhead is still significant, or downward if it is more
+        important to reduce the extra memory usage on each host. Each cache entry consumes 6 KB,
+        meaning that caching 20,000 file handles requires up to 120 MB on each DataNode. The
+        exact memory usage varies depending on how many file handles have actually been cached;
+        memory is freed as file handles are evicted from the cache.
+      </p>
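The sizing claim in the paragraph above (6 KB per entry, 20,000 handles, up to 120 MB) can be verified with back-of-envelope arithmetic; this sketch only restates the numbers from the text:

```python
# Back-of-envelope check of the documented sizing: each cache entry
# consumes about 6 KB, so 20,000 cached file handles need roughly
# 6 KB * 20,000 = 120,000 KB, which the text rounds up to 120 MB.
ENTRY_KB = 6
handles = 20_000
total_mb = ENTRY_KB * handles / 1024  # KB -> MB (binary)
print(f"{total_mb:.1f} MB")  # 117.2 MB, quoted as "up to 120 MB"
```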
+      <p>
+        If a manual HDFS operation moves a file to the HDFS Trashcan while the file handle is
+        cached, Impala still accesses the contents of that file. This is a change from prior
+        behavior. Formerly, accessing a file that was in the trashcan would cause an error.
+        This behavior only applies to non-Impala methods of removing HDFS files, not the Impala
+        mechanisms such as <codeph>TRUNCATE</codeph> or <codeph>DROP TABLE</codeph>.
+      </p>
+      <p>
+        If files are removed, replaced, or appended by HDFS operations outside of Impala, the
+        way to bring the file information up to date is to run the <codeph>REFRESH</codeph>
+        statement on the table.
+      </p>
+      <p>
+        File handle cache entries are evicted as the cache fills up, or based on a timeout
+        when they have not been accessed for some time.
+      </p>
+      <p>
+        To evaluate the effectiveness of file handle caching for a particular workload, issue the
+        <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine query
+        profiles in the Impala web UI. Look for the ratio of <codeph>CachedFileHandlesHitCount</codeph>
+        (ideally, should be high) to <codeph>CachedFileHandlesMissCount</codeph> (ideally, should
+        be low). Before starting any evaluation, run some representative queries to <q>warm up</q>
+        the cache, because the first time each data file is accessed is always recorded as a
+        cache miss. To see metrics about file handle caching for each <cmdname>impalad</cmdname>
+        instance, examine the <uicontrol>/metrics</uicontrol> page in the Impala web UI, in
+        particular the fields
+        <uicontrol></uicontrol>,
+        <uicontrol></uicontrol>,
+        <uicontrol></uicontrol>.
+      </p>
+    </conbody>
+  </concept>
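A minimal sketch of the hit-ratio evaluation described in the profiling paragraph. The counter names `CachedFileHandlesHitCount` and `CachedFileHandlesMissCount` come from the text; the helper function and sample values are hypothetical:

```python
# Hypothetical sketch: compute the cache hit ratio from the two profile
# counters named in the doc text. Sample values are invented.
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of file opens served from the file handle cache."""
    total = hits + misses
    return hits / total if total else 0.0

# Counters as they might be read from a query profile (hypothetical values)
ratio = hit_ratio(1800, 200)
print(f"{ratio:.0%}")  # 90% -- a high ratio means the cache is effective
```

A low ratio after warming up the cache suggests the working set of file handles exceeds `max_cached_file_handles`, or that files are being rewritten outside Impala.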
