drill-commits mailing list archives

From bridg...@apache.org
Subject drill git commit: doc update for DRILL-3867
Date Thu, 10 Aug 2017 22:25:30 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages 49e50be68 -> 3272b2e74


doc update for DRILL-3867


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/3272b2e7
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/3272b2e7
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/3272b2e7

Branch: refs/heads/gh-pages
Commit: 3272b2e74abb9e685518925d3f61ef4d06b84172
Parents: 49e50be
Author: Bridget Bevens <bbevens@maprtech.com>
Authored: Thu Aug 10 15:24:35 2017 -0700
Committer: Bridget Bevens <bbevens@maprtech.com>
Committed: Thu Aug 10 15:24:35 2017 -0700

----------------------------------------------------------------------
 .../025-optimizing-parquet-reading.md           | 41 ++++++++++++--------
 1 file changed, 24 insertions(+), 17 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/3272b2e7/_docs/performance-tuning/025-optimizing-parquet-reading.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/025-optimizing-parquet-reading.md b/_docs/performance-tuning/025-optimizing-parquet-reading.md
index f2d5141..65884af 100644
--- a/_docs/performance-tuning/025-optimizing-parquet-reading.md
+++ b/_docs/performance-tuning/025-optimizing-parquet-reading.md
@@ -1,41 +1,48 @@
 ---
 title: "Optimizing Parquet Metadata Reading"
-date: 2016-02-08 21:57:13 UTC
+date: 2017-08-10 22:24:37 UTC
 parent: "Performance Tuning"
 ---
 
-Parquet metadata caching is an optional feature in Drill 1.2 and later. When you use this
feature, Drill generates a metadata cache file. Drill stores the metadata cache file in a
directory you specify and its subdirectories. When you run a query on this directory or a
subdirectory, Drill reads a single metadata cache file instead of retrieving metadata from
multiple Parquet files during the query-planning phase.
+Parquet metadata caching is a feature that enables Drill to read a single metadata cache
file instead of retrieving metadata from multiple Parquet files during the query-planning
phase. 
+Parquet metadata caching is available for Parquet data in Drill 1.2 and later. To enable
Parquet metadata caching, issue the REFRESH TABLE METADATA <path to table> command.
When you run this command Drill generates a metadata cache file.  
 
-Parquet metadata caching is useful only with Parquet data, and does not benefit queries on
Hive tables, HBase tables, or text files. 
+{% include startnote.html %}Parquet metadata caching does not benefit queries on Hive tables,
HBase tables, or text files.{% include endnote.html %}  
+
+Drill stores the metadata cache file in the specified directory and subdirectories. When
you run a query on this directory or subdirectories, Drill reads the metadata cache file instead
of retrieving metadata from multiple Parquet files during the query-planning phase.     
+
+In Drill 1.11 and later, Drill stores the paths to the Parquet files as relative paths instead
of absolute paths. You can move partitioned Parquet directories from one location in the distributed
files system to another without issuing the REFRESH TABLE METADATA command to rebuild the
Parquet metadata files; the metadata remains valid in the new location.   
+
+{% include startnote.html %}Reverting back to a previous version of Drill from 1.11 is not
recommended because Drill will incorrectly interpret the Parquet metadata files created by
Drill 1.11. Should this occur, remove the Parquet metadata files and run the refresh table
metadata command to rebuild the files in the older format.{% include endnote.html %} 
+ 
 
 ## When to Use Parquet Metadata Caching
 
-The scenarios in which metadata caching is useful is when the planning time is a significant
percentage of the total elapsed time of the query. If the query execution time is the dominant
factor, which is typically observed with a large number of files, then metadata caching will
have very little impact. To determine that query execution time is the dominant factor, run
an EXPLAIN plan on your query of a large number of files, and compare its time to the total
time of query execution. Use the comparison to determine whether metadata caching will be
useful.
+Metadata caching is useful when planning time is a significant percentage of the total elapsed
time of the query. If the query execution time is the dominant factor, which is typically
observed with a large number of files, then metadata caching will have very little impact.
To determine that query execution time is the dominant factor, run an EXPLAIN plan on your
query of a large number of files, and compare its time to the total time of query execution.
Use the comparison to determine whether metadata caching will be useful.
 
 When enabled, Drill always uses the Parquet metadata cache during the query-planning phase.
To optimize reading Parquet metadata, make sure the metadata cache is up-to-date after making
any changes, such as inserts, to the data in the cluster. The next section describes how to
update the metadata cache.
 
 
-## How to Trigger Generation of the Parquet Metadata Cache File
+## Generating the Parquet Metadata Cache File
 
 The following command generates the Parquet metadata cache file in the `<path to table>`
and its subdirectories.
 
-`REFRESH TABLE METADATA <path to table>`
+       REFRESH TABLE METADATA <path to table>
 
 You need to run this command on a directory, nested or flat, only once during the session.
Only the first query gathers the metadata unless the Parquet data changes, for example, you
delete some data. If you did not make changes to the Parquet data, subsequent queries encounter
the up-to-date Parquet metadata files. There is no need for Drill to regenerate the metadata.
If there are changes, the metadata needs updating, so Drill dynamically regenerates the Parquet
metadata when you issue the next query.
 
-The elapsed time of the first query that triggers regeneration of metadata can be greater
than that of subsequent queries that use that metadata. If this increase in the time of the
first query is unacceptable, make sure the cache is up-to-date by running the REFRESH TABLE
METADATA command.
+The elapsed time of the first query that triggers regeneration of metadata can be greater
than that of subsequent queries that use that metadata. If this increase in the time of the
first query is unacceptable, make sure the cache is up-to-date by running the REFRESH TABLE
METADATA command, as shown in the following example:
+
 
-## Example of Generating Parquet Metadata
+       0: jdbc:drill:schema=dfs> REFRESH TABLE METADATA t1;
+       +-------+----------------------------------------------+
+       |  ok   |                   summary                    |
+       +-------+----------------------------------------------+
+       | true  | Successfully updated metadata for table t1.  |
+       +-------+----------------------------------------------+
+       1 row selected (0.445 seconds)  
+  
 
-```
-0: jdbc:drill:schema=dfs> REFRESH TABLE METADATA t1;
-+-------+----------------------------------------------+
-|  ok   |                   summary                    |
-+-------+----------------------------------------------+
-| true  | Successfully updated metadata for table t1.  |
-+-------+----------------------------------------------+
-1 row selected (0.445 seconds)
-```
 
 ## How Drill Generates and Uses Parquet Metadata
 

