drill-commits mailing list archives

From bridg...@apache.org
Subject drill git commit: delete obsolete parquet metadata caching
Date Fri, 02 Oct 2015 00:01:35 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages 70415f269 -> eee4fae0e


delete obsolete parquet metadata caching


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/eee4fae0
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/eee4fae0
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/eee4fae0

Branch: refs/heads/gh-pages
Commit: eee4fae0ed26881b76b9b8d0082a64dbc11f38e1
Parents: 70415f2
Author: Kristine Hahn <khahn@maprtech.com>
Authored: Thu Oct 1 16:48:02 2015 -0700
Committer: Kristine Hahn <khahn@maprtech.com>
Committed: Thu Oct 1 16:48:02 2015 -0700

----------------------------------------------------------------------
 .../040-parquet-format.md                       | 18 -----------------
 _docs/getting-started/010-drill-introduction.md |  9 ++++-----
 .../020-querying-parquet-files.md               | 21 --------------------
 _docs/sql-reference/090-sql-extensions.md       | 10 +---------
 4 files changed, 5 insertions(+), 53 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/eee4fae0/_docs/data-sources-and-file-formats/040-parquet-format.md
----------------------------------------------------------------------
diff --git a/_docs/data-sources-and-file-formats/040-parquet-format.md b/_docs/data-sources-and-file-formats/040-parquet-format.md
index 814dc42..f63d611 100644
--- a/_docs/data-sources-and-file-formats/040-parquet-format.md
+++ b/_docs/data-sources-and-file-formats/040-parquet-format.md
@@ -20,24 +20,6 @@ Apache Drill includes the following support for Parquet:
 ## Reading Parquet Files
 When a read of Parquet data occurs, Drill loads only the necessary columns of data, which reduces I/O. Reading only a small piece of the Parquet data from a data file or table, Drill can examine and analyze all values for a column across multiple files. You can create a Drill table from one format and store the data in another format, including Parquet.
 
-## Caching Metadata
-
-For performant querying of a large number of files, Drill 1.2 and later can take advantage of metadata, such as the Hive metadata store, and includes the capability of generating a metadata cache for performant querying of thousands of Parquet files. The metadata cache is not a central caching system, but simply one or more files of metadata. Drill generates and saves a cache of metadata in each directory in nested directories. You trigger the generation of metadata caches by running the REFRESH TABLE METADATA command, as described in [Querying Parquet Files]({{site.baseurl}}/docs/querying-parquet-files/).
-
-After generating the metadata cache, Drill performs the following tasks during the planning phase for a query on a directory of Parquet files:
-
-* Finds files.  
-* Recurses directories.  
-* Reads the footers of files to get information, such as row counts and HDFS block locations for every file for Drill to assign work based on locality.  
-  When Drill reads the file, it attempts to execute the query on the node where the data rests.  
-* Summarizes the information from the footers in a single metadata cache file.  
-* Stores the metadata cache file at each level that covers that particular level and all lower levels.
-
-At execution time, Drill reads the actual files. At planning time, Drill reads only the metadata file.
-
-The first query that does not see the metadata file will gather the metadata, so the elapsed time of the first query will be very different from a subsequent query.
-
 ## Writing Parquet Files
 CREATE TABLE AS (CTAS) can use any data source provided by the storage plugin. To write Parquet data using the CTAS command, set the session store.format option as shown in the next section. Alternatively, configure the storage plugin to point to the directory containing the Parquet files.
 
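The CTAS workflow that the retained "Writing Parquet Files" paragraph describes can be sketched as follows. This example is not part of the commit; the `store.format` session option comes from the paragraph above, while the table and file names are hypothetical:

```
0: jdbc:drill:schema=dfs> ALTER SESSION SET `store.format` = 'parquet';
0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.`sales_parquet` AS
. . . . . . . . . . . . > SELECT * FROM dfs.`/data/sales.json`;
```

The CTAS statement writes its result set as Parquet files under the target workspace directory.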

http://git-wip-us.apache.org/repos/asf/drill/blob/eee4fae0/_docs/getting-started/010-drill-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/getting-started/010-drill-introduction.md b/_docs/getting-started/010-drill-introduction.md
index 49264a6..ae5c9db 100644
--- a/_docs/getting-started/010-drill-introduction.md
+++ b/_docs/getting-started/010-drill-introduction.md
@@ -9,7 +9,7 @@ applications, while still providing the familiarity and ecosystem of ANSI SQL,
 the industry-standard query language. Drill provides plug-and-play integration
 with existing Apache Hive and Apache HBase deployments. 
 
-## What's New in Apache Drill 1.2
+<!-- ## What's New in Apache Drill 1.2
 
 This release of Drill fixes [many issues]() and introduces a number of enhancements, including the following ones:
 
@@ -18,10 +18,9 @@ This release of Drill fixes [many issues]() and introduces a number of enhanceme
  * [LAG and LEAD]({{site.baseurl}}/docs/value-window-functions/#lag-lead)  
   * [FIRST_VALUE and LAST_VALUE]({{site.baseurl}}/docs/value-window-functions/#first_value-last_value)
 
 * [Security]({{site.baseurl}}/docs/configuring-web-console-and-rest-api-security/) for Web Console and REST API operations  
-* Performance improvements for [querying HBase]({{site.baseurl}}/docs/querying-hbase/#querying-big-endian-encoded-data), which includes leveraging [ordered byte encoding]({{site.baseurl}}/docs/querying-hbase/#leveraging-hbase-ordered-byte-encoding).
-* Parquet metadata caching for performantly reading large numbers of Parquet files
-* [Optimized reads]({{site.baseurl}}/docs/querying-hive/#optimizing-reads-of-parquet-backed-tables) of Parquet-backed, Hive tables
-* Read support for the [Parquet INT96 type]({{site.baseurl}}/docs/parquet-format/#about-int96-support) and a new TIMESTAMP_IMPALA type used with the [CONVERT_FROM]({{site.baseurl}}/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions) function decodes a timestamp from Hive or Impala.  
+* Performance improvements for [querying HBase]({{site.baseurl}}/docs/querying-hbase/#querying-big-endian-encoded-data), which includes leveraging [ordered byte encoding]({{site.baseurl}}/docs/querying-hbase/#leveraging-hbase-ordered-byte-encoding)
 
+* [Optimized reads]({{site.baseurl}}/docs/querying-hive/#optimizing-reads-of-parquet-backed-tables) of Parquet-backed, Hive tables  
+* Read support for the [Parquet INT96 type]({{site.baseurl}}/docs/parquet-format/#about-int96-support) and a new TIMESTAMP_IMPALA type used with the [CONVERT_FROM]({{site.baseurl}}/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions) function to decode a timestamp from Hive or Impala. -->
 
 ## What's New in Apache Drill 1.1
 

http://git-wip-us.apache.org/repos/asf/drill/blob/eee4fae0/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/query-a-file-system/020-querying-parquet-files.md b/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
index b0228db..c86006f 100644
--- a/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
+++ b/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
@@ -3,27 +3,6 @@ title: "Querying Parquet Files"
 parent: "Querying a File System"
 ---
 
-Drill 1.2 and later extends SQL for performant querying of a large number, thousands or more, of Parquet files. By running the following command, you trigger the generation of metadata files in the directory of Parquet files and its subdirectories:
-
-`REFRESH TABLE METADATA <path to table>`
-
-You need to run the command on a file or directory only once during the session. Subsequent queries return results quickly because Drill refers to the metadata saved in the cache, as described in [Reading Parquet Files]({{site.baseurl}}/docs/parquet-format/#reading-parquet-files).
-
-You can query nested directories from any level. For example, you can query a sub-sub-directory of Parquet files because Drill stores a metadata cache of information at each level that covers that particular level and all lower levels.
-
-## Example of Generating Parquet Metadata
-
-```
-0: jdbc:drill:schema=dfs> REFRESH TABLE METADATA t1;
-+-------+----------------------------------------------+
-|  ok   |                   summary                    |
-+-------+----------------------------------------------+
-| true  | Successfully updated metadata for table t1.  |
-+-------+----------------------------------------------+
-1 row selected (0.445 seconds)
-```
-
-## Sample Parquet Files  
 The Drill installation includes a `sample-data` directory with Parquet files
 that you can query. Use SQL to query the `region.parquet` and
 `nation.parquet` files in the `sample-data` directory.
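The retained paragraph about the `sample-data` directory can be exercised with a query along these lines. This example is not part of the commit, and the installation path shown is hypothetical:

```
0: jdbc:drill:schema=dfs> SELECT * FROM dfs.`/opt/drill/sample-data/region.parquet`;
```

Because Parquet files are self-describing, no table definition is needed; the file path in back ticks is queried directly.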

http://git-wip-us.apache.org/repos/asf/drill/blob/eee4fae0/_docs/sql-reference/090-sql-extensions.md
----------------------------------------------------------------------
diff --git a/_docs/sql-reference/090-sql-extensions.md b/_docs/sql-reference/090-sql-extensions.md
index 31f594e..de8d989 100644
--- a/_docs/sql-reference/090-sql-extensions.md
+++ b/_docs/sql-reference/090-sql-extensions.md
@@ -2,20 +2,12 @@
 title: "SQL Extensions"
 parent: "SQL Reference"
 ---
-Drill extends SQL to generating Parquet metadata, to work with Hadoop-scale data, and to explore smaller-scale data in ways not possible with SQL. Using intuitive SQL extensions you work with self-describing data and complex data types. Extensions to SQL include capabilities for exploring self-describing data, such as files and HBase, directly in the native format.
+Drill extends SQL to explore smaller-scale data in ways not possible with SQL. Using intuitive SQL extensions, you work with self-describing data and complex data types. Extensions to SQL include capabilities for exploring self-describing data, such as files and HBase, directly in the native format.
 
 Drill provides language support for pointing to [storage plugin]({{site.baseurl}}/docs/connect-a-data-source-introduction) interfaces that Drill uses to interact with data sources. Use the name of a storage plugin to specify a file system *database* as a prefix in queries when you refer to objects across databases. Query files, including compressed .gz files, and [directories]({{ site.baseurl }}/docs/querying-directories), as you would query an SQL table. You can query multiple files in a directory.
 
 Drill extends the SELECT statement for reading complex, multi-structured data. The extended CREATE TABLE AS provides the capability to write data of complex/multi-structured data types. Drill extends the [lexical rules](http://drill.apache.org/docs/lexical-structure) for working with files and directories, such as using back ticks for including file names, directory names, and reserved words in queries. Drill syntax supports using the file system as a persistent store for query profiles and diagnostic information.
 
-## Extension for Generating Parquet Metadata
-
-To speed querying of Parquet files, you can [generate metadata]({{site.baseurl}}/docs/querying-parquet-files/) in Drill 1.2 and later. Running the following command triggers the generation of metadata files in a directory of Parquet files and its subdirectories:
-
-`REFRESH TABLE METADATA <path to table>`
-
-Drill takes advantage of metadata, such as the Hive metadata store, and generates a [metadata cache]({{site.baseurl}}/docs/parquet-format/#caching-metadata). Using metadata can improve performance of queries on a large number of files.
-
 ## Extensions for Hive- and HBase-related Data Sources
 
 Drill supports Hive and HBase as a plug-and-play data source. Drill can read tables created in Hive that use [data types compatible]({{ site.baseurl }}/docs/hive-to-drill-data-type-mapping) with Drill. You can query Hive tables without modifications. You can query self-describing data without requiring metadata definitions in the Hive metastore. Primitives, such as JOIN, support columnar operation.
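The lexical extensions mentioned in the context above (back ticks around file and directory names, and querying compressed .gz files) might be used as in the following sketch. These statements are not part of the commit, and the paths are hypothetical:

```
0: jdbc:drill:schema=dfs> SELECT * FROM dfs.`/logs/2015` LIMIT 10;
0: jdbc:drill:schema=dfs> SELECT * FROM dfs.`/logs/events.json.gz` LIMIT 10;
```

The first query reads every file in a directory as one table; the second decompresses and queries a gzipped JSON file in place.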

