drill-commits mailing list archives

From krish...@apache.org
Subject [04/11] drill git commit: migration tool docs
Date Mon, 14 Dec 2015 23:48:55 GMT
migration tool docs


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/965bfbf1
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/965bfbf1
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/965bfbf1

Branch: refs/heads/gh-pages
Commit: 965bfbf1ace6f5f05793902600a7568111579350
Parents: 161af8f
Author: Kris Hahn <krishahn@apache.org>
Authored: Mon Dec 14 10:07:59 2015 -0800
Committer: Kris Hahn <krishahn@apache.org>
Committed: Mon Dec 14 15:46:37 2015 -0800

----------------------------------------------------------------------
 .../performance-tuning/020-partition-pruning.md | 118 -------------------
 .../010-partition-pruning-introduction.md       |  21 ++++
 .../020-migrating-partitioned-data.md           |  50 ++++++++
 .../partition-pruning/030-partition-pruning.md  | 111 +++++++++++++++++
 4 files changed, 182 insertions(+), 118 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/020-partition-pruning.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/020-partition-pruning.md b/_docs/performance-tuning/020-partition-pruning.md
old mode 100644
new mode 100755
index 26f681f..ddf67eb
--- a/_docs/performance-tuning/020-partition-pruning.md
+++ b/_docs/performance-tuning/020-partition-pruning.md
@@ -2,121 +2,3 @@
 title: "Partition Pruning"
 parent: "Performance Tuning"
 --- 
-
-Partition pruning is a performance optimization that limits the number of files and partitions
that Drill reads when querying file systems and Hive tables. When you partition data, Drill
only reads a subset of the files that reside in a file system or a subset of the partitions
in a Hive table when a query matches certain filter criteria.
- 
-The query planner in Drill performs partition pruning by evaluating the filters. If no partition
filters are present, the underlying Scan operator reads all files in all directories and then
sends the data to operators, such as Filter, downstream. When partition filters are present,
the query planner pushes the filters down to the Scan if possible. The Scan reads only the
directories that match the partition filters, thus reducing disk I/O.
-
-## Migrating Partitioned Data from Drill 1.1-1.2 to Drill 1.3
-Use the [drill-upgrade tool](https://github.com/parthchandra/drill-upgrade) to migrate Parquet
data that you generated in Drill 1.1 or 1.2 before attempting to use the data with Drill 1.3
partition pruning.  This migration is mandatory because Parquet data generated by Drill 1.1
and 1.2 must be marked as Drill-generated, as described in [DRILL-4070](https://issues.apache.org/jira/browse/DRILL-4070).

-
-Drill 1.3 fixes a bug to accurately process Parquet files produced by other tools, such as
Pig and Hive. The bug fix eliminated the risk of inaccurate metadata that could cause incorrect
results when querying Hive- and Pig-generated Parquet files. No such risk exists with Drill-generated
Parquet files. Querying Drill-generated Parquet files, regardless of the Drill version, yields
accurate results. Drill-generated Parquet files, regardless of the Drill release, contain
accurate metadata.
-
-After using the drill-upgrade tool to migrate your partitioned, pre-1.3 Parquet data, Drill
can distinguish these files from those generated by other tools, such as Hive and Pig. Use
the migration tool only on files generated by Drill. 
-
-To partition and query Parquet files generated from other tools, use Drill to read and rewrite
the files and metadata using the CTAS command with the PARTITION BY clause. Alternatively,
use the tool that generated the original files to regenerate Parquet 1.8 or later files.
-
-## How to Partition Data
-
-In Drill 1.1.0 and later, if the data source is Parquet, no data organization tasks are required
to take advantage of partition pruning. Write Parquet data using the [PARTITION BY]({{site.baseurl}}/docs/partition-by-clause/)
clause in the CTAS statement. 
-
-The Parquet writer first sorts data by the partition keys, and then creates a new file when
it encounters a new value for the partition columns. During partitioning, Drill creates separate
files, but not separate directories, for different partitions. Each file contains exactly
one partition value, but there can be multiple files for the same partition value. 
-
-Partition pruning uses the Parquet column statistics to determine which columns to use to
prune. 
-
-Unlike using the Drill 1.0 partitioning, no view query is subsequently required, nor is it
necessary to use the [dir* variables]({{site.baseurl}}/docs/querying-directories) after you
use the Drill 1.1 PARTITION BY clause in a CTAS statement. 
-
-## Drill 1.0 Partitioning
-
-You perform the following steps to partition data in Drill 1.0.   
- 
-1. Devise a logical way to store the data in a hierarchy of directories. 
-2. Use CTAS to create Parquet files from the original data, specifying filter conditions.
-3. Move the files into directories in the hierarchy. 
-
-After partitioning the data, you need to create a view of the partitioned data to query the
data. You can use the [dir* variables]({{site.baseurl}}/docs/querying-directories) in queries
to refer to subdirectories in your workspace path.
- 
-### Drill 1.0 Partitioning Example
-
-Suppose you have text files containing several years of log data. To partition the data by
year and quarter, create the following hierarchy of directories:  
-       
-       …/logs/1994/Q1  
-       …/logs/1994/Q2  
-       …/logs/1994/Q3  
-       …/logs/1994/Q4  
-       …/logs/1995/Q1  
-       …/logs/1995/Q2  
-       …/logs/1995/Q3  
-       …/logs/1995/Q4  
-       …/logs/1996/Q1  
-       …/logs/1996/Q2  
-       …/logs/1996/Q3  
-       …/logs/1996/Q4  
-
-Run the following CTAS statement, filtering on the Q1 1994 data.
- 
-          CREATE TABLE TT_1994_Q1 
-              AS SELECT * FROM <raw table data in text format >
-              WHERE columns[1] = 1994 AND columns[2] = 'Q1'
- 
-This creates a Parquet file with the log data for Q1 1994 in the current workspace.  You
can then move the file into the correlating directory, and repeat the process until all of
the files are stored in their respective directories.
-
-Now you can define views on the parquet files and query the views.  
-
-       0: jdbc:drill:zk=local> create view vv1 as select `dir0` as `year`, `dir1` as `qtr`
from dfs.`/Users/max/data/multilevel/parquet`;
-       +------------+------------+
-       |     ok     |  summary   |
-       +------------+------------+
-       | true       | View 'vv1' created successfully in 'dfs.tmp' schema |
-       +------------+------------+
-       1 row selected (0.16 seconds)  
-
-Query the view to see all of the logs.  
-
-       0: jdbc:drill:zk=local> select * from dfs.tmp.vv1;
-       +------------+------------+
-       |    year    |    qtr     |
-       +------------+------------+
-       | 1994       | Q1         |
-       | 1994       | Q3         |
-       | 1994       | Q3         |
-       | 1994       | Q4         |
-       | 1994       | Q4         |
-       | 1994       | Q4         |
-       | 1994       | Q4         |
-       | 1995       | Q2         |
-       | 1995       | Q2         |
-       | 1995       | Q2         |
-       | 1995       | Q2         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q2         |
-       | 1996       | Q3         |
-       | 1996       | Q3         |
-       | 1996       | Q3         |
-       +------------+------------+
-       ...
-
-
-When you query the view, Drill can apply partition pruning and read only the files and directories
required to return query results.
-
-       0: jdbc:drill:zk=local> explain plan for select * from dfs.tmp.vv1 where `year`
= 1996 and qtr = 'Q2';
-       +------------+------------+
-       |    text    |    json    |
-       +------------+------------+
-       | 00-00    Screen
-       00-01      Project(year=[$0], qtr=[$1])
-       00-02        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/maxdata/multilevel/parquet/1996/Q2/orders_96_q2.parquet]],
selectionRoot=/Users/max/data/multilevel/parquet, numFiles=1, columns=[`dir0`, `dir1`]]])
-       
-
-

http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md
b/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md
new file mode 100755
index 0000000..77c16d8
--- /dev/null
+++ b/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md
@@ -0,0 +1,21 @@
+---
+title: "Partition Pruning Introduction"
+parent: "Partition Pruning"
+--- 
+
+Partition pruning is a performance optimization that limits the number of files and partitions
that Drill reads when querying file systems and Hive tables. When you partition data, Drill
only reads a subset of the files that reside in a file system or a subset of the partitions
in a Hive table when a query matches certain filter criteria.
+
+The query planner in Drill performs partition pruning by evaluating the filters. If no partition
filters are present, the underlying Scan operator reads all files in all directories and then
sends the data to operators, such as Filter, downstream. When partition filters are present,
the query planner pushes the filters down to the Scan if possible. The Scan reads only the
directories that match the partition filters, thus reducing disk I/O.
+
+## Using Partitioned Drill 1.1-1.2 Data
+Before using partitioned Drill 1.1-1.2 data in Drill 1.3, you need to migrate the data. Migrate
Parquet data as described in "Migrating Partitioned Data". 
+
+{% include startimportant.html %}Migrate only Parquet files that Drill generated.{% include
endimportant.html %}
+
+## Partitioning Data
+Prior to the release of Drill 1.1, partition pruning involved time-consuming manual setup
tasks. Using the PARTITION BY clause in the CTAS command simplifies the process. "How to Partition
Data" describes this process.
+
+
+
+
+

http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md
b/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md
new file mode 100755
index 0000000..d3ddcc8
--- /dev/null
+++ b/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md
@@ -0,0 +1,50 @@
+---
+title: "Migrating Partitioned Data"
+parent: "Partition Pruning"
+--- 
+
+Migrating Parquet data that you partitioned and generated in Drill 1.1 or 1.2 is mandatory
before using the data in Drill 1.3. The migration marks the data as Drill-generated. Use the
[drill-upgrade tool](https://github.com/parthchandra/drill-upgrade) to perform the migration. 
+
+{% include startimportant.html %} Run the upgrade tool only on Drill-generated Parquet files.
{% include endimportant.html %}
+
+<!-- as described in [DRILL-4070](https://issues.apache.org/jira/browse/DRILL-4070). 
-->
+
+## Why Migrate Drill 1.1-1.2 Data
+Parquet data partitioning became available in Drill 1.1 with the introduction of the PARTITION
BY clause of the CTAS command. Drill 1.3 uses the latest (as of the 1.3 release date) Apache
Parquet library when generating and partitioning Parquet files, whereas Drill 1.1 and
1.2 used a previous version of the library created by the Drill team. The Drill team
fixed a bug in that previous library so that Drill accurately processes Parquet files generated
by other tools, such as Impala and Hive. Apache Parquet fixed the same bug in the latest
library, making it suitable for use in Drill 1.3. Drill now uses the same Apache Parquet
library as Impala, Hive, and other software. You need to run the upgrade tool on Parquet
files that Drill 1.1 and 1.2 generated using the previous library. 
+
+The upgrade tool simply inserts a version number in the metadata to mark the file as a Drill
file. 
+
+<!-- The bug fix eliminated the risk of inaccurate metadata that could cause incorrect
results when querying Hive- and Pig-generated Parquet files. No such risk exists with Drill-generated
Parquet files. Querying Drill-generated Parquet files, regardless of the Drill version, yields
accurate results. Drill-generated Parquet files, regardless of the Drill release, contain
accurate metadata. -->
+
+## How to Migrate Data
+Use the [drill-upgrade tool](https://github.com/parthchandra/drill-upgrade) to modify one
file at a time. The temp directory holds a copy of the file currently being modified, for
recovery in the event of a system failure. 
+
+System administrators can write a shell script to run the upgrade tool simultaneously on
multiple sub-directories.
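+A script along the following lines could launch one instance of the tool per subdirectory. It is a sketch modeled on the example invocation shown later on this page; the paths, subdirectory names, and log-file locations are assumptions to adapt for your cluster:

```shell
#!/bin/sh
# Sketch: run the upgrade tool on several subdirectories at once.
# Each run gets its own temp directory because different directories
# can contain files with the same names.
DATA_ROOT=maprfs:///drill/testdata
TEMP_ROOT=maprfs:///drill/upgrade-temp

upgrade_cmd() {
  # Print the java command line for one subdirectory.
  echo "java -Dlog.path=./upgrade-$1.log" \
       "-cp drill-upgrade-1.0-jar-with-dependencies.jar" \
       "org.apache.drill.upgrade.Upgrade_12_13" \
       "--tempDir=$TEMP_ROOT/$1 $DATA_ROOT/$1"
}

for d in logs1 logs2 logs3; do
  upgrade_cmd "$d"    # pipe each line to 'sh' (with '&' and 'wait') to run in parallel
done
```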
+
+## Preparing for the Migration
+In a test by Drill developers, it took 32 minutes to upgrade 1 TB of data in 840 files and
+370 minutes to upgrade 100 GB of data in 200,000 files. Although file size is a factor
in the upgrade time, the number of files is the most significant factor.
+
+To migrate Parquet data for use in Drill 1.3 that you partitioned and generated in Drill
1.1 or 1.2, follow these steps:
+
+{% include startimportant.html %} Run the upgrade tool only on Drill-generated Parquet files.
{% include endimportant.html %}
+
+1. Back up the data to be migrated.  
+2. Create one or more temp directories, depending on how you plan to run the upgrade tool,
on the same file system as the data.  
+   For example, if the data is on HDFS, create the temp directory on HDFS.
+   Create distinct temp directories when you run the upgrade tool simultaneously on multiple
directories because different directories can contain files with the same names.  
+3. Access the upgrade tool at TBD.  
+4. If you use [Parquet metadata caching]({{site.baseurl}}/docs/optimizing-parquet-metadata-reading/#how-to-trigger-generation-of-the-parquet-metadata-cache-file):
 
+   * Delete the cache file you generated from all directories and subdirectories where you
plan to run the upgrade tool.  
+   * Run REFRESH TABLE METADATA on all the folders where a cache file previously existed.
 
+5. Run the upgrade tool as shown in the following example:  
+   `java -Dlog.path=/home/rchallapalli/work/drill-upgrade/upgrade.log -cp drill-upgrade-1.0-jar-with-dependencies.jar
org.apache.drill.upgrade.Upgrade_12_13 --tempDir=maprfs:///drill/upgrade-temp maprfs:///drill/testdata/`
+
+## Checking the Success of the Migration
+
+## Handling of Migration Failure
+
+If a network connection goes down, or if a user cancels the operation, the file that was
being processed at the time could be corrupted. Always copy that file back from the
temp directory. When you re-run the upgrade tool, it skips the files that it already
processed and updates only the remaining files.
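+A minimal recovery sketch, using local paths for illustration (on a distributed file system, use the equivalent `hadoop fs -cp` command); the directory and file names are assumptions:

```shell
# The tool keeps a copy of the in-flight file in the temp directory.
# After an interruption, restore that copy over the possibly corrupted
# original before re-running the tool.
TEMP_DIR=/tmp/drill-upgrade-temp
DATA_DIR=/tmp/drill-testdata
mkdir -p "$TEMP_DIR" "$DATA_DIR"

# Simulate the backup copy the tool left behind:
printf 'parquet bytes' > "$TEMP_DIR/orders.parquet"

# Restore the interrupted file, then re-run the upgrade tool; it skips
# files it already processed and updates only the remaining ones.
cp "$TEMP_DIR/orders.parquet" "$DATA_DIR/orders.parquet"
```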
+
+
+

http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/partition-pruning/030-partition-pruning.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/partition-pruning/030-partition-pruning.md b/_docs/performance-tuning/partition-pruning/030-partition-pruning.md
new file mode 100755
index 0000000..e376d5d
--- /dev/null
+++ b/_docs/performance-tuning/partition-pruning/030-partition-pruning.md
@@ -0,0 +1,111 @@
+---
+title: "Partition Pruning"
+parent: "Partition Pruning"
+--- 
+
+In Drill 1.1.0 and later, if the data source is Parquet, no data organization tasks are required
to take advantage of partition pruning. To partition and query Parquet files generated from
other tools, use Drill to read and rewrite the files and metadata using the CTAS command with
the PARTITION BY clause, as described in the following section "How to Partition Data".
+
+## How to Partition Data
+
+Write Parquet data using the [PARTITION BY]({{site.baseurl}}/docs/partition-by-clause/) clause
in the CTAS statement. 
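+For example, a CTAS statement along these lines (the workspace, source path, and column names are illustrative) writes Parquet files partitioned by year:

    -- Illustrative only; dfs.tmp, the source path, and the column names are assumptions.
    -- The partition column must appear in the SELECT list.
    CREATE TABLE dfs.tmp.logs_by_year
    PARTITION BY (log_year)
    AS SELECT columns[1] AS log_year, columns[2] AS qtr, columns[3] AS entry
    FROM dfs.`/logs/raw`;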
+
+The Parquet writer first sorts data by the partition keys, and then creates a new file when
it encounters a new value for the partition columns. During partitioning, Drill creates separate
files, but not separate directories, for different partitions. Each file contains exactly
one partition value, but there can be multiple files for the same partition value. 
+
+Partition pruning uses the Parquet column statistics to determine which columns to use to
prune. 
+
+Unlike Drill 1.0 partitioning, after you use the PARTITION BY clause in a CTAS statement,
no view query is subsequently required, nor is it necessary to use the
[dir* variables]({{site.baseurl}}/docs/querying-directories). 
+
+## Drill 1.0 Partitioning
+
+Drill 1.0 does not support the PARTITION BY clause of the CTAS command that later
versions support. Partitioning data in Drill 1.0 involves the following steps. 
+ 
+1. Devise a logical way to store the data in a hierarchy of directories. 
+2. Use CTAS to create Parquet files from the original data, specifying filter conditions.
+3. Move the files into directories in the hierarchy. 
+
+After partitioning the data, you need to create a view of the partitioned data to query the
data. You can use the [dir* variables]({{site.baseurl}}/docs/querying-directories) in queries
to refer to subdirectories in your workspace path.
+ 
+### Drill 1.0 Partitioning Example
+
+Suppose you have text files containing several years of log data. To partition the data by
year and quarter, create the following hierarchy of directories:  
+       
+       …/logs/1994/Q1  
+       …/logs/1994/Q2  
+       …/logs/1994/Q3  
+       …/logs/1994/Q4  
+       …/logs/1995/Q1  
+       …/logs/1995/Q2  
+       …/logs/1995/Q3  
+       …/logs/1995/Q4  
+       …/logs/1996/Q1  
+       …/logs/1996/Q2  
+       …/logs/1996/Q3  
+       …/logs/1996/Q4  
+
+Run the following CTAS statement, filtering on the Q1 1994 data.
+ 
+          CREATE TABLE TT_1994_Q1 
+              AS SELECT * FROM <raw table data in text format >
+              WHERE columns[1] = 1994 AND columns[2] = 'Q1'
+ 
+This creates a Parquet file with the log data for Q1 1994 in the current workspace. You
can then move the file into the corresponding directory, and repeat the process until all of
the files are stored in their respective directories.
+
+Now you can define views on the Parquet files and query the views.  
+
+       0: jdbc:drill:zk=local> create view vv1 as select `dir0` as `year`, `dir1` as `qtr`
from dfs.`/Users/max/data/multilevel/parquet`;
+       +------------+------------+
+       |     ok     |  summary   |
+       +------------+------------+
+       | true       | View 'vv1' created successfully in 'dfs.tmp' schema |
+       +------------+------------+
+       1 row selected (0.16 seconds)  
+
+Query the view to see all of the logs.  
+
+       0: jdbc:drill:zk=local> select * from dfs.tmp.vv1;
+       +------------+------------+
+       |    year    |    qtr     |
+       +------------+------------+
+       | 1994       | Q1         |
+       | 1994       | Q3         |
+       | 1994       | Q3         |
+       | 1994       | Q4         |
+       | 1994       | Q4         |
+       | 1994       | Q4         |
+       | 1994       | Q4         |
+       | 1995       | Q2         |
+       | 1995       | Q2         |
+       | 1995       | Q2         |
+       | 1995       | Q2         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q2         |
+       | 1996       | Q3         |
+       | 1996       | Q3         |
+       | 1996       | Q3         |
+       +------------+------------+
+       ...
+
+
+When you query the view, Drill can apply partition pruning and read only the files and directories
required to return query results.
+
+       0: jdbc:drill:zk=local> explain plan for select * from dfs.tmp.vv1 where `year`
= 1996 and qtr = 'Q2';
+       +------------+------------+
+       |    text    |    json    |
+       +------------+------------+
+       | 00-00    Screen
+       00-01      Project(year=[$0], qtr=[$1])
+       00-02        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/max/data/multilevel/parquet/1996/Q2/orders_96_q2.parquet]],
selectionRoot=/Users/max/data/multilevel/parquet, numFiles=1, columns=[`dir0`, `dir1`]]])
+       
+
+

