drill-commits mailing list archives

From bridg...@apache.org
Subject [1/2] drill git commit: polishing, reorg for 1.2
Date Thu, 01 Oct 2015 01:04:08 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages f4dc0cf9f -> 0cfc6b620


polishing, reorg for 1.2

new section intro

format code

formatting

1.2 updates

formatting

formatting

minor edits


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/297b4304
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/297b4304
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/297b4304

Branch: refs/heads/gh-pages
Commit: 297b4304dd1b2b1a08adbf6f2d286c2334029ca2
Parents: f4dc0cf
Author: Kristine Hahn <khahn@maprtech.com>
Authored: Mon Sep 28 15:10:27 2015 -0700
Committer: Kristine Hahn <khahn@maprtech.com>
Committed: Mon Sep 28 16:53:51 2015 -0700

----------------------------------------------------------------------
 _docs/query-data/030-querying-hbase.md          | 187 +++++++++++--------
 .../020-querying-parquet-files.md               |   2 +-
 _docs/sql-reference/090-sql-extensions.md       |  10 +-
 3 files changed, 122 insertions(+), 77 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/297b4304/_docs/query-data/030-querying-hbase.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/030-querying-hbase.md b/_docs/query-data/030-querying-hbase.md
index 486cd61..b6a96e4 100644
--- a/_docs/query-data/030-querying-hbase.md
+++ b/_docs/query-data/030-querying-hbase.md
@@ -3,84 +3,34 @@ title: "Querying HBase"
 parent: "Query Data"
 ---
 
-To use Drill to query HBase data, you need to understand how to work with the HBase byte
arrays. If you want Drill to interpret the underlying HBase row key as something other than
a byte array, you need to know the encoding of the data in HBase. By default, HBase stores
data in little endian and Drill assumes the data is little endian, which is unsorted. The
following table shows the sorting of typical row key IDs in bytes, encoded in little endian
and big endian, respectively:
+This section covers the following topics:
 
-| IDs in Byte Notation Little Endian Sorting | IDs in Decimal Notation | IDs in Byte Notation Big Endian Sorting | IDs in Decimal Notation |
-|--------------------------------------------|-------------------------|-----------------------------------------|-------------------------|
-| 0 x 010000 . . . 000                       | 1                       | 0 x 00000001                            | 1                       |
-| 0 x 010100 . . . 000                       | 17                      | 0 x 00000002                            | 2                       |
-| 0 x 020000 . . . 000                       | 2                       | 0 x 00000003                            | 3                       |
-| . . .                                      |                         | 0 x 00000004                            | 4                       |
-| 0 x 050000 . . . 000                       | 5                       | 0 x 00000005                            | 5                       |
-| . . .                                      |                         | . . .                                   |                         |
-| 0 x 0A0000 . . . 000                       | 10                      | 0 x 0000000A                            | 10                      |
-|                                            |                         | 0 x 00000101                            | 17                      |
+* [Tutorial--Querying HBase Data]({{site.baseurl}}/docs/querying-hbase/#tutorial-querying-hbase-data)
 
+  A simple tutorial that shows how to use Drill to query HBase data.  
+* [Working with HBase Byte Arrays]({{site.baseurl}}/docs/querying-hbase/#working-with-hbase-byte-arrays)
 
+  How to work with HBase byte arrays for serious applications.  
+* [Querying Big Endian-Encoded Data]({{site.baseurl}}/docs/querying-hbase/#querying-big-endian-encoded-data)
 
+  How to use optimization features in Drill 1.2 and later.  
+* [Leveraging HBase Ordered Byte Encoding]({{site.baseurl}}/docs/querying-hbase/#leveraging-hbase-ordered-byte-encoding)
 
+  How to use Drill 1.2 to leverage new features introduced by [HBASE-8201 Jira](https://issues.apache.org/jira/browse/HBASE-8201).
 
-## Querying Big Endian-Encoded Data
+## Tutorial--Querying HBase Data
 
-Drill optimizes scans of HBase tables when you use the ["CONVERT_TO and CONVERT_FROM data
types"]({{ site.baseurl }}/docs/supported-data-types/#convert_to-and-convert_from-data-types)
on big endian-encoded data. Drill provides the \*\_BE encoded types for use with CONVERT_TO
and CONVERT_FROM to take advantage of these optimizations. Here are a few examples of the
\*\_BE types.
+This tutorial shows how to connect Drill to an HBase data source, create simple HBase tables,
and query the data using Drill.
 
-* DATE_EPOCH_BE  
-* TIME_EPOCH_BE  
-* TIMESTAMP_EPOCH_BE  
-* UINT8_BE  
-* BIGINT_BE  
+----------
 
-For example, Drill returns results performantly when you use the following query on big endian-encoded
data:
+### Configure the HBase Storage Plugin
 
-```
-SELECT
-  CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'DATE_EPOCH_BE') d,
-  CONVERT_FROM(BYTE_SUBSTR(row_key, 9, 8), 'BIGINT_BE') id,
-  CONVERT_FROM(tableName.f.c, 'UTF8') 
-FROM hbase.`TestTableCompositeDate` tableName
-WHERE
-  CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'DATE_EPOCH_BE') < DATE '2015-06-18' AND
-  CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'DATE_EPOCH_BE') > DATE '2015-06-13';
-```
+To query an HBase data source using Drill, first configure the [HBase storage plugin]({{site.baseurl}}/docs/hbase-storage-plugin/)
for your environment. 
 
-This query assumes that the row key of the table represents the DATE_EPOCH type encoded in
big-endian format. The Drill HBase plugin will be able to prune the scan range since there
is a condition on the big endian-encoded prefix of the row key. For more examples, see the
[test code](https://github.com/apache/drill/blob/95623912ebf348962fe8a8846c5f47c5fdcf2f78/contrib/storage-hbase/src/test/java/org/apache/drill/hbase/TestHBaseFilterPushDown.java).
-
-To query HBase data:
-
-1. Connect the data source to Drill using the [HBase storage plugin]({{site.baseurl}}/docs/hbase-storage-plugin/).
 
-2. Determine the encoding of the HBase data you want to query. Ask the person in charge of
creating the data.  
-3. Based on the encoding type of the data, use the ["CONVERT_TO and CONVERT_FROM data types"]({{
site.baseurl }}/docs/supported-data-types/#convert_to-and-convert_from-data-types) to convert
HBase binary representations to an SQL type as you query the data.  
-    For example, use CONVERT_FROM in your Drill query to convert a big endian-encoded row
key to an SQL BIGINT type:  
-
-    `SELECT CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8),'BIGINT_BE’) FROM my_hbase_table;`
-
-The [BYTE_SUBSTR function]({{ site.baseurl }}/docs/string-manipulation/#byte_substr) separates
parts of a HBase composite key in this example. The Drill optimization is based on the capability
in Drill 1.2 and later to push conditional filters down to the storage layer when HBase data
is in big endian format. 
-
-Drill can performantly query HBase data that uses composite keys, as shown in the last example,
if only the first component of the composite is encoded in big endian format. If the HBase
row key is not stored in big endian, do not use the \*\_BE types. If you want to convert a
little endian byte array to integer, use BIGINT instead of BIGINT_BE, for example, as an argument
to CONVERT_FROM. 
-
-## Leveraging HBase Ordered Byte Encoding
-
-Drill 1.2 leverages new features introduced by [HBASE-8201 Jira](https://issues.apache.org/jira/browse/HBASE-8201)
that allows ordered byte encoding of different data types. This encoding scheme preserves
the sort order of the native data type when the data is stored as sorted byte arrays on disk.
Thus, Drill will be able to process data through the HBase storage plugin if the row keys
have been encoded in OrderedBytes format.
-
-To execute the following query, Drill prunes the scan range to only include the row keys
representing [-32,59] range, thus reducing the amount of data read.
-
-```
-SELECT
- CONVERT_FROM(t.row_key, 'INT_OB') rk,
- CONVERT_FROM(t.`f`.`c`, 'UTF8') val
-FROM
-  hbase.`TestTableIntOB` t
-WHERE
-  CONVERT_FROM(row_key, 'INT_OB') >= cast(-32 as INT) AND
-  CONVERT_FROM(row_key, 'INT_OB') < cast(59 as INT);
-```
-
-For more examples, see the [test code](https://github.com/apache/drill/blob/95623912ebf348962fe8a8846c5f47c5fdcf2f78/contrib/storage-hbase/src/test/java/org/apache/drill/hbase/TestHBaseFilterPushDown.java).
-
-By taking advantage of ordered byte encoding, Drill 1.2 and later can performantly execute
conditional queries without a secondary index on HBase big endian data. 
-
-## Querying Little Endian-Encoded Data
-As mentioned earlier, HBase stores data in little endian by default and Drill assumes the
data is encoded in little endian. This exercise involves working with data that is encoded
in little endian. First, you create two tables in HBase, students and clicks, that you can
query with Drill. You use the CONVERT_TO and CONVERT_FROM functions to convert binary text
to/from typed data. You use the CAST function to convert the binary data to an INT in step
4 of [Query HBase Tables]({{site.baseurl}}/docs/querying-hbase/#query-hbase-tables). When
converting an INT or BIGINT number, having a byte count in the destination/source that does
not match the byte count of the number in the binary source/destination, use CAST.
+----------
 
 ### Create the HBase tables
 
-To create the HBase tables and start Drill, complete the following
+You create two tables in HBase, students and clicks, that you can query with Drill. You use
the CONVERT_TO and CONVERT_FROM functions to convert binary text to and from typed data. You
use the CAST function to convert the binary data to an INT in step 4 of [Query HBase Tables]({{site.baseurl}}/docs/querying-hbase/#query-hbase-tables).
When the byte count of an INT or BIGINT value in the binary source or destination does not
match the byte count of the SQL type, use CAST.
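The byte-count caveat can be sketched in Python (illustrative only; the 8-byte value and struct formats are assumptions for the demo, not Drill internals):

```python
import struct

stored = struct.pack('>q', 100)  # a number stored in 8 bytes, big endian

# Reading it back with a matching 8-byte format recovers the value;
# a mismatched 4-byte format reads only the zero-filled high bytes.
assert struct.unpack('>q', stored)[0] == 100
assert struct.unpack('>i', stored[:4])[0] == 0
```

This is why a conversion must account for the actual byte width of the stored number; in Drill, CAST covers the mismatched case.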
+
+To create the HBase tables, complete the following
 steps:
 
 1. Pipe the following commands to the HBase shell to create students and clicks tables in
HBase:
@@ -90,7 +40,7 @@ steps:
 
 2. Issue the following command to create a `testdata.txt` file:
 
-      cat > testdata.txt
+    `cat > testdata.txt`
 
3. Copy and paste the following `put` commands on the line below the **cat** command. Press
Return, and then CTRL+D to close the file.
 
@@ -160,11 +110,18 @@ steps:
   
         cat testdata.txt | hbase shell
 
+----------
+
 ### Query HBase Tables
-1. Issue the following query to see the data in the students table:  
 
-       SELECT * FROM students;
-   The query returns results that are not useable. In the next step, you convert the data.
+[Start Drill]({{site.baseurl}}/docs/installing-drill-in-embedded-mode/) and complete the
following steps to query the HBase tables you created.
+
+1. Use the HBase storage plugin configuration.  
+    `USE hbase;`  
+2. Issue the following query to see the data in the students table:  
+    `SELECT * FROM students;`  
+    
+    The query returns results that are not usable. In the next step, you convert the data
from byte arrays to meaningful UTF8 types.
   
         +-------------+-----------------------+---------------------------------------------------------------------------+
         |  row_key    |  account              |                                address  
                                 |
@@ -176,7 +133,7 @@ steps:
         +-------------+-----------------------+---------------------------------------------------------------------------+
         4 rows selected (1.335 seconds)
 
-2. Issue the following query, that includes the CONVERT_FROM function, to convert the `students`
table to typed data:
+3. Issue the following query, which includes the CONVERT_FROM function, to convert the `students`
table to typed data:
 
          SELECT CONVERT_FROM(row_key, 'UTF8') AS studentid, 
                 CONVERT_FROM(students.account.name, 'UTF8') AS name, 
@@ -200,7 +157,7 @@ steps:
         +------------+------------+------------+------------------+------------+
         4 rows selected (0.504 seconds)
 
-3. Query the clicks table to see which students visited google.com:
+4. Query the clicks table to see which students visited google.com:
   
         SELECT CONVERT_FROM(row_key, 'UTF8') AS clickid, 
                CONVERT_FROM(clicks.clickinfo.studentid, 'UTF8') AS studentid, 
@@ -217,7 +174,7 @@ steps:
         +------------+------------+--------------------------+-----------------------+
         3 rows selected (0.294 seconds)
 
-4. Query the clicks table to get the studentid of the student having 100 items. Use CONVERT_FROM
to convert the textual studentid and itemtype data, but use CAST to convert the integer quantity.
+5. Query the clicks table to get the studentid of the student having 100 items. Use CONVERT_FROM
to convert the textual studentid and itemtype data, but use CAST to convert the integer quantity.
 
         SELECT CONVERT_FROM(tbl.clickinfo.studentid, 'UTF8') AS studentid, 
                CONVERT_FROM(tbl.iteminfo.itemtype, 'UTF8'), 
@@ -230,3 +187,83 @@ steps:
         | student2   | text       | 100        |
         +------------+------------+------------+
         1 row selected (0.656 seconds)
+
+
+## Working with HBase Byte Arrays
+
+The trivial example in the previous section queried little endian-encoded data in HBase.
For serious applications, you need to understand how to work with HBase byte arrays. If you
want Drill to interpret the underlying HBase row key as something other than a byte array,
you need to know the encoding of the data in HBase. By default, HBase stores data in little
endian, and Drill assumes the data is little endian; little endian byte arrays do not sort
in numeric order. The following table shows the sorting of typical row key IDs in bytes, encoded
in little endian and big endian, respectively:
+
+| IDs in Byte Notation Little Endian Sorting | IDs in Decimal Notation | IDs in Byte Notation Big Endian Sorting | IDs in Decimal Notation |
+|--------------------------------------------|-------------------------|-----------------------------------------|-------------------------|
+| 0 x 010000 . . . 000                       | 1                       | 0 x 00000001                            | 1                       |
+| 0 x 010100 . . . 000                       | 17                      | 0 x 00000002                            | 2                       |
+| 0 x 020000 . . . 000                       | 2                       | 0 x 00000003                            | 3                       |
+| . . .                                      |                         | 0 x 00000004                            | 4                       |
+| 0 x 050000 . . . 000                       | 5                       | 0 x 00000005                            | 5                       |
+| . . .                                      |                         | . . .                                   |                         |
+| 0 x 0A0000 . . . 000                       | 10                      | 0 x 0000000A                            | 10                      |
+|                                            |                         | 0 x 00000101                            | 17                      |
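The sorting behavior in the table can be sketched in Python (the sample IDs and the 4-byte width are illustrative assumptions, not part of Drill or HBase):

```python
import struct

ids = [1, 2, 256, 257]  # hypothetical row key IDs

# Encode each ID as a 4-byte integer in both byte orders, then sort the
# resulting byte arrays lexicographically, as HBase sorts row keys.
little = sorted(struct.pack('<I', n) for n in ids)
big = sorted(struct.pack('>I', n) for n in ids)

# Big endian byte arrays sort in numeric order; little endian ones do not.
print([struct.unpack('>I', b)[0] for b in big])     # [1, 2, 256, 257]
print([struct.unpack('<I', b)[0] for b in little])  # [256, 1, 257, 2]
```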
+
+## Querying Big Endian-Encoded Data
+
+Drill optimizes scans of HBase tables when you use the ["CONVERT_TO and CONVERT_FROM data
types"]({{ site.baseurl }}/docs/supported-data-types/#convert_to-and-convert_from-data-types)
on big endian-encoded data. Drill provides the \*\_BE encoded types for use with CONVERT_TO
and CONVERT_FROM to take advantage of these optimizations. Here are a few examples of the
\*\_BE types.
+
+* DATE_EPOCH_BE  
+* TIME_EPOCH_BE  
+* TIMESTAMP_EPOCH_BE  
+* UINT8_BE  
+* BIGINT_BE  
+
+For example, Drill returns results efficiently when you use the following query on big endian-encoded
data:
+
+```
+SELECT
+  CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'DATE_EPOCH_BE') d,
+  CONVERT_FROM(BYTE_SUBSTR(row_key, 9, 8), 'BIGINT_BE') id,
+  CONVERT_FROM(tableName.f.c, 'UTF8') 
+FROM hbase.`TestTableCompositeDate` tableName
+WHERE
+  CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'DATE_EPOCH_BE') < DATE '2015-06-18' AND
+  CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'DATE_EPOCH_BE') > DATE '2015-06-13';
+```
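A minimal Python sketch of the composite row key layout this query assumes (the `make_row_key` helper is hypothetical, not a Drill or HBase API: 8 bytes of big endian epoch millis followed by an 8-byte big endian id):

```python
import struct
from datetime import datetime, timezone

def make_row_key(year, month, day, record_id):
    # Hypothetical composite key: 8-byte big endian epoch millis + 8-byte big endian id.
    millis = int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)
    return struct.pack('>q', millis) + struct.pack('>q', record_id)

# Because the leading date component is big endian, lexicographic row key
# order matches date order, which is what lets the scan range be pruned
# on a date condition over the key prefix.
k1 = make_row_key(2015, 6, 14, 42)
k2 = make_row_key(2015, 6, 17, 7)
assert k1 < k2        # earlier date sorts first, regardless of the id
assert len(k1) == 16  # BYTE_SUBSTR(row_key, 1, 8) extracts the date prefix
```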
+
+This query assumes that the row key of the table represents the DATE_EPOCH type encoded in
big endian format. The Drill HBase plugin can prune the scan range because there is a condition
on the big endian-encoded prefix of the row key. For more examples, see the [test code](https://github.com/apache/drill/blob/95623912ebf348962fe8a8846c5f47c5fdcf2f78/contrib/storage-hbase/src/test/java/org/apache/drill/hbase/TestHBaseFilterPushDown.java).
+
+To query HBase data:
+
+1. Connect the data source to Drill using the [HBase storage plugin]({{site.baseurl}}/docs/hbase-storage-plugin/).
 
+
+    `USE hbase;`
+
+2. Determine the encoding of the HBase data you want to query. Ask the person in charge of
creating the data.  
+3. Based on the encoding type of the data, use the ["CONVERT_TO and CONVERT_FROM data types"]({{
site.baseurl }}/docs/supported-data-types/#convert_to-and-convert_from-data-types) to convert
HBase binary representations to an SQL type as you query the data.  
+    For example, use CONVERT_FROM in your Drill query to convert a big endian-encoded row
key to an SQL BIGINT type:  
+
+    `SELECT CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'BIGINT_BE') FROM my_hbase_table;`
+
+The [BYTE_SUBSTR function]({{ site.baseurl }}/docs/string-manipulation/#byte_substr) separates
parts of an HBase composite key in this example. The Drill optimization relies on the ability
of Drill 1.2 and later to push conditional filters down to the storage layer when HBase data
is in big endian format. 
+
+Drill can efficiently query HBase data that uses composite keys, as shown in the last example,
when the first component of the composite key is encoded in big endian format. If the HBase
row key is not stored in big endian, do not use the \*\_BE types. For example, to convert a
little endian byte array to an integer, pass BIGINT instead of BIGINT_BE as the second argument
to CONVERT_FROM. 
+
+## Leveraging HBase Ordered Byte Encoding
+
+Drill 1.2 leverages new features introduced by [HBASE-8201 Jira](https://issues.apache.org/jira/browse/HBASE-8201)
that allow ordered byte encoding of different data types. This encoding scheme preserves the
sort order of the native data type when the data is stored as sorted byte arrays on disk.
Thus, Drill can process data through the HBase storage plugin if the row keys have been encoded
in OrderedBytes format.
+
+When executing the following query, Drill prunes the scan range to include only the row keys
in the [-32, 59) range, reducing the amount of data read.
+
+```
+SELECT
+ CONVERT_FROM(t.row_key, 'INT_OB') rk,
+ CONVERT_FROM(t.`f`.`c`, 'UTF8') val
+FROM
+  hbase.`TestTableIntOB` t
+WHERE
+  CONVERT_FROM(row_key, 'INT_OB') >= cast(-32 as INT) AND
+  CONVERT_FROM(row_key, 'INT_OB') < cast(59 as INT);
+```
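The order-preserving idea behind such encodings can be sketched in Python (a sign-bit-flip scheme similar in spirit to, but not byte-compatible with, HBase OrderedBytes):

```python
import struct

def encode_int_ob(n):
    # Order-preserving encoding of a signed 32-bit int: flipping the sign
    # bit makes unsigned byte comparison match signed numeric order.
    return struct.pack('>I', (n + 2**31) & 0xFFFFFFFF)

def decode_int_ob(b):
    return struct.unpack('>I', b)[0] - 2**31

vals = [59, -32, 0, -1, 58]
decoded = [decode_int_ob(b) for b in sorted(encode_int_ob(v) for v in vals)]
print(decoded)  # [-32, -1, 0, 58, 59]: byte order matches numeric order
```

Because byte order matches numeric order, a range condition on the decoded value maps directly to a contiguous range of row keys, which is what enables scan-range pruning.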
+
+For more examples, see the [test code](https://github.com/apache/drill/blob/95623912ebf348962fe8a8846c5f47c5fdcf2f78/contrib/storage-hbase/src/test/java/org/apache/drill/hbase/TestHBaseFilterPushDown.java).
+
+By taking advantage of ordered byte encoding, Drill 1.2 and later can efficiently execute
conditional queries on big endian-encoded HBase data without a secondary index. 
+
+

http://git-wip-us.apache.org/repos/asf/drill/blob/297b4304/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/query-a-file-system/020-querying-parquet-files.md b/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
index d6b067f..b0228db 100644
--- a/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
+++ b/_docs/query-data/query-a-file-system/020-querying-parquet-files.md
@@ -5,7 +5,7 @@ parent: "Querying a File System"
 
 Drill 1.2 and later extends SQL for performant querying of a large number, thousands or more,
of Parquet files. By running the following command, you trigger the generation of metadata
files in the directory of Parquet files and its subdirectories:
 
-    REFRESH TABLE METADATA <path to table>
+`REFRESH TABLE METADATA <path to table>`
 
 You need to run the command on a file or directory only once during the session. Subsequent
queries return results quickly because Drill refers to the metadata saved in the cache, as
described in [Reading Parquet Files]({{site.baseurl}}/docs/parquet-format/#reading-parquet-files).

 

http://git-wip-us.apache.org/repos/asf/drill/blob/297b4304/_docs/sql-reference/090-sql-extensions.md
----------------------------------------------------------------------
diff --git a/_docs/sql-reference/090-sql-extensions.md b/_docs/sql-reference/090-sql-extensions.md
index 90cfed7..31f594e 100644
--- a/_docs/sql-reference/090-sql-extensions.md
+++ b/_docs/sql-reference/090-sql-extensions.md
@@ -2,12 +2,20 @@
 title: "SQL Extensions"
 parent: "SQL Reference"
 ---
-Drill extends SQL to work with Hadoop-scale data and to explore smaller-scale data in ways
not possible with SQL. Using intuitive SQL extensions you work with self-describing data and
complex data types. Extensions to SQL include capabilities for exploring self-describing data,
such as files and HBase, directly in the native format.
+Drill extends SQL to generate Parquet metadata, work with Hadoop-scale data, and explore
smaller-scale data in ways not possible with SQL. Using intuitive SQL extensions, you can work
with self-describing data and complex data types. Extensions to SQL include capabilities for
exploring self-describing data, such as files and HBase, directly in the native format.
 
 Drill provides language support for pointing to [storage plugin]({{site.baseurl}}/docs/connect-a-data-source-introduction)
interfaces that Drill uses to interact with data sources. Use the name of a storage plugin
to specify a file system *database* as a prefix in queries when you refer to objects across
databases. Query files, including compressed .gz files, and [directories]({{ site.baseurl
}}/docs/querying-directories), as you would query an SQL table. You can query multiple files
in a directory.
 
 Drill extends the SELECT statement for reading complex, multi-structured data. The extended
CREATE TABLE AS provides the capability to write data of complex/multi-structured data types.
Drill extends the [lexical rules](http://drill.apache.org/docs/lexical-structure) for working
with files and directories, such as using back ticks for including file names, directory names,
and reserved words in queries. Drill syntax supports using the file system as a persistent
store for query profiles and diagnostic information.
 
+## Extension for Generating Parquet Metadata
+
+To speed querying of Parquet files, you can [generate metadata]({{site.baseurl}}/docs/querying-parquet-files/)
in Drill 1.2 and later. Running the following command triggers the generation of metadata
files in a directory of Parquet files and its subdirectories:
+
+`REFRESH TABLE METADATA <path to table>`
+
+Drill takes advantage of metadata, such as the Hive metadata store, and generates a [metadata
cache]({{site.baseurl}}/docs/parquet-format/#caching-metadata). Using metadata can improve
performance of queries on a large number of files. 
+
 ## Extensions for Hive- and HBase-related Data Sources
 
 Drill supports Hive and HBase as a plug-and-play data source. Drill can read tables created
in Hive that use [data types compatible]({{ site.baseurl }}/docs/hive-to-drill-data-type-mapping)
with Drill.  You can query Hive tables without modifications. You can query self-describing
data without requiring metadata definitions in the Hive metastore. Primitives, such as JOIN,
support columnar operation. 

