drill-commits mailing list archives

From a..@apache.org
Subject [01/13] drill git commit: DRILL-2315: Confluence conversion plus fixes
Date Thu, 26 Feb 2015 01:01:10 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages-master 23f82db9f -> d959a2100


http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/005-lesson3.md
----------------------------------------------------------------------
diff --git a/_docs/tutorial/005-lesson3.md b/_docs/tutorial/005-lesson3.md
new file mode 100644
index 0000000..f6c7ae4
--- /dev/null
+++ b/_docs/tutorial/005-lesson3.md
@@ -0,0 +1,379 @@
+---
+title: "Lesson 3: Run Queries on Complex Data Types"
+parent: "Apache Drill Tutorial"
+---
+## Goal
+
+This lesson focuses on queries that exercise functions and operators on self-
+describing data and complex data types. Drill provides intuitive SQL extensions
+for working with such data and delivers high query performance with an
+architecture built from the ground up for complex data.
+
+## Queries in This Lesson
+
+Now that you have run ANSI SQL queries against different tables and files with
+relational data, you can try some examples including complex types.
+
+  * Access directories and subdirectories of files in a single SELECT statement.
+  * Demonstrate simple ways to access complex data in JSON files.
+  * Demonstrate the repeated_count function to aggregate values in an array.
+
+## Query Partitioned Directories
+
+You can use special variables in Drill to refer to subdirectories in your
+workspace path:
+
+  * dir0
+  * dir1
+  * …
+
+Note that these variables are dynamically determined based on the partitioning
+of the file system. No up-front definitions are required on what partitions
+exist. Here is a visual example of how this works:
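+
+For example, in the sample logs workspace used below, the directory tree maps
+to the variables roughly like this (an illustrative sketch of the layout):
+
+    logs
+    |-- 2013        <- dir0
+    |   |-- 8       <- dir1
+    |   `-- 11
+    `-- 2014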
+
+![drill query flow]({{ site.baseurl }}/docs/img/example_query.png)
+
+### Set workspace to dfs.logs:
+
+    0: jdbc:drill:> use dfs.logs;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'dfs.logs' |
+    +------------+------------+
+
+### Query logs data for a specific year:
+
+    0: jdbc:drill:> select * from logs where dir0='2013' limit 10;
+    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-----------+------------+
+    | dir0 | dir1 | trans_id | date | time | cust_id | device | state | camp_id | keywords | prod_id | purch_flag |
+    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-----------+------------+
+    | 2013 | 11 | 12119 | 11/09/2013 | 02:24:51 | 262 | IOS5 | ny | 0 | chamber | 198 | false |
+    | 2013 | 11 | 12120 | 11/19/2013 | 09:37:43 | 0 | AOS4.4 | il | 2 | outside | 511 | false |
+    | 2013 | 11 | 12134 | 11/10/2013 | 23:42:47 | 60343 | IOS5 | ma | 4 | and | 421 | false |
+    | 2013 | 11 | 12135 | 11/16/2013 | 01:42:13 | 46762 | AOS4.3 | ca | 4 | here's | 349 | false |
+    | 2013 | 11 | 12165 | 11/26/2013 | 21:58:09 | 41987 | AOS4.2 | mn | 4 | he | 271 | false |
+    | 2013 | 11 | 12168 | 11/09/2013 | 23:41:48 | 8600 | IOS5 | in | 6 | i | 459 | false |
+    | 2013 | 11 | 12196 | 11/20/2013 | 02:23:06 | 15603 | IOS5 | tn | 1 | like | 324 | false |
+    | 2013 | 11 | 12203 | 11/25/2013 | 23:50:29 | 221 | IOS6 | tx | 10 | if | 323 | false |
+    | 2013 | 11 | 12206 | 11/09/2013 | 23:53:01 | 2488 | AOS4.2 | tx | 14 | unlike | 296 | false |
+    | 2013 | 11 | 12217 | 11/06/2013 | 23:51:56 | 0 | AOS4.2 | tx | 9 | can't | 54 | false |
+    +------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
+
+
+This query constrains the scan to files inside the subdirectory named 2013. The
+variable dir0 refers to the first level down from logs, dir1 to the next level,
+and so on. This query returned 10 rows for November 2013 (dir1 = 11).
+
+### Further constrain the results using multiple predicates in the query:
+
+This query returns a list of customer IDs for people who made a purchase via
+an IOS5 device in August 2013.
+
+    0: jdbc:drill:> select dir0 as yr, dir1 as mth, cust_id from logs
+    where dir0='2013' and dir1='8' and device='IOS5' and purch_flag='true'
+    order by `date`;
+    +------------+------------+------------+
+    | yr | mth | cust_id |
+    +------------+------------+------------+
+    | 2013 | 8 | 4 |
+    | 2013 | 8 | 521 |
+    | 2013 | 8 | 1 |
+    | 2013 | 8 | 2 |
+    | 2013 | 8 | 4 |
+    | 2013 | 8 | 549 |
+    | 2013 | 8 | 72827 |
+    | 2013 | 8 | 38127 |
+    ...
+
+### Return monthly counts per customer for a given year:
+
+    0: jdbc:drill:> select cust_id, dir1 month_no, count(*) month_count from logs
+    where dir0=2014 group by cust_id, dir1 order by cust_id, month_no limit 10;
+    +------------+------------+-------------+
+    |  cust_id   |  month_no  | month_count |
+    +------------+------------+-------------+
+    | 0          | 1          | 143         |
+    | 0          | 2          | 118         |
+    | 0          | 3          | 117         |
+    | 0          | 4          | 115         |
+    | 0          | 5          | 137         |
+    | 0          | 6          | 117         |
+    | 0          | 7          | 142         |
+    | 0          | 8          | 19          |
+    | 1          | 1          | 66          |
+    | 1          | 2          | 59          |
+    +------------+------------+-------------+
+    10 rows selected
+
+This query groups the aggregate function by customer ID and month for one
+year: 2014.
+
+## Query Complex Data
+
+Drill provides some specialized operators and functions that you can use to
+analyze nested data natively without transformation. If you are familiar with
+JavaScript notation, you will already know how some of these extensions work.
+
+### Set the workspace to dfs.clicks:
+
+    0: jdbc:drill:> use dfs.clicks;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'dfs.clicks' |
+    +------------+------------+
+
+### Explore clickstream data:
+
+Note that the user_info and trans_info columns contain nested data: arrays and
+arrays within arrays. The following queries show how to access this complex
+data.
+
+    0: jdbc:drill:> select * from `clicks/clicks.json` limit 5;
+    +------------+------------+------------+------------+------------+
+    | trans_id | date | time | user_info | trans_info |
+    +------------+------------+------------+------------+------------+
+    | 31920 | 2014-04-26 | 12:17:12 | {"cust_id":22526,"device":"IOS5","state":"il"} | {"prod_id":[174,2],"purch_flag":"false"} |
+    | 31026 | 2014-04-20 | 13:50:29 | {"cust_id":16368,"device":"AOS4.2","state":"nc"} | {"prod_id":[],"purch_flag":"false"} |
+    | 33848 | 2014-04-10 | 04:44:42 | {"cust_id":21449,"device":"IOS6","state":"oh"} | {"prod_id":[582],"purch_flag":"false"} |
+    | 32383 | 2014-04-18 | 06:27:47 | {"cust_id":20323,"device":"IOS5","state":"oh"} | {"prod_id":[710,47],"purch_flag":"false"} |
+    | 32359 | 2014-04-19 | 23:13:25 | {"cust_id":15360,"device":"IOS5","state":"ca"} | {"prod_id":[0,8,170,173,1,124,46,764,30,711,0,3,25],"purch_flag":"true"} |
+    +------------+------------+------------+------------+------------+
+
+
+### Unpack the user_info column:
+
+    0: jdbc:drill:> select t.user_info.cust_id as custid, t.user_info.device as device,
+    t.user_info.state as state
+    from `clicks/clicks.json` t limit 5;
+    +------------+------------+------------+
+    | custid | device | state |
+    +------------+------------+------------+
+    | 22526 | IOS5 | il |
+    | 16368 | AOS4.2 | nc |
+    | 21449 | IOS6 | oh |
+    | 20323 | IOS5 | oh |
+    | 15360 | IOS5 | ca |
+    +------------+------------+------------+
+
+This query uses a simple table.column.column notation to extract nested column
+data. For example:
+
+    t.user_info.cust_id
+
+where `t` is the table alias provided in the query, `user_info` is a top-level
+column name, and `cust_id` is a nested column name.
+
+The table alias is required; otherwise column names such as `user_info` are
+parsed as table names by the SQL parser.
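+
+For example, omitting the alias makes the parser read `user_info` as a table
+name, so the first query below fails while the aliased form works (a sketch):
+
+    -- Fails: user_info is parsed as a table name
+    select user_info.cust_id from `clicks/clicks.json`;
+    -- Works: the alias t qualifies the nested column
+    select t.user_info.cust_id from `clicks/clicks.json` t;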
+
+### Unpack the trans_info column:
+
+    0: jdbc:drill:> select t.trans_info.prod_id as prodid, t.trans_info.purch_flag as
+    purchased
+    from `clicks/clicks.json` t limit 5;
+    +------------+------------+
+    | prodid | purchased |
+    +------------+------------+
+    | [174,2] | false |
+    | [] | false |
+    | [582] | false |
+    | [710,47] | false |
+    | [0,8,170,173,1,124,46,764,30,711,0,3,25] | true |
+    +------------+------------+
+    5 rows selected
+
+Note that this result reveals that the prod_id column contains an array of IDs
+(one or more product ID values per row, separated by commas). The next step
+shows how to access this kind of data.
+
+## Query Arrays
+
+Now use the [n] notation, where n is the position of the value in an array,
+starting from position 0 (not 1) for the first value. You can use this
+notation to write interesting queries against nested array data.
+
+For example:
+
+    trans_info.prod_id[0]
+
+refers to the first value in the nested prod_id column and
+
+    trans_info.prod_id[20]
+
+refers to the 21st value, assuming one exists.
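+
+An index past the end of a given row's array evaluates to null for that row
+(the empty prod_id array in the next query returns null for position 0), so
+you can filter with IS NOT NULL to keep only rows whose arrays are long
+enough. For example (a sketch):
+
+    select t.trans_info.prod_id[5] from `clicks/clicks.json` t;
+    -- returns null for rows whose array has five or fewer elements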
+
+### Find the first product that is searched for in each transaction:
+
+    0: jdbc:drill:> select t.trans_id, t.trans_info.prod_id[0] from `clicks/clicks.json` t limit 5;
+    +------------+------------+
+    |  trans_id  |   EXPR$1   |
+    +------------+------------+
+    | 31920      | 174        |
+    | 31026      | null       |
+    | 33848      | 582        |
+    | 32383      | 710        |
+    | 32359      | 0          |
+    +------------+------------+
+    5 rows selected
+
+### For which transactions did customers search on at least 21 products?
+
+    0: jdbc:drill:> select t.trans_id, t.trans_info.prod_id[20]
+    from `clicks/clicks.json` t
+    where t.trans_info.prod_id[20] is not null
+    order by trans_id limit 5;
+    +------------+------------+
+    |  trans_id  |   EXPR$1   |
+    +------------+------------+
+    | 10328      | 0          |
+    | 10380      | 23         |
+    | 10701      | 1          |
+    | 11100      | 0          |
+    | 11219      | 46         |
+    +------------+------------+
+    5 rows selected
+
+This query returns transaction IDs and product IDs for records that contain a
+non-null product ID at the 21st position in the array.
+
+### Return clicks for a specific product range:
+
+    0: jdbc:drill:> select * from (select t.trans_id, t.trans_info.prod_id[0] as prodid,
+    t.trans_info.purch_flag as purchased
+    from `clicks/clicks.json` t) sq
+    where sq.prodid between 700 and 750 and sq.purchased='true'
+    order by sq.prodid;
+    +------------+------------+------------+
+    | trans_id | prodid | purchased |
+    +------------+------------+------------+
+    | 21886 | 704 | true |
+    | 20674 | 708 | true |
+    | 22158 | 709 | true |
+    | 34089 | 714 | true |
+    | 22545 | 714 | true |
+    | 37500 | 717 | true |
+    | 36595 | 718 | true |
+    ...
+
+This query assumes that there is some meaning to the array (that it is an
+ordered list of products purchased rather than a random list).
+
+## Perform Operations on Arrays
+
+### Rank successful click conversions and count product searches for each session:
+
+    0: jdbc:drill:> select t.trans_id, t.`date` as session_date, t.user_info.cust_id as
+    cust_id, t.user_info.device as device, repeated_count(t.trans_info.prod_id) as
+    prod_count, t.trans_info.purch_flag as purch_flag
+    from `clicks/clicks.json` t
+    where t.trans_info.purch_flag = 'true' order by prod_count desc;
+    +------------+--------------+------------+------------+------------+------------+
+    | trans_id | session_date | cust_id | device | prod_count | purch_flag |
+    +------------+--------------+------------+------------+------------+------------+
+    | 37426 | 2014-04-06 | 18709 | IOS5 | 34 | true |
+    | 31589 | 2014-04-16 | 18576 | IOS6 | 31 | true |
+    | 11600 | 2014-04-07 | 4260 | AOS4.2 | 28 | true |
+    | 35074 | 2014-04-03 | 16697 | AOS4.3 | 27 | true |
+    | 17192 | 2014-04-22 | 2501 | AOS4.2 | 26 | true |
+    ...
+
+This query uses an SQL extension, the repeated_count function, to get an
+aggregated count of the array values. The query returns the number of products
+searched for each session that converted into a purchase and ranks the counts
+in descending order. Only clicks that have resulted in a purchase are counted.
+  
+## Store a Result Set in a Table for Reuse and Analysis
+
+Finally, run a correlated subquery that returns a fairly large result set. To
+facilitate additional analysis on this result set, you can easily and quickly
+create a Drill table from the results of the query.
+
+### Continue to use the dfs.clicks workspace
+
+    0: jdbc:drill:> use dfs.clicks;
+    +------------+------------+
+    | ok | summary |
+    +------------+------------+
+    | true | Default schema changed to 'dfs.clicks' |
+    +------------+------------+
+
+### Return product searches for high-value customers:
+
+    0: jdbc:drill:> select o.cust_id, o.order_total, t.trans_info.prod_id[0] as prod_id
+    from hive.orders as o, `clicks/clicks.json` t 
+    where o.cust_id=t.user_info.cust_id 
+    and o.order_total > (select avg(inord.order_total) 
+    from hive.orders inord where inord.state = o.state);
+    +------------+-------------+------------+
+    |  cust_id   | order_total |   prod_id   |
+    +------------+-------------+------------+
+    ...
+    | 9650       | 69          | 16         |
+    | 9650       | 69          | 560        |
+    | 9650       | 69          | 959        |
+    | 9654       | 76          | 768        |
+    | 9656       | 76          | 32         |
+    | 9656       | 76          | 16         |
+    ...
+    +------------+-------------+------------+
+    106,281 rows selected
+
+This query returns a list of products that are being searched for by customers
+who have made transactions that are above the average in their states.
+
+### Materialize the result of the previous query:
+
+    0: jdbc:drill:> create table product_search as select o.cust_id, o.order_total, t.trans_info.prod_id[0] as prod_id
+    from hive.orders as o, `clicks/clicks.json` t 
+    where o.cust_id=t.user_info.cust_id and o.order_total > (select avg(inord.order_total)
+    from hive.orders inord where inord.state = o.state);
+    +------------+---------------------------+
+    |  Fragment  | Number of records written |
+    +------------+---------------------------+
+    | 0_0        | 106281                    |
+    +------------+---------------------------+
+    1 row selected
+
+This example uses a CTAS statement to create a table based on the correlated
+subquery that you ran previously. The table contains all 106,281 rows that the
+query returns and stores them in the format specified by the storage plugin
+(Parquet in this example). You can create tables that store data in CSV,
+Parquet, and JSON formats.
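+
+The output format that CTAS writes is controlled by the `store.format` session
+option, which defaults to Parquet. To write CSV instead, you could change the
+option before running the CTAS statement (a sketch; run in the same session):
+
+    0: jdbc:drill:> alter session set `store.format` = 'csv';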
+
+### Query the new table to verify the row count:
+
+This example simply checks that the CTAS statement worked by verifying the
+number of rows in the table.
+
+    0: jdbc:drill:> select count(*) from product_search;
+    +------------+
+    |   EXPR$0   |
+    +------------+
+    | 106281     |
+    +------------+
+    1 row selected
+
+### Find the storage file for the table:
+
+    [root@maprdemo product_search]# cd /mapr/demo.mapr.com/data/nested/product_search
+    [root@maprdemo product_search]# ls -la
+    total 451
+    drwxr-xr-x. 2 mapr mapr      1 Sep 15 13:41 .
+    drwxr-xr-x. 4 root root      2 Sep 15 13:41 ..
+    -rwxr-xr-x. 1 mapr mapr 460715 Sep 15 13:41 0_0_0.parquet
+
+Note that the table is stored in a file called `0_0_0.parquet`. This file is
+stored in the location defined by the dfs.clicks workspace:
+
+    "location": "http://demo.mapr.com/data/nested"
+
+with a subdirectory that has the same name as the table you created.
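+
+For reference, a writable dfs workspace of this kind is defined in the storage
+plugin configuration roughly as follows (a sketch; the path shown is
+illustrative):
+
+    "workspaces": {
+      "clicks": {
+        "location": "/mapr/demo.mapr.com/data/nested",
+        "writable": true,
+        "defaultInputFormat": null
+      }
+    }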
+
+## What's Next
+
+Complete the tutorial with the [Summary](/drill/docs/summary).
+
+
+

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/006-summary.md
----------------------------------------------------------------------
diff --git a/_docs/tutorial/006-summary.md b/_docs/tutorial/006-summary.md
new file mode 100644
index 0000000..552d72f
--- /dev/null
+++ b/_docs/tutorial/006-summary.md
@@ -0,0 +1,13 @@
+---
+title: "Summary"
+parent: "Apache Drill Tutorial"
+---
+This tutorial introduced Apache Drill and its ability to run ANSI SQL queries
+against various data sources, including Hive tables, MapR-DB/HBase tables, and
+file system directories. The tutorial also showed how to work with and
+manipulate complex and multi-structured data commonly found in Hadoop/NoSQL
+systems.
+
+Now that you are familiar with different ways to access the sample data with
+Drill, you can try writing your own queries against your own data sources.
+Refer to the [Apache Drill documentation](/drill/docs) for more information.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/install-sandbox/001-install-mapr-vm.md
----------------------------------------------------------------------
diff --git a/_docs/tutorial/install-sandbox/001-install-mapr-vm.md b/_docs/tutorial/install-sandbox/001-install-mapr-vm.md
new file mode 100644
index 0000000..9c0e19e
--- /dev/null
+++ b/_docs/tutorial/install-sandbox/001-install-mapr-vm.md
@@ -0,0 +1,49 @@
+---
+title: "Installing the MapR Sandbox with Apache Drill on VMware Player/VMware Fusion"
+parent: "Installing the Apache Drill Sandbox"
+---
+Complete the following steps to install the MapR Sandbox with Apache Drill on
+VMware Player or VMware Fusion:
+
+1. Download the MapR Sandbox with Drill file to a directory on your machine:  
+   <https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill>
+2. Open the virtual machine player, and select the **Open a Virtual Machine** option.  
+  
+    **Tip for VMware Fusion**  
+
+    If you are running VMware Fusion, select **Import**.  
+
+    ![drill query flow]({{ site.baseurl }}/docs/img/vmWelcome.png)
+3. Navigate to the directory where you downloaded the MapR Sandbox with Apache Drill file, and select `MapR-Sandbox-For-Apache-Drill-4.0.1_VM.ova`.
+
+    ![drill query flow]({{ site.baseurl }}/docs/img/vmShare.png)
+
+    The Import Virtual Machine dialog appears.
+4. Click **Import**. The virtual machine player imports the sandbox.
+
+    ![drill query flow]({{ site.baseurl }}/docs/img/vmLibrary.png)
+5. Select `MapR-Sandbox-For-Apache-Drill-4.0.1_VM`, and click **Play virtual machine**. It takes a few minutes for the MapR services to start.  
+
+     After the MapR services start and installation completes, the following screen
+appears:
+
+     ![drill query flow]({{ site.baseurl }}/docs/img/loginSandBox.png)
+
+     Note the URL provided in the screen, which corresponds to the Web UI in Apache
+Drill.
+6. Verify that a DNS entry was created on the host machine for the virtual machine. If not, create the entry.
+    * For Linux and Mac, create the entry in `/etc/hosts`.  
+    * For Windows, create the entry in the `%WINDIR%\system32\drivers\etc\hosts` file.  
+
+    For example: `127.0.1.1 <vm_hostname>`
+
+7. You can navigate to the URL provided to experience the Drill Web UI, or you can log in to the sandbox through the command line.  
+
+    a. To navigate to the MapR Sandbox with Apache Drill, enter the provided URL in your browser's address bar.  
+    b. To log in to the virtual machine and access the command line, press Alt+F2 on Windows or Option+F5 on Mac. When prompted, enter `mapr` as the login and password.
+
+## What's Next
+
+After downloading and installing the sandbox, continue with the tutorial by
+[Getting to Know the Drill
+Sandbox](/drill/docs/getting-to-know-the-drill-sandbox).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/tutorial/install-sandbox/002-install-mapr-vb.md
----------------------------------------------------------------------
diff --git a/_docs/tutorial/install-sandbox/002-install-mapr-vb.md b/_docs/tutorial/install-sandbox/002-install-mapr-vb.md
new file mode 100644
index 0000000..36b10c0
--- /dev/null
+++ b/_docs/tutorial/install-sandbox/002-install-mapr-vb.md
@@ -0,0 +1,64 @@
+---
+title: "Installing the MapR Sandbox with Apache Drill on VirtualBox"
+parent: "Installing the Apache Drill Sandbox"
+---
+The MapR Sandbox for Apache Drill on VirtualBox comes with NAT port forwarding
+enabled, which allows you to access the sandbox using localhost as hostname.
+
+Complete the following steps to install the MapR Sandbox with Apache Drill on
+VirtualBox:
+
+1. Download the MapR Sandbox with Apache Drill file to a directory on your machine:   
+<https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill>
+2. Open the virtual machine player.
+3. Select **File > Import Appliance**. The Import Virtual Appliance dialog appears.
+
+     ![drill query flow]({{ site.baseurl }}/docs/img/vbImport.png)
+4. Navigate to the directory where you downloaded the MapR Sandbox with Apache Drill and click **Next**. The Appliance Settings window appears.
+
+     ![drill query flow]({{ site.baseurl }}/docs/img/vbApplSettings.png)
+5. Select the check box at the bottom of the screen: **Reinitialize the MAC address of all network cards**, then click **Import**. VirtualBox imports the sandbox.
+6. When the import completes, select **File > Preferences**. The VirtualBox - Settings dialog appears.
+
+     ![drill query flow]({{ site.baseurl }}/docs/img/vbNetwork.png)
+7. Select **Network**.  
+
+    The correct setting depends on your network connectivity when you run the
+Sandbox. In general, if you are going to use a wired Ethernet connection,
+select **NAT Networks** and **vboxnet0**. If you are going to use a wireless
+network, select **Host-only Networks** and the **VirtualBox Host-Only Ethernet
+Adapter**. If no adapters appear, click the green **+** button to add the
+VirtualBox adapter.
+
+     ![drill query flow]({{ site.baseurl }}/docs/img/vbMaprSetting.png)
+8. Click **OK** to continue.
+9. Click **Settings**.
+
+    ![settings icon]({{ site.baseurl }}/docs/img/settings.png)  
+   The MapR-Sandbox-For-Apache-Drill-0.6.0-r2-4.0.1 - Settings dialog appears.
+     
+     ![drill query flow]({{ site.baseurl }}/docs/img/vbGenSettings.png)    
+10. Click **OK** to continue.
+11. Click **Start**. It takes a few minutes for the MapR services to start.   
+ 
+      After the MapR services start and installation completes, the following screen appears:
+      
+       ![drill query flow]({{ site.baseurl }}/docs/img/vbloginSandBox.png)
+12. The client must be able to resolve the actual hostname of the Drill node(s) with the IP(s). Verify that a DNS entry was created on the client machine for the Drill node(s).  
+ 
+     If a DNS entry does not exist, create the entry for the Drill node(s).
+     * For Windows, create the entry in the `%WINDIR%\system32\drivers\etc\hosts` file.
+     * For Linux and Mac, create the entry in `/etc/hosts`.  
+
+     `<drill-machine-IP> <drill-machine-hostname>`
+
+     Example: `127.0.1.1 maprdemo`
+13. You can navigate to the URL provided or to [localhost:8047](http://localhost:8047) to experience the Drill Web UI, or you can log into the sandbox through the command line.  
+
+    a. To navigate to the MapR Sandbox with Apache Drill, enter the provided URL in your browser's address bar.  
+    b. To log into the virtual machine and access the command line, press Alt+F2 on Windows or Option+F5 on Mac. When prompted, enter `mapr` as the login and password.
+
+## What's Next
+
+After downloading and installing the sandbox, continue with the tutorial by
+[Getting to Know the Drill
+Sandbox](/drill/docs/getting-to-know-the-drill-sandbox).
\ No newline at end of file

