drill-commits mailing list archives

From tshi...@apache.org
Subject [01/26] drill git commit: exhume Basics Tutorial to address user question
Date Sat, 30 May 2015 05:03:27 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages 446d71c24 -> a6822dc4f


exhume Basics Tutorial to address user question


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/55f25490
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/55f25490
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/55f25490

Branch: refs/heads/gh-pages
Commit: 55f2549073de462f3dd507e904b45dc4a110a287
Parents: 45c29be
Author: Kristine Hahn <khahn@maprtech.com>
Authored: Mon May 25 12:21:22 2015 -0700
Committer: Kristine Hahn <khahn@maprtech.com>
Committed: Mon May 25 12:21:22 2015 -0700

----------------------------------------------------------------------
 .../030-querying-plain-text-files.md            | 188 ++++++++++++++++++-
 .../040-querying-directories.md                 |  34 ++++
 .../030-date-time-functions-and-arithmetic.md   |   2 +-
 _docs/tutorials/020-drill-in-10-minutes.md      |   2 +-
 4 files changed, 219 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/55f25490/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
index 8924835..ab73c57 100644
--- a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
+++ b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
@@ -2,16 +2,20 @@
 title: "Querying Plain Text Files"
 parent: "Querying a File System"
 ---
-You can use Drill to access both structured file types and plain text files
-(flat files). This section shows a few simple examples that work on flat
-files:
+You can use Drill to access structured file types and plain text files
+(flat files), such as the following file types:
 
   * CSV files (comma-separated values)
   * TSV files (tab-separated values)
   * PSV files (pipe-separated values)
 
-The examples here show CSV files, but queries against TSV and PSV files return
-equivalent results. However, make sure that your registered storage plugins
+Follow these general guidelines for querying a plain text file:
+
+  * Use a storage plugin that defines the file format of the data in the plain text file, such as comma-separated values (CSV) or tab-separated values (TSV).
+  * In the SELECT statement, use the `COLUMNS[n]` syntax in lieu of column names, which do not exist in a plain text file. The first column is column `0`.
+  * In the FROM clause, use the path to the plain text file instead of using a table name. Enclose the path and file name in back ticks. 
+
+Make sure that your registered storage plugins
 recognize the appropriate file types and extensions. For example, the
 following configuration expects PSV files (files with a pipe delimiter) to
 have a `tbl` extension, not a `psv` extension. Drill returns a "file not
@@ -117,3 +121,177 @@ example:
 Note that the restriction with the use of aliases applies to queries against
 all data sources.
 
+## Example of Querying a TSV File
+
+This example uses a tab-separated value (TSV) file that you download from a
+Google website. The file contains phrases (Ngrams) that Google extracts from
+the books it scans for its [Google Books Ngram
+Viewer](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html). You
+use the data to find the relative frequencies of Ngrams. 
+
+### About the Data
+
+Each line in the TSV file has the following structure:
+
+`ngram TAB year TAB match_count TAB volume_count NEWLINE`
+
+For example, lines 1722089 and 1722090 in the file contain this data:
+
+<table><tbody>
+<tr><th>ngram</th><th>year</th><th>match_count</th><th>volume_count</th></tr>
+<tr><td><p class="p1">Zoological Journal of the Linnean</p></td><td>2007</td><td>284</td><td>101</td></tr>
+<tr><td><p class="p1">Zoological Journal of the Linnean</p></td><td>2008</td><td>257</td><td>87</td></tr>
+</tbody></table>

+  
+In 2007, "Zoological Journal of the Linnean" occurred 284 times overall in 101
+distinct books of the Google sample.
+
+### Download and Set Up the Data
+
+After downloading the file, you use the `dfs` storage plugin, and then select
+data from the file as you would from a table. In the FROM clause of the SELECT
+statement, enclose the path and name of the file in back ticks.
+
+  1. Download the compressed Google Ngram data from this location:  
+    
+     http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-5gram-20120701-zo.gz
+
+  2. Unzip the file.  
+     A file named googlebooks-eng-all-5gram-20120701-zo appears.
+
+  3. Change the file name to add a `.tsv` extension.  
+     The Drill `dfs` storage plugin definition includes a TSV format that requires
+     a file to have this extension.
+
+### Query the Data
+
+Get data about "Zoological Journal of the Linnean" that appears more than 250
+times a year in the books that Google scans.
+
+  1. Switch back to using the `dfs` storage plugin.
+  
+          USE dfs;
+
+  2. Issue a SELECT statement to get the first three columns in the file.  
+     * In the FROM clause of the example, substitute your path to the TSV file.  
+     * Use aliases to replace the default column headers, such as EXPR$0, with user-friendly column headers: Ngram, Publication_Date, and Frequency.
+     * In the WHERE clause, enclose the string literal "Zoological Journal of the Linnean" in single quotation marks.  
+     * Limit the output to 10 rows.  
+  
+         SELECT COLUMNS[0] AS Ngram,
+                COLUMNS[1] AS Publication_Date,
+                COLUMNS[2] AS Frequency
+         FROM `/Users/drilluser/Downloads/googlebooks-eng-all-5gram-20120701-zo.tsv`
+         WHERE ((COLUMNS[0] = 'Zoological Journal of the Linnean')
+             AND (COLUMNS[2] > 250)) LIMIT 10;
+
+     The output is:
+
+         +------------------------------------+-------------------+------------+
+         |               Ngram                | Publication_Date  | Frequency  |
+         +------------------------------------+-------------------+------------+
+         | Zoological Journal of the Linnean  | 1993              | 297        |
+         | Zoological Journal of the Linnean  | 1997              | 255        |
+         | Zoological Journal of the Linnean  | 2003              | 254        |
+         | Zoological Journal of the Linnean  | 2007              | 284        |
+         | Zoological Journal of the Linnean  | 2008              | 257        |
+         +------------------------------------+-------------------+------------+
+         5 rows selected (1.175 seconds)
+
+The Drill default storage plugins support common file formats. If you need
+support for some other file format, such as GZ, create a custom storage plugin. You can also create a storage plugin to simplify querying files that have long path names. A workspace name replaces the long path name.
+
+
+## Create a Storage Plugin
+
+This example covers how to create and use a storage plugin to simplify queries or to query a file type that `dfs` does not specify, GZ in this case. First, you create the storage plugin in the Drill Web UI. Next, you connect to the
+file through the plugin to query a file.
+
+You can create a storage plugin using the Apache Drill Web UI to query the GZ file containing the compressed TSV data directly.
+
+  1. Create an `ngram` directory on your file system.
+  2. Copy the GZ file `googlebooks-eng-all-5gram-20120701-zo.gz` to the `ngram` directory.
+  3. Open the Drill Web UI by navigating to <http://localhost:8047/storage>.   
+     To open the Drill Web UI, the [Drill shell]({{site.baseurl}}/docs/starting-drill-on-linux-and-mac-os-x/) must still be running.
+  4. In New Storage Plugin, type `myplugin`.  
+     ![new plugin]({{ site.baseurl }}/docs/img/ngram_plugin.png)    
+  5. Click **Create**.  
+     The Configuration screen appears.
+  6. Replace null with the following storage plugin definition. On the location line, use the path to your `ngram` directory instead of the drilluser's path, and give your workspace an arbitrary name, for example, ngram:
+  
+        {
+          "type": "file",
+          "enabled": true,
+          "connection": "file:///",
+          "workspaces": {
+            "ngram": {
+              "location": "/Users/drilluser/ngram",
+              "writable": false,
+              "defaultInputFormat": null
+            }
+          },
+          "formats": {
+            "tsv": {
+              "type": "text",
+              "extensions": [
+                "gz"
+              ],
+              "delimiter": "\t"
+            }
+          }
+        }
+
+  7. Click **Create**.  
+     The success message appears briefly.
+  8. Click **Back**.  
+     The new plugin appears in Enabled Storage Plugins.  
+     ![new plugin]({{ site.baseurl }}/docs/img/ngram_plugin.png) 
+  9. Go back to the Drill shell, and list the storage plugins and their workspaces.  
+          SHOW DATABASES;
+
+          +---------------------+
+          |     SCHEMA_NAME     |
+          +---------------------+
+          | INFORMATION_SCHEMA  |
+          | cp.default          |
+          | dfs.default         |
+          | dfs.root            |
+          | dfs.tmp             |
+          | myplugin.default    |
+          | myplugin.ngram      |
+          | sys                 |
+          +---------------------+
+          8 rows selected (0.105 seconds)
+
+Your custom plugin appears in the list and has two workspaces: the `ngram`
+workspace that you defined and a default workspace.
+
+### Connect to and Query a File
+
+When querying the same data source repeatedly, avoiding long path names is
+important. This exercise demonstrates how to simplify the query. Instead of
+using the full path to the Ngram file, you use dot notation in the FROM
+clause.
+
+``<workspace name>.`<location>```
+
+This syntax assumes you connected to a storage plugin that defines the
+location of the data. To query the data source while you are _not_ connected to
+that storage plugin, include the plugin name:
+
+``<plugin name>.<workspace name>.`<location>```
+
+This exercise shows how to query Ngram data when you are connected to `myplugin`.
+
+  1. Connect to the ngram file through the custom storage plugin.  
+     `USE myplugin;`
+  2. Get data about "Zoological Journal of the Linnean" that appears more than 250 times a year in the books that Google scans. In the FROM clause, instead of using the full path to the file as you did in the last exercise, connect to the data using the storage plugin workspace name `ngram`.
+  
+         SELECT COLUMNS[0], 
+                COLUMNS[1], 
+                COLUMNS[2] 
+         FROM ngram.`/googlebooks-eng-all-5gram-20120701-zo.gz` 
+         WHERE ((COLUMNS[0] = 'Zoological Journal of the Linnean') 
+          AND (COLUMNS[2] > 250)) 
+         LIMIT 10;
+
+     The five rows of output appear.  
+
+To continue with this example and query multiple files in a directory, see the section, ["Example of Querying Multiple Files in a Directory"]({{site.baseurl}}/docs/querying-directories/#example-of-querying-multiple-files-in-a-directory).
+
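
As a reading aid for the patch above: the `COLUMNS[n]` guideline amounts to splitting each line on the delimiter and addressing the fields by zero-based position. The sketch below is written in Python rather than Drill SQL and is not part of the commit. The names `select_ngrams` and `SAMPLE_TSV` are hypothetical, and the two sample rows are the ones shown in the "About the Data" table.

```python
import csv
import io

# Two sample rows copied from the "About the Data" table in the patch.
SAMPLE_TSV = (
    "Zoological Journal of the Linnean\t2007\t284\t101\n"
    "Zoological Journal of the Linnean\t2008\t257\t87\n"
)


def select_ngrams(lines, phrase, min_count, limit=10):
    """Yield (ngram, year, match_count) for rows that pass the filter.

    Rough analogue of:
      SELECT COLUMNS[0], COLUMNS[1], COLUMNS[2] ... WHERE ... LIMIT n
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    emitted = 0
    for columns in reader:
        # In this sketch every field arrives as text, so match_count is
        # converted to an integer before the numeric comparison.
        if columns[0] == phrase and int(columns[2]) > min_count:
            yield columns[0], columns[1], int(columns[2])
            emitted += 1
            if emitted >= limit:
                return


rows = list(
    select_ngrams(io.StringIO(SAMPLE_TSV), "Zoological Journal of the Linnean", 250)
)
for row in rows:
    print(row)
```

Passing an open file handle instead of the `io.StringIO` wrapper would apply the same filter to the downloaded `.tsv` file.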

http://git-wip-us.apache.org/repos/asf/drill/blob/55f25490/_docs/query-data/query-a-file-system/040-querying-directories.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/query-a-file-system/040-querying-directories.md b/_docs/query-data/query-a-file-system/040-querying-directories.md
index 1a55b75..4a5b4ae 100644
--- a/_docs/query-data/query-a-file-system/040-querying-directories.md
+++ b/_docs/query-data/query-a-file-system/040-querying-directories.md
@@ -89,4 +89,38 @@ first level down from logs, `dir1` to the next level, and so on.
     +------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+
     10 rows selected (0.583 seconds)
 
+## Example of Querying Multiple Files in a Directory
+
+This example is a continuation of the example in the section, ["Example of Querying a TSV File"]({{site.baseurl}}/docs/querying-plain-text-files/#example-of-querying-a-tsv-file). In this example, you create a `myfiles` subdirectory in the `ngram` directory and query it through the [custom plugin workspace]({{site.baseurl}}/docs/querying-plain-text-files/#create-a-storage-plugin) you created earlier.
+
+You download a second Ngram file. Next, you
+move both Ngram GZ files you downloaded to the `ngram/myfiles` subdirectory. Finally, using the custom
+plugin workspace, you query both files. In the FROM clause, simply reference
+the subdirectory.
+
+  1. Download a second file of compressed Google Ngram data from this location: 
+  
+     http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-2gram-20120701-ze.gz
+  2. Move `googlebooks-eng-all-2gram-20120701-ze.gz` to the `ngram/myfiles` subdirectory.
+  3. Move the 5gram file you downloaded earlier, `googlebooks-eng-all-5gram-20120701-zo.gz`, to the `ngram/myfiles` subdirectory.
+  4. In the Drill shell, use the `myplugin.ngram` workspace. 
+   
+          USE myplugin.ngram;
+  5. Query the `myfiles` directory for "Zoological Journal of the Linnean" or "zero temperatures" in books published in 1998.
+  
+          SELECT * 
+          FROM myfiles 
+          WHERE (((COLUMNS[0] = 'Zoological Journal of the Linnean')
+            OR (COLUMNS[0] = 'zero temperatures')) 
+            AND (COLUMNS[1] = '1998'));
+The output lists ngrams from both files.
+
+          +----------------------------------------------------------+
+          |                         columns                          |
+          +----------------------------------------------------------+
+          | ["Zoological Journal of the Linnean","1998","157","53"]  |
+          | ["zero temperatures","1998","628","487"]                 |
+          +----------------------------------------------------------+
+          2 rows selected (7.007 seconds)
+
 For more information about querying directories, see the section, ["Query Directory Functions"]({{site.baseurl}}/docs/query-directory-functions).
\ No newline at end of file
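
For readers of this patch, the directory query above can be pictured as applying one filter to every file in the directory and combining the matches. The following is a minimal Python sketch of that idea, not Drill's implementation and not part of the commit. `query_directory` and the sample file names are hypothetical. The 1998 rows come from the output shown in the patch, and the 2007 row comes from the "About the Data" table in the plain text files page.

```python
import gzip
import tempfile
from pathlib import Path


def query_directory(directory, phrases, year):
    """Return the tab-split rows from every .gz file that match the filter."""
    matches = []
    # Sorted only to make the sketch deterministic.
    for path in sorted(Path(directory).glob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                columns = line.rstrip("\n").split("\t")
                # Fields are compared as text, like COLUMNS[1] = '1998'.
                if columns[0] in phrases and columns[1] == year:
                    matches.append(columns)
    return matches


# Build a throwaway stand-in for ngram/myfiles with two tiny gzip files.
with tempfile.TemporaryDirectory() as myfiles:
    samples = {
        "5gram-sample.gz": (
            "Zoological Journal of the Linnean\t1998\t157\t53\n"
            "Zoological Journal of the Linnean\t2007\t284\t101\n"
        ),
        "2gram-sample.gz": "zero temperatures\t1998\t628\t487\n",
    }
    for name, text in samples.items():
        with gzip.open(Path(myfiles) / name, "wt", encoding="utf-8") as handle:
            handle.write(text)

    result = query_directory(
        myfiles,
        {"Zoological Journal of the Linnean", "zero temperatures"},
        "1998",
    )

for row in result:
    print(row)
```

Each returned row is a list of four strings, which corresponds to the single `columns` array that the `SELECT *` output in the patch displays.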

http://git-wip-us.apache.org/repos/asf/drill/blob/55f25490/_docs/sql-reference/sql-functions/030-date-time-functions-and-arithmetic.md
----------------------------------------------------------------------
diff --git a/_docs/sql-reference/sql-functions/030-date-time-functions-and-arithmetic.md b/_docs/sql-reference/sql-functions/030-date-time-functions-and-arithmetic.md
index 23b0983..a6df716 100644
--- a/_docs/sql-reference/sql-functions/030-date-time-functions-and-arithmetic.md
+++ b/_docs/sql-reference/sql-functions/030-date-time-functions-and-arithmetic.md
@@ -46,7 +46,7 @@ Find the interval between midnight today, April 3, 2015, and June 13, 1957.
     +------------+
     1 row selected (0.064 seconds)
 
-Find the interval between midnight today, May 21, 2015, and hire dates of employees 578 and 761 in the employees.json file included with the Drill installation.
+Find the interval between midnight today, May 21, 2015, and hire dates of employees 578 and 761 in the `employee.json` file included with the Drill installation.
 
     SELECT AGE(CAST(hire_date AS TIMESTAMP)) FROM cp.`employee.json` where employee_id IN('578','761');
     +------------------+

http://git-wip-us.apache.org/repos/asf/drill/blob/55f25490/_docs/tutorials/020-drill-in-10-minutes.md
----------------------------------------------------------------------
diff --git a/_docs/tutorials/020-drill-in-10-minutes.md b/_docs/tutorials/020-drill-in-10-minutes.md
index 6584021..cf21743 100755
--- a/_docs/tutorials/020-drill-in-10-minutes.md
+++ b/_docs/tutorials/020-drill-in-10-minutes.md
@@ -45,7 +45,7 @@ Complete the following steps to install Drill:
 
 1. In a terminal window, change to the directory where you want to install Drill.
 
-2. To download the latest version of Apache Drill, download Drill from the [Drill web site](http://getdrill.org/drill/download/apache-drill-1.0.0.tar.gz)or run one of the following commands, depending on which you have installed on your system:
+2. To download the latest version of Apache Drill, download Drill from the [Drill web site](http://getdrill.org/drill/download/apache-drill-1.0.0.tar.gz) or run one of the following commands, depending on which you have installed on your system:
 
    * `wget http://getdrill.org/drill/download/apache-drill-1.0.0.tar.gz`  
    *  `curl -o apache-drill-1.0.0.tar.gz http://getdrill.org/drill/download/apache-drill-1.0.0.tar.gz`
 

