Return-Path: X-Original-To: apmail-drill-commits-archive@www.apache.org Delivered-To: apmail-drill-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 95B6A1781A for ; Thu, 12 Mar 2015 01:05:15 +0000 (UTC) Received: (qmail 6741 invoked by uid 500); 12 Mar 2015 01:05:09 -0000 Delivered-To: apmail-drill-commits-archive@drill.apache.org Received: (qmail 6704 invoked by uid 500); 12 Mar 2015 01:05:09 -0000 Mailing-List: contact commits-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: commits@drill.apache.org Delivered-To: mailing list commits@drill.apache.org Received: (qmail 6695 invoked by uid 99); 12 Mar 2015 01:05:09 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 12 Mar 2015 01:05:09 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id D71B0E10A9; Thu, 12 Mar 2015 01:05:08 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit From: bridgetb@apache.org To: commits@drill.apache.org Message-Id: X-Mailer: ASF-Git Admin Mailer Subject: drill git commit: DRILL-2397, new data types doc, misc other fixes Date: Thu, 12 Mar 2015 01:05:08 +0000 (UTC) Repository: drill Updated Branches: refs/heads/gh-pages-master eca98c77b -> df1b7e5a9 DRILL-2397, new data types doc, misc other fixes Project: http://git-wip-us.apache.org/repos/asf/drill/repo Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/df1b7e5a Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/df1b7e5a Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/df1b7e5a Branch: refs/heads/gh-pages-master Commit: df1b7e5a9397b02a880230e6dc51a07f2b1ff997 Parents: eca98c7 Author: Kristine Hahn Authored: Tue Mar 10 18:13:08 2015 -0700 Committer: Bridget Bevens 
Committed: Wed Mar 11 18:00:37 2015 -0700 ---------------------------------------------------------------------- _docs/005-connect.md | 6 +- _docs/connect/001-plugin-reg.md | 4 +- _docs/connect/002-plugin-conf.md | 39 +++-- _docs/data-sources/001-hive-types.md | 31 ++-- _docs/data-sources/003-parquet-ref.md | 12 +- _docs/data-sources/004-json-ref.md | 3 +- _docs/img/Hbase_Browse.png | Bin 147495 -> 148451 bytes _docs/img/connect-plugin.png | Bin 0 -> 24774 bytes _docs/query/004-query-complex.md | 2 +- _docs/sql-ref/001-data-types.md | 163 ++++++++++++------- _docs/sql-ref/002-lexical-structure.md | 28 ++-- _docs/sql-ref/004-functions.md | 2 +- .../data-types/002-disparate-data-types.md | 77 +++++++++ _docs/tutorial/002-get2kno-sb.md | 2 +- 14 files changed, 246 insertions(+), 123 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/005-connect.md ---------------------------------------------------------------------- diff --git a/_docs/005-connect.md b/_docs/005-connect.md index 3c60b2d..69bb407 100644 --- a/_docs/005-connect.md +++ b/_docs/005-connect.md @@ -1,7 +1,8 @@ --- title: "Connect to a Data Source" --- -A storage plugin is an interface for connecting to a data source to read and write data. Apache Drill connects to a data source, such as a file on the file system or a Hive metastore, through a storage plugin. When you execute a query, Drill gets the plugin name you provide in FROM clause of your query. +A storage plugin is an interface for connecting to a data source to read and write data. Apache Drill connects to a data source, such as a file on the file system or a Hive metastore, through a storage plugin. When you execute a query, Drill gets the plugin name you provide in FROM clause of your query or from the default you specify in the USE. command that precedes the query. +. 
In addition to the connection string, the storage plugin configures the workspace and file formats for reading and writing data, as described in subsequent sections. @@ -18,8 +19,7 @@ A storage plugin provides the following information to Drill: * Interfaces that Drill can use to read from and write to data sources * A set of storage plugin optimization rules that assist with efficient and faster execution of Drill queries, such as pushdowns, statistics, and partition awareness -A storage plugin performs scanner and writer functions, and informs the metadata -repository of any known metadata, such as: +A storage plugin performs scanner and writer functions, and informs the metadata repository of any known metadata. The metadata repository is a database created to store metadata. The metadata is information about the structures that contain the actual data, such as: * Schema * File size http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/connect/001-plugin-reg.md ---------------------------------------------------------------------- diff --git a/_docs/connect/001-plugin-reg.md b/_docs/connect/001-plugin-reg.md index 8def0bb..b3c5cdc 100644 --- a/_docs/connect/001-plugin-reg.md +++ b/_docs/connect/001-plugin-reg.md @@ -2,14 +2,14 @@ title: "Storage Plugin Registration" parent: "Connect to a Data Source" --- -You connect Drill to a file system, Hive, HBase, or other data source using storage plugins. Drill includes a number storage plugins in the installation. On the Storage tab of the Web UI, you can view, create, reconfigure, and register a storage plugin. To open the Storage tab, go to `http://:8047/storage`: +You connect Drill to a file system, Hive, HBase, or other data source using storage plugins. Drill includes a number of storage plugins in the installation. On the Storage tab of the Web UI, you can view, create, reconfigure, and register a storage plugin. 
To open the Storage tab, go to `http://:8047/storage`, where IP address is any one of the installed drill bits: ![drill-installed plugins]({{ site.baseurl }}/docs/img/plugin-default.png) The Drill installation registers the `cp`, `dfs`, `hbase`, `hive`, and `mongo` storage plugins instances by default. * `cp` - Points to a JAR file in the Drill classpath that contains sample data that you can query. + Points to a JAR file in the Drill classpath that contains the Transaction Processing Performance Council (TPC) benchmark schema TPC-H that you can query. * `dfs` Points to the local file system on your machine, but you can configure this instance to point to any distributed file system, such as a Hadoop or S3 file system. http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/connect/002-plugin-conf.md ---------------------------------------------------------------------- diff --git a/_docs/connect/002-plugin-conf.md b/_docs/connect/002-plugin-conf.md index 632ed3f..caef2ad 100644 --- a/_docs/connect/002-plugin-conf.md +++ b/_docs/connect/002-plugin-conf.md @@ -3,8 +3,8 @@ title: "Storage Plugin Configuration" parent: "Connect to a Data Source" --- When you add or update storage plugin instances on one Drill node in a Drill -cluster, Drill broadcasts the information to all of the other Drill nodes so -they all have identical storage plugin configurations. You do not need to +cluster, Drill broadcasts the information to all of the other Drill nodes +to have identical storage plugin configurations. You do not need to restart any of the Drillbits when you add or update a storage plugin instance. Use the Drill Web UI to update or add a new storage plugin. Launch a web browser, go to: `http://:8047`, and then go to the Storage tab. @@ -21,6 +21,11 @@ name. Names are case-sensitive. Click Update to reconfigure an existing, enabled storage plugin. 
## Storage Plugin Attributes +The following diagram of the dfs storage plugin briefly describes options you configure in a typical storage plugin configuration: + +![dfs plugin]({{ site.baseurl }}/docs/img/connect-plugin.png) + +The following table describes the attributes you configure for storage plugins in more detail than the diagram. @@ -33,72 +38,74 @@ Click Update to reconfigure an existing, enabled storage plugin. - + - + - + - + - + - - + + - - + + - + - - + + - + - +
"type" "file"
"hbase"
"hive"
"mongo"
yesThe storage plugin type name supported by DrillThe storage plugin type name supported by Drill.
"enabled" true
false
yesThe state of the storage pluginThe state of the storage plugin.
"connection" "classpath:///"
"file:///"
"mongodb://localhost:27017/"
"maprfs:///"
implementation-dependentThe type of distributed filesystem. Drill can work with any distributed system, such as HDFS and S3, or JSON or CSV file in your file system.The type of distributed file system. Drill can work with any distributed system, such as HDFS and S3, or files in your file system.
"workspaces" null
"logs"
noOne or more arbitrary workspace names, enclosed in double quotation marks. Workspace names are case sensitive.One or more unique workspace names, enclosed in double quotation marks. If a workspace is defined more than once, the latest one overrides the previous ones. Not used with local or distributed file systems.
"workspaces". . . "location" "location": "/"
"location": "/tmp"
noAn alias for a path to a directory, used to shorten references to files in the a query. The path to a directory on the file system.
"workspaces". . . "writable" true
false
yesIndicates whether or not users can write data to this locationnoOne or more unique workspace names, enclosed in double quotation marks. If a workspace is defined more than once, the latest one overrides the previous ones. Not used with local or distributed file systems.
"workspaces". . . "defaultInputFormat" null
"parquet"
"csv"
"json"
yesThe format of data written by executing CREATE TABLE AS and CREATE VIEW commandsnoThe format of data Drill reads by default, regardless of extension. Parquet is the default.
"formats" "psv"
"csv"
"tsv"
"parquet"
"json"
"maprdb"
yesOne or more file formats of data read by Drill. Drill can detect some file formats based on the file extension or the first few bits of data within the file, but needs format information for others. One or more file formats of data Drill can read. Drill can implicitly detect some file formats based on the file extension or the first few bits of data within the file, but you need to configure an option for others.
"formats" . . . "type" "text"
"parquet"
"json"
"maprdb"
noThe type of data in the file. Although Drill can work with different file types in the same directory, restricting a Drill workspace to one file type prevents confusion. yesThe type of the format specified. For example, you can define two formats, csv and psv, as type "Text", but having different delimiters. Drill enables the maprdb plugin if you define the maprdb type.
formats . . . "extensions" ["csv"] format-dependentThe extension of the file.The extensions of the files that Drill can read.
"formats" . . . "delimiter" "\t"
","
format-dependentThe delimiter used to separate columns in text files such as CSV.The delimiter used to separate columns in text files such as CSV. Specify a non-printable delimiter in the storage plugin config by using the form \uXXXX, where XXXX is the four numeral hex ascii code for the character.
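The \uXXXX delimiter form described for the "delimiter" attribute can be sanity-checked with a quick sketch. This is illustrative Python, not Drill configuration; the sample line and its field values are made up:

```python
# A non-printable column delimiter such as \u0001 (ASCII 0x01) is the
# kind of character the storage plugin's "delimiter" attribute encodes
# as \uXXXX. Splitting a hypothetical sample line on it recovers the
# individual columns.
line = "100005\u0001true\u00013.5"
fields = line.split("\u0001")
print(fields)
```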
The configuration of other attributes, such as `size.calculator.enabled` in the hbase plugin and `configProps` in the hive plugin, are implementation-dependent and beyond the scope of this document. +Although Drill can work with different file types in the same directory, restricting a Drill workspace to one file type prevents confusion. + ## Case-sensitive Names As previously mentioned, workspace and storage plugin names are case-sensitive. For example, the following query uses a storage plugin name `dfs` and a workspace name `clicks`. When you refer to `dfs.clicks` in an SQL statement, use the defined case: http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/data-sources/001-hive-types.md ---------------------------------------------------------------------- diff --git a/_docs/data-sources/001-hive-types.md b/_docs/data-sources/001-hive-types.md index c6cdb90..a4721bf 100644 --- a/_docs/data-sources/001-hive-types.md +++ b/_docs/data-sources/001-hive-types.md @@ -5,7 +5,7 @@ parent: "Data Sources and File Formats" Using Drill you can read tables created in Hive that use data types compatible with Drill. Drill currently does not support writing Hive tables. The following table shows Drill support for Hive primitive types: - + @@ -80,25 +80,16 @@ Using Drill you can read tables created in Hive that use data types compatible w - + - - - - - - + - - - - - + @@ -120,7 +111,7 @@ The Hive version used in MapR supports the Hive timestamp in Unix Epoch format. ## Type Mapping Example This example demonstrates the mapping of Hive data types to Drill data types. 
Using a CSV that has the following contents, you create a Hive table having values of different supported types: - 100005,true,3.5,-1231.4,3.14,42,"SomeText",2015-03-25,2015-03-25 01:23:15 + 8223372036854775807,true,3.5,-1231.4,3.14,42,"SomeText",2015-03-25,2015-03-25 01:23:15 The example assumes that the CSV resides on the MapR file system (MapRFS) in the Drill sandbox: `/mapr/demo.mapr.com/data/` @@ -144,7 +135,7 @@ You check that Hive mapped the data from the CSV to the typed values as as expec hive> SELECT * FROM types_demo; OK - 100005 true 3.5 -1231.4 3.14 42 "SomeText" 2015-03-25 2015-03-25 01:23:15 + 8223372036854775807 true 3.5 -1231.4 3.14 42 "SomeText" 2015-03-25 2015-03-25 01:23:15 Time taken: 0.524 seconds, Fetched: 1 row(s) In Drill, you use the Hive storage plugin that has the following definition. @@ -171,11 +162,11 @@ Using the Hive storage plugin connects Drill to the Hive metastore containing th The data in the Hive table shows the expected values. 0: jdbc:drill:> SELECT * FROM hive.`types_demo`; - +--------+------+------+---------+------+----+------------+------------+-----------+ - | a | b | c | d | e | f | g | h | i | - +------------+---------+---------+------+----+------------+------------+-----------+ - | 100005 | true | 3.50 | -1231.4 | 3.14 | 42 | "SomeText" | 2015-03-25 | 2015-03-25 01:23:15.0 | - +--------+------+------+---------+------+----+------------+------------+-----------+ + +---------------------+------+------+---------+------+----+------------+------------+-----------+ + | a | b | c | d | e | f | g | h | i | + +---------------------+---------+---------+------+----+------------+------------+-----------+ + | 8223372036854775807 | true | 3.50 | -1231.4 | 3.14 | 42 | "SomeText" | 2015-03-25 | 2015-03-25 01:23:15.0 | + +---------------------+------+------+---------+------+----+------------+------------+-----------+ 1 row selected (1.262 seconds) To validate that Drill interprets the timestamp in column i correctly, use the extract 
function to extract part of the date: http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/data-sources/003-parquet-ref.md ---------------------------------------------------------------------- diff --git a/_docs/data-sources/003-parquet-ref.md b/_docs/data-sources/003-parquet-ref.md index 71ed851..4f04463 100644 --- a/_docs/data-sources/003-parquet-ref.md +++ b/_docs/data-sources/003-parquet-ref.md @@ -16,13 +16,13 @@ Apache Drill includes the following support for Parquet: * Querying self-describing data in files or NoSQL databases without having to define and manage schema overlay definitions in centralized metastores * Creating Parquet files from other file formats, such as JSON, without any set up * Generating Parquet files that have evolving or changing schemas and querying the data on the fly -* Handling Parquet scalar and complex data types, such as maps and arrays +* Handling Parquet data types ### Reading Parquet Files When a read of Parquet data occurs, Drill loads only the necessary columns of data, which reduces I/O. Reading only a small piece of the Parquet data from a data file or table, Drill can examine and analyze all values for a column across multiple files. You can create a Drill table from one format and store the data in another format, including Parquet. ### Writing Parquet Files -CTAS can use any data source provided by the storage plugin. To write Parquet data using the CTAS command, set the session store.format option as shown in the next section. Alternatively, configure the storage plugin to point to the directory containing the Parquet files. +CREATE TABLE AS SELECT (CTAS) can use any data source provided by the storage plugin. To write Parquet data using the CTAS command, set the session store.format option as shown in the next section. Alternatively, configure the storage plugin to point to the directory containing the Parquet files. 
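The CTAS example in this section casts JSON string data to the SQL types DATE, TIME, and DOUBLE. A rough Python sketch of that casting step, for intuition only — Drill performs this inside the CTAS query itself, and the field names here follow the sample JSON row used in this section:

```python
# Illustrative only (not Drill code): JSON string data becomes typed
# date, time, and floating-point values, mirroring the casts in the
# CTAS example.
import json
from datetime import datetime

row = json.loads('{"trans_id":0,"date":"2013-07-26","time":"04:56:59","amount":80.5}')
trans_date = datetime.strptime(row["date"], "%Y-%m-%d").date()
trans_time = datetime.strptime(row["time"], "%H:%M:%S").time()
amount = float(row["amount"])
```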
Although the data resides in a single table, Parquet output generally consists of multiple files that resemble MapReduce output having numbered file names, such as 0_0_0.parquet in a directory. @@ -56,7 +56,7 @@ The following general process converts a file from JSON to Parquet: * Create or use an existing storage plugin that specifies the storage location of the Parquet file, mutability of the data, and supported file formats. * Take a look at the JSON data. * Create a table that selects the JSON file. -* In the CTAS command, cast JSON string data to corresponding SQL types. +* In the CTAS command, cast JSON string data to corresponding [SQL types](/docs/json-data-model/data-type-mapping). ### Example: Read JSON, Write Parquet This example demonstrates a storage plugin definition, a sample row of data from a JSON file, and a Drill query that writes the JSON input to Parquet output. @@ -65,7 +65,7 @@ This example demonstrates a storage plugin definition, a sample row of data from You can use the default dfs storage plugin installed with Drill for reading and writing Parquet files. The storage plugin needs to configure the writable option of the workspace to true, so Drill can write the Parquet output. The dfs storage plugin defines the tmp writable workspace, which you can use in the CTAS command to create a Parquet table. #### Sample Row of JSON Data -A JSON file contains data consisting of strings, typical of JSON data. The following example shows one row of the JSON file: +A JSON file called sample.json contains data consisting of strings, typical of JSON data. The following example shows one row of the JSON file: {"trans_id":0,"date":"2013-07-26","time":"04:56:59","amount":80.5,"user_info": {"cust_id":28,"device":"WEARABLE2","state":"mt" @@ -80,7 +80,7 @@ A JSON file contains data consisting of strings, typical of JSON data. The follo #### CTAS Query -The following example shows a CTAS query that creates a table from JSON data. 
The command casts the date, time, and amount strings to SQL types DATE, TIME, and DOUBLE. String-to-VARCHAR casting of the other strings occurs automatically. +The following example shows a CTAS query that creates a table from JSON data shown in the last example. The command casts the date, time, and amount strings to SQL types DATE, TIME, and DOUBLE. String-to-VARCHAR casting of the other strings occurs automatically. CREATE TABLE dfs.tmp.sampleparquet AS (SELECT trans_id, @@ -90,7 +90,7 @@ The following example shows a CTAS query that creates a table from JSON data. Th user_info, marketing_info, trans_info FROM dfs.`/Users/drilluser/sample.json`); -The CTAS query did not specify a file name extension, so Drill used the default parquet file name extension. The output is a Parquet file: +The CTAS query does not specify a file name extension for the output. Drill creates a parquet file by default, as indicated by the file name in the output: +------------+---------------------------+ | Fragment | Number of records written | http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/data-sources/004-json-ref.md ---------------------------------------------------------------------- diff --git a/_docs/data-sources/004-json-ref.md b/_docs/data-sources/004-json-ref.md index f865a8a..65b5b7c 100644 --- a/_docs/data-sources/004-json-ref.md +++ b/_docs/data-sources/004-json-ref.md @@ -70,7 +70,7 @@ When you set this option, Drill reads all data from the JSON files as VARCHAR. A * Cast JSON numeric values to [SQL types](/docs/data-types), such as BIGINT, DECIMAL, FLOAT, INTEGER, and SMALLINT. * Cast JSON strings to [Drill Date/Time Data Type Formats](/docs/supported-date-time-data-type-formats). -Drill uses [map and array data types](/docs/data-types) internally for reading and writing complex and nested data structures from JSON. 
You can cast data in a map or array of data to return a value from the structure, as shown in [“Create a view on a MapR-DB table”] (/docs/lession-2-run-queries-with-ansi-sql). “Query Complex Data” shows how to access nested arrays, for example. +Drill uses [map and array data types](/docs/data-types) internally for reading complex and nested data structures from JSON. You can cast data in a map or array of data to return a value from the structure, as shown in [“Create a view on a MapR-DB table”] (/docs/lession-2-run-queries-with-ansi-sql). “Query Complex Data” shows how to access nested arrays. ## Reading JSON To read JSON data using Drill, use a [file system storage plugin](/docs/connect-to-a-data-source) that defines the JSON format. You can use the `dfs` storage plugin, which includes the definition. @@ -501,3 +501,4 @@ Workaround: Query each file individually. + http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/img/Hbase_Browse.png ---------------------------------------------------------------------- diff --git a/_docs/img/Hbase_Browse.png b/_docs/img/Hbase_Browse.png index 729e0f8..014ceca 100644 Binary files a/_docs/img/Hbase_Browse.png and b/_docs/img/Hbase_Browse.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/img/connect-plugin.png ---------------------------------------------------------------------- diff --git a/_docs/img/connect-plugin.png b/_docs/img/connect-plugin.png new file mode 100644 index 0000000..d47bee3 Binary files /dev/null and b/_docs/img/connect-plugin.png differ http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/query/004-query-complex.md ---------------------------------------------------------------------- diff --git a/_docs/query/004-query-complex.md b/_docs/query/004-query-complex.md index 537d7b4..96c06eb 100644 --- a/_docs/query/004-query-complex.md +++ b/_docs/query/004-query-complex.md @@ -43,7 +43,7 @@ The examples in this section operate on JSON data files. 
In order to write your own queries, you need to be aware of the basic data types in these files: * string (all data inside double quotes), such as `"0001"` or `"Cake"` - * numeric types: integers, decimals, and floats, such as `0.55` or `10` + * number: integers, decimals, and floats, such as `0.55` or `10` * null values * boolean values: true, false http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/sql-ref/001-data-types.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/001-data-types.md b/_docs/sql-ref/001-data-types.md index 491898a..094fd87 100644 --- a/_docs/sql-ref/001-data-types.md +++ b/_docs/sql-ref/001-data-types.md @@ -2,76 +2,119 @@ title: "Data Types" parent: "SQL Reference" --- -You can use the following SQL data types in your Drill queries: +You can use the following SQL data types in Drill queries: + +
SQL TypeSupported SQL Type Hive Type Drill Description
Conventional UNIX Epoch timestamp.
N/ANone TIMESTAMP JDBC timestamp in yyyy-mm-dd hh:mm:ss format
TIMESTAMPTZN/AHours ahead of or behind Coordinated Universal Time (UTC) or regional hours and minutes
N/ANone STRING Binary string (16)
BINARYBINARYBinary string
VARCHAR VARCHAR
| SQL Type | Drill Description |
|----------|-------------------|
| BIGINT | 8-byte signed integer |
| BOOLEAN | TRUE (1) or FALSE (0) |
| DATE | Years, months, and days in the form YYYY-MM-DD |
| DECIMAL | 38-digit precision |
| FLOAT | 4-byte single precision floating point number |
| DOUBLE | 8-byte double precision floating point number |
| INTEGER | 4-byte signed integer |
| INTERVAL | Integer fields representing a period of time, depending on the type of interval |
| INTERVALDAY | Integer fields representing a day |
| INTERVALYEAR | Integer fields representing a year |
| SMALLINT | 2-byte signed integer. Supported in Drill 0.9 and later. |
| TIME | Hours, minutes, and seconds on a 24-hour basis |
| TIMESTAMP | Conventional UNIX Epoch timestamp |
| VARCHAR | Variable-length character string |
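The TIMESTAMP entry above is a conventional UNIX Epoch timestamp. As an illustrative aside (Python, not Drill; the epoch value is hypothetical, chosen to match the 2015-03-25 01:23:15 example used elsewhere in these docs):

```python
# An epoch-seconds value and its yyyy-mm-dd hh:mm:ss rendering, the
# representation the TIMESTAMP row refers to. The value is made up
# for illustration.
from datetime import datetime, timezone

epoch_seconds = 1427246595
formatted = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
```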
+## Complex Data Types -#### Character - - * VARCHAR/CHAR - -#### Date/Time - - * DATE - * INTERVAL - * Interval Year (stores year and month) - * Interval Day (stores day, hour, minute, seconds, and milliseconds) - * TIME - * TIMESTAMP - -Refer to [Supported Date/Time Data Type formats](/docs/supported-date-time-data-type-formats/). +Complex and nested data structures in JSON and Parquet files are of map and array types. -#### Integer +* A map is a set of name/value pairs. + A value in a map can be a scalar type, such as string or int, or a complex type, such as an array or another map. +* An array is a repeated list of values. + A value in an array can be a scalar type, such as string or int, or an array can be a complex type, such as a map or another array. - * BIGINT - * INT - * SMALLINT +Drill reads/writes maps and arrays from/to JSON and Parquet files. In Drill, you do not cast a map or array to another type. -#### Numeric - * DECIMAL - * FLOAT - * DOUBLE PRECISION (FLOAT 8) - * REAL (FLOAT 4) +The following example shows a JSON map having scalar values: -#### Boolean + phoneNumber: + { + areaCode: 622, + number: 1567845 + } -Values are FALSE or TRUE. +The following example shows a JSON map having an array as a value: -## Complex Data Types + { citiesLived : + [ + { place : Los Angeles, + yearsLived : [ 1989, 1993, 1998, 2002] + } + ] + } -Drill provides map and array data types to work with complex and nested data -structures. For analysis of complex data, a more modern JSON-style approach to -writing queries is more effective than using standard SQL functions. +The following example shows a JSON array having scalar values: -The following table provides descriptions and examples of the complex data -types: + yearsLived: [1990, 1993, 1998, 2008] - - - - - - - - - - - - - - -
| Data Type | Description | Example |
|-----------|-------------|---------|
| Map | A map is a set of name/value pairs. A value in a map can be a scalar type, such as string or int, or a map can be a complex type, such as an array or another map. | Map with scalar type values: `"phoneNumber": { "areaCode": "622", "number": "1567845"}` <br>Map with complex type value: `{ "citiesLived" : [ { "place" : "Los Angeles", "yearsLived" : [ "1989", "1993", "1998", "2002" ] } ] }` |
| Array | An array is a repeated list of values. A value in an array can be a scalar type, such as string or int, or an array can be a complex type, such as a map or another array. | Array with scalar values: `"yearsLived": ["1990", "1993", "1998", "2008"]` <br>Array with complex type values: `"children": [ { "age" : "10", "gender" : "Male", "name" : "Earl" }, { "age" : "6", "gender" : "Male", "name" : "Sam" }, { "age" : "8", "gender" : "Male", "name" : "Kit" } ]` |
+The following example shows a JSON array having complex type values: + children: + [ + { age : 10, + gender : Male, + name : Earl + }, + { age : 6, + gender : Male, + name : Sam, + { age : 8, + gender : Male, + name : Kit + } + ] + http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/sql-ref/002-lexical-structure.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/002-lexical-structure.md b/_docs/sql-ref/002-lexical-structure.md index 971bb9f..5115ab8 100644 --- a/_docs/sql-ref/002-lexical-structure.md +++ b/_docs/sql-ref/002-lexical-structure.md @@ -3,10 +3,13 @@ title: "Lexical Structure" parent: "SQL Reference" --- -A SQL statement used in Drill can include the following parts: +A SQL statement used in Drill can include one or more of the following parts: -* [Storage plugin and workspace references](/docs/lexical-structure#storage-plugin-and-workspace-references) -* Literal values +* Clause, such as FROM +* Command, such as SELECT +* Expression, a combination of one or more values, operators, and SQL functions that evaluates to a value. For example, users.firstname is a period expression +* Function, scalar and aggregate, such as sum +* Literal value * [Boolean](/docs/lexical-structure#boolean) * [Identifier](/docs/lexical-structure#identifier) @@ -14,11 +17,12 @@ A SQL statement used in Drill can include the following parts: * [Numeric constant](/docs/lexical-structure#numeric-constant) * [String](/docs/lexical-structure#string) -* Expressions, such as t.metric * 3.1415927 -* Functions, such as count(*) -* Names of commands and clauses, such as `SELECT * FROM myfile WHERE a > b`. +* Operator, such as [NOT] IN, LIKE, and AND +* Predicate, such as a > b in `SELECT * FROM myfile WHERE a > b`. +* [Storage plugin and workspace reference](/docs/lexical-structure#storage-plugin-and-workspace-references) +* Whitespace -The upper/lowercase sensitivity of the parts differ. 
+The upper/lowercase sensitivity of the parts differs. ## Case-sensitivity @@ -30,7 +34,7 @@ The sys.options table name and values are case-sensitive. The following query wo SELECT * FROM sys.options where NAME like '%parquet%'; -When using the ALTER command, specify the name lower case. For example: +When using the ALTER command, specify the name in lower case. For example: ALTER SESSION set `store.parquet.compression`='snappy'; @@ -40,7 +44,7 @@ Storage plugin and workspace names are case-sensitive. The case of the name used SELECT * FROM dfs.`/Users/drilluser/ticket_sales.json`; -## Literal values +## Literal Values This section describes how to construct literals. @@ -53,7 +57,7 @@ An identifier is a letter followed by any sequence of letters, digits, or the un * Keywords * Identifiers that SQL cannot parse. -For example, you enclose the SQL keywords date and time in back ticks in the example in section, ["Example: Read JSON, Write Parquet"](/docs/parquet-format#example-read-json-write-parquet): +For example, enclose the SQL keywords date and time in back ticks when referring to column names, but not when referring to data types: CREATE TABLE dfs.tmp.sampleparquet AS (SELECT trans_id, @@ -65,7 +69,7 @@ For example, you enclose the SQL keywords date and time in back ticks in the exa Table and column names are case-insensitive. Use back ticks to enclose names that contain special characters. Special characters are those other than the 52 Latin alphabet characters. For example, space and @ are special characters. -The following example shows the keyword Year and the column alias "Customer Number" enclosed in back ticks. The column alias contains the special space character. +The following example shows the keyword Year enclosed in back ticks. 
Because the column alias contains the special space character, also enclose the alias in back ticks, as shown in the following example: SELECT extract(year from transdate) AS `Year`, t.user_info.cust_id AS `Customer Number` FROM dfs.tmp.`sampleparquet` t; @@ -130,7 +134,7 @@ Strings are characters enclosed in single quotation marks. To use a single quota +------------+ 2 rows selected (0.053 seconds) -To refer to the string Martha's Vineyard in a query, use single quotation marks to enclose the string. and escape the apostophe using a single quotation mark: +To refer to the string Martha's Vineyard in a query, use single quotation marks to enclose the string and escape the apostophe using a single quotation mark: SELECT * FROM dfs.`/Users/drilluser/vitalstat.json` t WHERE t.source = 'Martha''s Vineyard'; http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/sql-ref/004-functions.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/004-functions.md b/_docs/sql-ref/004-functions.md index c405a58..03cd6ae 100644 --- a/_docs/sql-ref/004-functions.md +++ b/_docs/sql-ref/004-functions.md @@ -40,7 +40,7 @@ Function| Return Type char_length(string) or character_length(string)| int concat(str "any" [, str "any" [, ...] 
])| text convert_from(string text, src_encoding name)| text -convert_to(string text, dest_encoding name)| +convert_to(string text, dest_encoding name)| byte array initcap(string)| text left(str text, n int)| text length(string)| int http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/sql-ref/data-types/002-disparate-data-types.md ---------------------------------------------------------------------- diff --git a/_docs/sql-ref/data-types/002-disparate-data-types.md b/_docs/sql-ref/data-types/002-disparate-data-types.md new file mode 100644 index 0000000..be9dbc4 --- /dev/null +++ b/_docs/sql-ref/data-types/002-disparate-data-types.md @@ -0,0 +1,77 @@ +--- +title: "Handling Disparate Data Types" +parent: "Data Types" +--- +[Previous](/docs/supported-date-time-data-type-formats)                        [Back to Table of Contents](/docs)                        [Next](/docs/lexical-structure) + +The file format of the data and planned queries determine the casting or converting required, if any. When Drill reads schema-less data into SQL tables for querying, you might need to cast one data type to another explicitly. In some cases, Drill converts schema-less data to typed data implicitly. In this case, you do not need to cast. Drill does not implicitly cast HBase binary data. You use convert_to and convert_from functions to work with HBase data in Drill. + +With respect to data types, Drill treats data from these sources as follows: + +* HBase + No implicit casting to SQL types. Convert data to appropriate types as shown in ["Querying HBase."](/docs/querying-hbase/) +* Hive + Implicitly casts Hive types to SQL types as shown in the Hive [type mapping example](/docs/hive-to-drill-data-type-mapping#type-mapping-example) +* JSON + Implicitly casts JSON data to SQL types as shown in the [SQL and JSON type mapping table](/docs/json-data-model#data-type-mapping) of the JSON Data Model documentation. 
+* MapR-DB + Implicitly casts MapR-DB data to SQL types when you use the maprdb format for reading MapR-DB data. The dfs storage plugin defines the format when you install Drill from the mapr-drill package on a MapR node. +* Parquet + Implicitly casts Parquet data to the SQL types shown in [SQL Data Types to Parquet](/docs/parquet-format/sql-data-types-to-parquet). +* Text: CSV, TSV, and other text + Implicitly casts all textual data to varchar. + +Drill supports a number of functions to cast and convert compatible data types: + +* cast + Casts textual data from one data type to another. +* convert_to and convert_from + Converts binary data from one data type to another. +* to_char + Converts a timestamp, interval, integer, real/double precision, or decimal to a string. +* to_date + Converts a string to a date. +* to_number + Converts a string to a decimal. +* to_timestamp + Converts a string to a timestamp. + +### Using Cast + +Embed a cast function in a query using this syntax: + + cast (<expression> AS <data type>) + +* expression + An entity that has a single data value, such as a column name, of the data type you want to cast to a different type +* data type + The target data type, such as INTEGER or DATE + +Example: Inspect integer data and cast data to a decimal + + SELECT c_row, c_int FROM mydata WHERE c_row = 9; + c_row | c_int + ------+------------ + 9 | -2147483648 + (1 row) + + SELECT c_row, cast(c_int as decimal(28,8)) FROM mydata WHERE c_row = 9; + c_row | c_int + ------+--------------------- + 9 | -2147483648.00000000 + (1 row) + +If the SELECT statement includes a WHERE clause that compares a column of an unknown data type, cast both the value of the column and the comparison value in the WHERE clause. For example: + + SELECT c_row, cast(c_int as decimal(28,8)) FROM mydata WHERE cast(c_int as decimal(28,8)) > -3.0 + +Although you can use cast to handle binary data, convert_to and convert_from are recommended for use with binary data.
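A minimal sketch of the to_date and to_number functions listed above; the column names (sale_date, amount), the file path, and the format patterns are hypothetical:

    SELECT to_date(sale_date, 'yyyy-MM-dd') AS sale_dt,
           to_number(amount, '######.##') AS amt
    FROM dfs.tmp.`sales.csv`;

Both functions take a format pattern as the second argument, so the same varchar field can be parsed differently from one query to the next.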
+ +### Using convert_to and convert_from + +To query HBase data in Drill, convert each column of an HBase table from a byte array to an [SQL data type](/docs/data-types/) that Drill supports when reading data, and back to a byte array when writing. For examples of how to use these functions, see ["Convert and Cast Functions"](/docs/sql-functions#convert-and-cast-functions). + +## Handling Textual Data +In a textual file, such as CSV, Drill interprets every field as a varchar, as previously mentioned. In addition to using the cast function, you can also use [to_char](link), [to_date](link), [to_number](link), and [to_timestamp](link). If the SELECT statement includes a WHERE clause that compares a column of an unknown data type, cast both the value of the column and the comparison value in the WHERE clause. + + http://git-wip-us.apache.org/repos/asf/drill/blob/df1b7e5a/_docs/tutorial/002-get2kno-sb.md ---------------------------------------------------------------------- diff --git a/_docs/tutorial/002-get2kno-sb.md b/_docs/tutorial/002-get2kno-sb.md index 3850daf..ee2b5dd 100644 --- a/_docs/tutorial/002-get2kno-sb.md +++ b/_docs/tutorial/002-get2kno-sb.md @@ -24,7 +24,7 @@ In this tutorial you query a number of data sets, including Hive and HBase, and ## Storage Plugin Overview This section describes storage plugins included in the sandbox. For general information about Drill storage plugins, see ["Connect to a Data Source"](/docs/connect-to-data-sources). -Take a look at the pre-configured storage plugins for the sandbox by opening the Storage tab in the Drill Web UI. Launch a web browser and go to: `http://<IP address>:8047/storage`. For example: +Take a look at the pre-configured storage plugins for the sandbox by opening the Storage tab in the Drill Web UI. Launch a web browser and go to: `http://<IP address>:8047/storage`. For example: http://localhost:8047/storage
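Returning to the convert_from discussion above, a hedged sketch of decoding HBase binary data in a query; the table name (customers), column family (account), and the choice of encodings are illustrative only, not taken from the docs:

    SELECT CONVERT_FROM(row_key, 'UTF8') AS cust_id,
           CONVERT_FROM(t.account.balance, 'BIGINT_BE') AS balance
    FROM hbase.`customers` t;

Each column must name the encoding that was used when the bytes were written; nothing in HBase itself records the original type.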