Return-Path: 
X-Original-To: apmail-drill-commits-archive@www.apache.org
Delivered-To: apmail-drill-commits-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2957218086 for ; Tue, 4 Aug 2015 23:20:51 +0000 (UTC)
Received: (qmail 75044 invoked by uid 500); 4 Aug 2015 23:20:51 -0000
Delivered-To: apmail-drill-commits-archive@drill.apache.org
Received: (qmail 75004 invoked by uid 500); 4 Aug 2015 23:20:51 -0000
Mailing-List: contact commits-help@drill.apache.org; run by ezmlm
Precedence: bulk
List-Help: 
List-Unsubscribe: 
List-Post: 
List-Id: 
Reply-To: commits@drill.apache.org
Delivered-To: mailing list commits@drill.apache.org
Received: (qmail 74991 invoked by uid 99); 4 Aug 2015 23:20:51 -0000
Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 04 Aug 2015 23:20:51 +0000
Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id D14EFE0418; Tue, 4 Aug 2015 23:20:50 +0000 (UTC)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: bridgetb@apache.org
To: commits@drill.apache.org
Date: Tue, 04 Aug 2015 23:20:50 -0000
Message-Id: <8a15c2da4cdf4e21a38de5c3db299601@git.apache.org>
X-Mailer: ASF-Git Admin Mailer
Subject: [1/2] drill git commit: Daniel's review

Repository: drill
Updated Branches:
  refs/heads/gh-pages f7515b69c -> 98dfbea8b

Daniel's review

Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/1fc4d00c
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/1fc4d00c
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/1fc4d00c
Branch: refs/heads/gh-pages
Commit: 1fc4d00cfab6524a966285bbea19aac10fc59f9a
Parents: f7515b6
Author: Kristine Hahn 
Authored: Mon Jul 27 15:41:24 2015 -0700
Committer: Kristine Hahn 
Committed: Mon Jul 27 15:43:20 2015 -0700
---------------------------------------------------------------------- .../010-connect-a-data-source-introduction.md | 4 +- .../020-storage-plugin-registration.md | 8 +- .../035-plugin-configuration-basics.md | 30 +++--- .../040-file-system-storage-plugin.md | 96 +++++++++++--------- _docs/connect-a-data-source/050-workspaces.md | 22 +++-- .../060-hbase-storage-plugin.md | 7 +- .../070-hive-storage-plugin.md | 26 +++--- .../080-drill-default-input-format.md | 73 ++++++--------- .../090-mongodb-plugin-for-apache-drill.md | 39 +++----- .../050-json-data-model.md | 2 +- .../030-querying-plain-text-files.md | 7 +- .../005-about-the-mapr-sandbox.md | 9 +- 12 files changed, 152 insertions(+), 171 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/010-connect-a-data-source-introduction.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/010-connect-a-data-source-introduction.md b/_docs/connect-a-data-source/010-connect-a-data-source-introduction.md index 2ba46e3..7e4e2b5 100644 --- a/_docs/connect-a-data-source/010-connect-a-data-source-introduction.md +++ b/_docs/connect-a-data-source/010-connect-a-data-source-introduction.md @@ -2,9 +2,9 @@ title: "Connect a Data Source Introduction" parent: "Connect a Data Source" --- -A storage plugin is a software module for connecting Drill to data sources. A storage plugin typically optimizes execution of Drill queries, provides the location of the data, and configures the workspace and file formats for reading data. Several storage plugins are installed with Drill that you can configure to suit your environment. Through the storage plugin, Drill connects to a data source, such as a database, a file on a local or distributed file system, or a Hive metastore. +A storage plugin is a software module for connecting Drill to data sources. 
A storage plugin typically optimizes execution of Drill queries, provides the location of the data, and configures the workspace and file formats for reading data. Several storage plugins are installed with Drill that you can configure to suit your environment. Through a storage plugin, Drill connects to a data source, such as a database, a file on a local or distributed file system, or a Hive metastore.

-You can modify the default configuration of a storage plugin X and give the new version a unique name Y. This document refers to Y as a different storage plugin, although it is actually just a reconfiguration of original interface. When you execute a query, Drill gets the storage plugin name in one of several ways:
+You can modify the default configuration X of a storage plugin and give the new configuration a unique name Y. This document refers to Y as a different storage plugin, although it is actually just a reconfiguration of the original interface. When you execute a query, Drill gets the storage plugin configuration name in one of several ways:

* The FROM clause of the query can identify the plugin to use.
* The USE command can precede the query.

http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/020-storage-plugin-registration.md
----------------------------------------------------------------------
diff --git a/_docs/connect-a-data-source/020-storage-plugin-registration.md b/_docs/connect-a-data-source/020-storage-plugin-registration.md
index 73ff050..831f18d 100644
--- a/_docs/connect-a-data-source/020-storage-plugin-registration.md
+++ b/_docs/connect-a-data-source/020-storage-plugin-registration.md
@@ -2,11 +2,11 @@ title: "Storage Plugin Registration" parent: "Connect a Data Source" ---
-You connect Drill to a file system, Hive, HBase, or other data source through a storage plugin. On the Storage tab of the Web UI, you can view and reconfigure a storage plugin.
You can create a new name for the reconfigured version, thereby registering the new version. To open the Storage tab, go to `http://:8047/storage`, where IP address is any one of the installed Drillbits in a distributed system or `localhost` in an embedded system:
+You connect Drill to a file system, Hive, HBase, or other data source through a storage plugin. On the Storage tab of the Web UI, you can view and reconfigure a storage plugin. You can create a new name for the reconfigured version, thereby registering the new version. To open the Storage tab, go to `http://:8047/storage`, where IP address is the host name or IP address of one of the installed Drillbits in a distributed system or `localhost` in an embedded system:

![drill-installed plugins]({{ site.baseurl }}/docs/img/plugin-default.png)

-The Drill installation registers the the `cp`, `dfs`, `hbase`, `hive`, and `mongo` storage plugin configurations.
+The Drill installation registers the `cp`, `dfs`, `hbase`, `hive`, and `mongo` default storage plugin configurations.

* `cp` Points to a JAR file in the Drill classpath that contains the Transaction Processing Performance Council (TPC) benchmark schema TPC-H that you can query.

@@ -20,7 +20,9 @@ point to any distributed file system, such as a Hadoop or S3 file system.

* `mongo` Provides a connection to MongoDB data.

-In the [Drill sandbox]({{site.baseurl}}/docs/about-the-mapr-sandbox/), the `dfs` storage plugin connects you to a simulation of a distributed file system. If you install Drill, `dfs` connects you to the root of your file system.
+In the [Drill sandbox]({{site.baseurl}}/docs/about-the-mapr-sandbox/), the `dfs` storage plugin configuration connects you to a Hadoop environment pre-configured with Drill. If you install Drill, `dfs` connects you to the root of your file system.
+
+## Storage Plugin Configuration Persistence

Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode).
The storage plugin configuration persists after upgrading, so a configuration that you created in one version of Drill appears in the Drill Web UI of an upgraded version of Drill. For example, on Mac OS X, Drill uses `/tmp/drill/sys.storage_plugins` to store storage plugin configurations. To revert to the default storage plugins for a particular version, in embedded mode, delete the contents of this directory and restart the Drill shell. http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/035-plugin-configuration-basics.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/035-plugin-configuration-basics.md b/_docs/connect-a-data-source/035-plugin-configuration-basics.md index 3319581..9eda181 100644 --- a/_docs/connect-a-data-source/035-plugin-configuration-basics.md +++ b/_docs/connect-a-data-source/035-plugin-configuration-basics.md @@ -2,12 +2,14 @@ title: "Plugin Configuration Basics" parent: "Storage Plugin Configuration" --- -When you add or update storage plugin instances on one Drill node in a +When you add or update storage plugin configurations on one Drill node in a cluster having multiple installations of Drill, Drill broadcasts the information to other Drill nodes to synchronize the storage plugin configurations. You do not need to -restart any of the Drillbits when you add or update a storage plugin instance. +restart any of the Drillbits when you add or update a storage plugin configuration. -Use the Drill Web UI to update or add a new storage plugin configuration. Launch a web browser, go to: `http://:8047`, and then go to the Storage tab. +## Using the Drill Web UI + +Use the Drill Web UI to update or add a new storage plugin configuration. The Drill shell needs to be running to access the Drill Web UI. To open the Drill Web UI, launch a web browser, and go to: `http://:8047` of any Drillbit in the cluster. 
Select the Storage tab to view, update, or add a new storage plugin configuration. To create a name and new configuration: @@ -46,7 +48,7 @@ The following table describes the attributes you configure for storage plugins i "connection" "classpath:///"
"file:///"
"mongodb://localhost:27017/"
"hdfs://" implementation-dependent - Type of distributed file system, such as HDFS, Amazon S3, or files in your file system. + The type of distributed file system, such as HDFS, Amazon S3, or files in your file system, and an address/path name. "workspaces" @@ -70,13 +72,13 @@ The following table describes the attributes you configure for storage plugins i "workspaces". . . "defaultInputFormat" null
"parquet"
"csv"
"json" no - Format for reading data, regardless of extension. Default = Parquet. + Format for reading data, regardless of extension. Default = "parquet" "formats" "psv"
"csv"
"tsv"
"parquet"
"json"
"avro"
"maprdb" * yes - One or more valid file formats for reading. Drill implicitly detects formats of some files based on extension or bits of data in the file, others require configuration. + One or more valid file formats for reading. Drill implicitly detects formats of some files based on extension or bits of data in the file; others require configuration. "formats" . . . "type" @@ -88,13 +90,13 @@ The following table describes the attributes you configure for storage plugins i formats . . . "extensions" ["csv"] format-dependent - Extensions of the files that Drill can read. + File name extensions that Drill can read. "formats" . . . "delimiter" "\t"
"," format-dependent - One or more characters that serve as a record seperator in a delimited text file, such as CSV. Use a 4-digit hex ascii code syntax \uXXXX for a non-printable delimiter. + Sequence of one or more characters that serve as a record separator in a delimited text file, such as CSV. Use a 4-digit hex code syntax \uXXXX for a non-printable delimiter. "formats" . . . "quote" @@ -148,7 +150,7 @@ Drill provides a REST API that you can use to create a storage plugin configurat The storage plugin configuration name. * config - The attribute settings as you would enter it in the Web UI. + The attribute settings as entered in the Web UI. For example, this command creates a storage plugin named myplugin for reading files of an unknown type located on the root of the file system: @@ -156,13 +158,13 @@ For example, this command creates a storage plugin named myplugin for reading fi ## Bootstrapping a Storage Plugin -If you need to add a storage plugin to Drill and do not want to use a web browser, you can create a [bootstrap-storage-plugins.json](https://github.com/apache/drill/blob/master/contrib/storage-hbase/src/main/resources/bootstrap-storage-plugins.json) file and include it on the classpath when starting Drill. The storage plugin loads when Drill starts up. +If you need to add a storage plugin configurationto Drill and do not want to use a web browser, you can create a [bootstrap-storage-plugins.json](https://github.com/apache/drill/blob/master/contrib/storage-hbase/src/main/resources/bootstrap-storage-plugins.json) file and include it on the classpath when starting Drill. The storage plugin configuration loads when Drill starts up. -Bootstrapping a storage plugin works only when the first drillbit in the cluster first starts up. The configuration is -stored in zookeeper, preventing Drill from picking up the boostrap-storage-plugins.json again. 
+Bootstrapping a storage plugin configuration works only when the first Drillbit in the cluster first starts up. The configuration is
+stored in ZooKeeper, preventing Drill from picking up the bootstrap-storage-plugins.json again.

After cluster startup, you have to use the REST API or Drill Web UI to add a storage plugin configuration. Alternatively, you
can modify the entry in ZooKeeper by uploading the json file for
that plugin to the /drill directory of the zookeeper installation, or by just deleting the /drill directory if you do not have configuration properties to preserve.

-If you configure an HBase storage plugin using bootstrap-storage-plugins.json file and HBase is not installed, you might experience a delay when executing the queries. Configure the [HBase client timeout](http://hbase.apache.org/book.html#config.files) and retry settings in the config block of HBase plugin instance configuration.
+If you load an HBase storage plugin configuration using the bootstrap-storage-plugins.json file and HBase is not installed, you might experience a delay when executing the queries. Configure the [HBase client timeout](http://hbase.apache.org/book.html#config.files) and retry settings in the config block of the HBase plugin configuration.
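For reference, a bootstrap file maps each plugin name to its configuration under a top-level `storage` key. The sketch below is illustrative only — the plugin chosen and the ZooKeeper host/port values are assumptions; the hbase bootstrap file linked above in the Drill source tree is the authoritative template:

```
{
  "storage": {
    "hbase": {
      "type": "hbase",
      "config": {
        "hbase.zookeeper.quorum": "localhost",
        "hbase.zookeeper.property.clientPort": "2181"
      },
      "enabled": true
    }
  }
}
```

Place the file on the classpath before the first Drillbit in the cluster starts; as noted above, the file is read only once and the result is then persisted in ZooKeeper.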
http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/040-file-system-storage-plugin.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/040-file-system-storage-plugin.md b/_docs/connect-a-data-source/040-file-system-storage-plugin.md index 7d299d2..3380d28 100644 --- a/_docs/connect-a-data-source/040-file-system-storage-plugin.md +++ b/_docs/connect-a-data-source/040-file-system-storage-plugin.md @@ -2,63 +2,71 @@ title: "File System Storage Plugin" parent: "Storage Plugin Configuration" --- -You can register a storage plugin instance that connects Drill to a local file system or to a distributed file system registered in `core-site.xml`, such as S3 +You can register a storage plugin configuration that connects Drill to a local file system or to a distributed file system registered in the Hadoop `core-site.xml`, such as S3 or HDFS. By -default, Apache Drill includes an storage plugin named `dfs` that points to the local file +default, Apache Drill includes a storage plugin configuration named `dfs` that points to the local file system on your machine by default. ## Connecting Drill to a File System -In a Drill cluster, you typically do not query the local file system, but instead place files on the distributed file system. You configure the connection property of the storage plugin workspace to connect Drill to a distributed file system. For example, the following connection properties connect Drill to an HDFS cluster from a client: +In a Drill cluster, you typically do not query the local file system, but instead place files on the distributed file system. You configure the connection property of the storage plugin workspace to connect Drill to a distributed file system. 
For example, the following connection property connects Drill to an HDFS cluster from a client: `"connection": "hdfs://:/"` -To query a file on HDFS from a node on the cluster, you can simply change the connection to from `file:///` to `hdfs://` in the `dfs` storage plugin. +To query a file on HDFS from a node on the cluster, you can simply change the connection from `file:///` to `hdfs://` in the `dfs` storage plugin. + +To change the `dfs` storage plugin configuration to point to a different local or a distributed file system, use `connection` attributes as shown in the following examples. -To change the `dfs` storage plugin configuration to point to a local or a distributed file system, use `connection` attributes as shown in the following example. * Local file system example: - { - "type": "file", - "enabled": true, - "connection": "file:///", - "workspaces": { - "root": { - "location": "/user/max/donuts", - "writable": false, - "defaultInputFormat": null - } - }, - "formats" : { - "json" : { - "type" : "json" - } - } + ``` + { + "type": "file", + "enabled": true, + "connection": "file:///", + "workspaces": { + "root": { + "location": "/user/max/donuts", + "writable": false, + "defaultInputFormat": null + } + }, + "formats" : { + "json" : { + "type" : "json" } + } + } + ``` + * Distributed file system example: - - { - "type" : "file", - "enabled" : true, - "connection" : "hdfs://10.10.30.156:8020/", - "workspaces" : { - "root" : { - "location" : "/user/root/drill", - "writable" : true, - "defaultInputFormat" : null - } - }, - "formats" : { - "json" : { - "type" : "json" - } + + ``` + { + "type" : "file", + "enabled" : true, + "connection" : "hdfs://10.10.30.156:8020/", + "workspaces" : { + "root" : { + "location" : "/user/root/drill", + "writable" : true, + "defaultInputFormat" : null + } + }, + "formats" : { + "json" : { + "type" : "json" } } + } + ``` + +To connect to a Hadoop file system, you include the IP address and port number of the +name node. 
-To connect to a Hadoop file system, you include the IP address of the -name node and the port number. +### Querying Donuts Example -The following example shows an file type storage plugin configuration with a +The following example shows a file type storage plugin configuration with a workspace named `json_files`. The configuration points Drill to the `/users/max/drill/json/` directory in the local file system `(dfs)`: @@ -74,18 +82,16 @@ workspace named `json_files`. The configuration points Drill to the } }, -The `connection` parameter in this configuration is "`file:///`", connecting Drill to the local file system (`dfs`). +The `connection` parameter in this configuration is "`file:///`", connecting Drill to the local file system. To query a file in the example `json_files` workspace, you can issue the `USE` command to tell Drill to use the `json_files` workspace configured in the `dfs` instance for each query that you issue: -**Example** - USE dfs.json_files; - SELECT * FROM dfs.json_files.`donuts.json` WHERE type='frosted' + SELECT * FROM `donuts.json` WHERE type='frosted' If the `json_files` workspace did not exist, the query would have to include the -full path to the `donuts.json` file: +full file path name to the `donuts.json` file: SELECT * FROM dfs.`/users/max/drill/json/donuts.json` WHERE type='frosted'; \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/050-workspaces.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/050-workspaces.md b/_docs/connect-a-data-source/050-workspaces.md index b535267..258e3fd 100644 --- a/_docs/connect-a-data-source/050-workspaces.md +++ b/_docs/connect-a-data-source/050-workspaces.md @@ -2,26 +2,28 @@ title: "Workspaces" parent: "Storage Plugin Configuration" --- -You can define one or more workspaces in a storage plugin configuration. 
The workspace defines the directory location of files in a local or distributed file system. Drill searches the workspace to locate data when +You can define one or more workspaces in a storage plugin configuration. The workspace defines the location of files in subdirectories of a local or distributed file system. Drill searches the workspace to locate data when you run a query. The `default` workspace points to the root of the file system. -Configuring `workspaces` to include a file location simplifies the query, which is important when querying the same data source repeatedly. After you configure a long path name in the workspaces location property, instead of -using the full path to the data source, you use dot notation in the FROM +Configuring workspaces to include a subdirectory simplifies the query, which is important when querying the same files repeatedly. After you configure a long path name in the workspace `location` property, instead of +using the full path name to the data source, you use dot notation in the FROM clause. -``.```` +``.```` -To query the data source while you are not *using* that storage plugin, include the plugin name. This syntax assumes you did not issue a USE statement to connect to a storage plugin that defines the +Where `` is the path name of a subdirectory, such as `/users/max/drill/json` enclosed in double quotation marks as shown in the ["Querying Donuts Example."](/docs/file-system-storage-plugin/#querying-donuts-example) + +To query the data source when you have not set the default schema name to the storage plugin configuration, include the plugin name. 
This syntax assumes you did not issue a USE statement to connect to a storage plugin that defines the location of the data: -``..```` +``..```` ## No Workspaces for Hive and HBase -You cannot configure workspaces for -`hive` and `hbase`, though Hive databases show up as workspaces in +You cannot include workspaces in the configurations of the +`hive` and `hbase` plugins installed with Apache Drill, though Hive databases show up as workspaces in Drill. Each `hive` instance includes a `default` workspace that points to the Hive metastore. When you query files and tables in the `hive default` workspaces, you can omit the workspace name from the query. @@ -34,9 +36,9 @@ using either of the following queries and get the same results: SELECT * FROM hive.customers LIMIT 10; SELECT * FROM hive.`default`.customers LIMIT 10; -{% include startnote.html %}Default is a reserved word. You must enclose reserved words in back ticks.{% include endnote.html %} +{% include startnote.html %}Default is a reserved word. You must enclose reserved words when used as identifiers in back ticks.{% include endnote.html %} -Because the HBase storage plugin configuration does not have a workspace, you can use the following +Because the HBase storage plugin does not accommodate a workspace, you can use the following query: SELECT * FROM hbase.customers LIMIT 10; http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/060-hbase-storage-plugin.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/060-hbase-storage-plugin.md b/_docs/connect-a-data-source/060-hbase-storage-plugin.md index 488a564..d97feab 100644 --- a/_docs/connect-a-data-source/060-hbase-storage-plugin.md +++ b/_docs/connect-a-data-source/060-hbase-storage-plugin.md @@ -2,12 +2,9 @@ title: "HBase Storage Plugin" parent: "Storage Plugin Configuration" --- -Specify a ZooKeeper quorum to connect -Drill to an HBase data source. 
Drill supports HBase version 0.98. +When connecting Drill to an HBase data source using the HBase storage plugin installed with Drill, you need to specify a ZooKeeper quorum. Drill supports HBase version 0.98. -To HBase storage plugin configuration installed with Drill appears as follows when you navigate to [http://localhost:8047](http://localhost:8047/), and select the **Storage** tab. - - **Example** +To view or change the HBase storage plugin configuration, use the [Drill Web UI]({{ site.baseurl }}/docs/plugin-configuration-basics/#using-the-drill-web-ui). In the Web UI, select the **Storage** tab, and then click the **Update** button for the `hbase` storage plugin configuration. The following example shows a typical HBase storage plugin: { "type": "hbase", http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/070-hive-storage-plugin.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/070-hive-storage-plugin.md b/_docs/connect-a-data-source/070-hive-storage-plugin.md index c7ab31f..83b9e43 100644 --- a/_docs/connect-a-data-source/070-hive-storage-plugin.md +++ b/_docs/connect-a-data-source/070-hive-storage-plugin.md @@ -7,22 +7,22 @@ using custom SerDes or InputFormat/OutputFormat, all nodes running Drillbits must have the SerDes or InputFormat/OutputFormat `JAR` files in the `/jars/3rdparty` folder. -## Hive Remote Metastore +## Hive Remote Metastore Configuration -In this configuration, the Hive metastore runs as a separate service outside +The Hive metastore configuration runs as a separate service outside of Hive. Drill communicates with the Hive metastore through Thrift. The metastore service communicates with the Hive database over JDBC. Point Drill to the Hive metastore service address, and provide the connection parameters -in the Drill Web UI to configure a connection to Drill. +in a Hive storage plugin configuration to configure a connection to Drill. 
{% include startnote.html %}Verify that the Hive metastore service is running before you register the Hive metastore.{% include endnote.html %} -To configure a remote Hive metastore, complete the following steps: +To register a remote Hive metastore with Drill: 1. Issue the following command to start the Hive metastore service on the system specified in the `hive.metastore.uris`: `hive --service metastore` -2. Navigate to `http://:8047`, and select the **Storage** tab. -3. In the disabled storage plugins section, click **Update** next to the `hive` instance. +2. In the [Drill Web UI]({{ site.baseurl }}/docs/plugin-configuration-basics/#using-the-drill-web-ui), select the **Storage** tab. +3. In the list of disabled storage plugins in the Drill Web UI, click **Update** next to the `hive` instance. For example: { "type": "hive", @@ -35,15 +35,13 @@ To configure a remote Hive metastore, complete the following steps: "hive.metastore.sasl.enabled": "false" } } -4. In the configuration window, add the `Thrift URI` and port to `hive.metastore.uris`. +4. In the configuration window, add the `Thrift URI` and port to `hive.metastore.uris`. For example: - **Example** - ... "configProps": { "hive.metastore.uris": "thrift://:", ... -5. Change the default location of files to suit your environment, for example, change `"fs.default.name": "file:///"` to one of these locations: +5. Change the default location of files to suit your environment; for example, change `"fs.default.name"` property from `"file:///"` to one of these locations: * `hdfs://` * `hdfs://:` 6. If you are running Drill and Hive in a secure MapR cluster, remove the following line from the configuration: @@ -54,9 +52,9 @@ To configure a remote Hive metastore, complete the following steps: After configuring a Hive storage plugin, you can [query Hive tables]({{ site.baseurl }}/docs/querying-hive/). 
-## Hive Embedded Metastore
+## Hive Embedded Metastore Configuration

-In this configuration, the Hive metastore is embedded within the Drill process. Configure an embedded metastore only in a cluster that runs a single Drillbit and only for testing purposes. Do not embed the Hive metastore in production systems.
+The Hive metastore configuration is embedded within the Drill process. Configure an embedded metastore only in a cluster that runs a single Drillbit and only for testing purposes. Do not embed the Hive metastore in production systems.

Provide the metastore database configuration settings in the Drill Web UI. Before you configure an embedded Hive metastore, verify that the driver you use to connect to the Hive metastore is in the Drill classpath located in `//lib/.` If the driver is not there, copy the driver to `//lib` on the Drill node. For more information about storage types and configurations, refer to ["Hive Metastore Administration"](https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin).

@@ -64,7 +62,7 @@ installation directory>/lib` on the Drill node. For more information about stora

To configure an embedded Hive metastore, complete the following steps:

-1. Navigate to `http://:8047`, and select the **Storage** tab.
+1. In the [Drill Web UI]({{ site.baseurl }}/docs/plugin-configuration-basics/#using-the-drill-web-ui), select the **Storage** tab.
2. In the disabled storage plugins section, click **Update** next to `hive` instance.
3. In the configuration window, add the database configuration settings.

@@ -81,6 +79,6 @@ steps: "hive.metastore.sasl.enabled": "false" } }
-5. Change the `"fs.default.name":` attribute to specify the default location of files. The value needs to be a URI that is available and capable of handling file system requests. For example, change the local file system URI `"file:///"` to the HDFS URI: `hdfs://`, or to the path on HDFS with a namenode: `hdfs://:`
+5.
Change the `"fs.default.name"` attribute to specify the default location of files. The value needs to be a URI that is available and capable of handling file system requests. For example, change the local file system URI `"file:///"` to the HDFS URI: `hdfs://`, or to the path on HDFS with a namenode: `hdfs://:` 6. Click **Enable**. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/080-drill-default-input-format.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/080-drill-default-input-format.md b/_docs/connect-a-data-source/080-drill-default-input-format.md index 960512d..fd6b768 100644 --- a/_docs/connect-a-data-source/080-drill-default-input-format.md +++ b/_docs/connect-a-data-source/080-drill-default-input-format.md @@ -3,62 +3,47 @@ title: "Drill Default Input Format" parent: "Storage Plugin Configuration" --- You can define a default input format to tell Drill what file type exists in a -workspace within a file system. Drill determines the file type based on file -extensions and magic numbers when searching a workspace. +workspace within a file system. -Magic numbers are file signatures that Drill uses to identify Parquet files. -If Drill cannot identify the file type based on file extensions or magic +Normally, Drill determines the file type based on file +extensions and *magic numbers* when searching a workspace. Magic numbers are file signatures that Drill uses to identify Parquet files. If Drill cannot identify the file type based on file extensions or magic numbers, the query fails. Defining a default input format can prevent queries from failing in situations where Drill cannot determine the file type. -If you incorrectly define the file type in a workspace and Drill cannot -determine the file type, the query fails. For example, if JSON files do not have a `.json` extension, the query fails. 
- -You can define one default input format per workspace. If you do not define a -default input format, and Drill cannot detect the file format, the query -fails. You can define a default input format for any of the file types that -Drill supports. Currently, Drill supports the following types: +If you do not define the default file type in a workspace or incorrectly define the default file type, and Drill cannot +determine the file type without this information, the query fails. You can define one default input format per workspace. You can define a default input format for any of the file types that +Drill supports. Currently, Drill supports the following input types: * Avro * CSV, TSV, or PSV * Parquet * JSON - * MapR-DB* - -\* Only available when you install Drill on a cluster using the mapr-drill package. - -## Defining a Default Input Format -You define the default input format for a file system workspace through the -Drill Web UI. You must have a [defined workspace]({{ site.baseurl }}/docs/workspaces) before you can define a -default input format. +You must have a [defined workspace]({{ site.baseurl }}/docs/workspaces) before you can define a default input format. -To define a default input format for a workspace, complete the following -steps: +To define a default input format for a workspace: - 1. Navigate to the Drill Web UI at `:8047`. The Drillbit process must be running on the node before you connect to the Drill Web UI. + 1. Navigate to the [Drill Web UI]({{ site.baseurl }}/docs/plugin-configuration-basics/#using-the-drill-web-ui). The Drillbit process must be running on the node before you connect to the Drill Web UI. 2. Select **Storage** in the toolbar. - 3. Click **Update** next to the storage plugin for which you want to define a default input format for a workspace. + 3. Click **Update** next to the storage plugin configuration for which you want to define a default input format for a workspace. 4. 
In the Configuration area, locate the workspace, and change the `defaultInputFormat` attribute to any of the supported file types. - **Example** - - { - "type": "file", - "enabled": true, - "connection": "hdfs://", - "workspaces": { - "root": { - "location": "/drill/testdata", - "writable": false, - "defaultInputFormat": csv - }, - "local" : { - "location" : "/max/proddata", - "writable" : true, - "defaultInputFormat" : "json" - } - -## Querying Compressed Files - -You can query compressed GZ files, such as JSON and CSV, as well as uncompressed files. The file extension specified in the `formats . . . extensions` property of the storage plugin configuration must precede the gz extension in the file name. For example, `proddata.json.gz` or `mydata.csv.gz` are valid file names to use in a query, as shown in the example in ["Querying the GZ File Directly"]({{site.baseurl"}}/docs/querying-plain-text-files/#query-the-gz-file-directly). +### Example of Defining a Default Input Format + +``` +{ + "type": "file", + "enabled": true, + "connection": "hdfs://", + "workspaces": { + "root": { + "location": "/drill/testdata", + "writable": false, + "defaultInputFormat": "csv" + }, + "local" : { + "location" : "/max/proddata", + "writable" : true, + "defaultInputFormat" : "json" + } + } +} +``` \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/connect-a-data-source/090-mongodb-plugin-for-apache-drill.md ---------------------------------------------------------------------- diff --git a/_docs/connect-a-data-source/090-mongodb-plugin-for-apache-drill.md b/_docs/connect-a-data-source/090-mongodb-plugin-for-apache-drill.md index afd6ee2..7e439e2 100644 --- a/_docs/connect-a-data-source/090-mongodb-plugin-for-apache-drill.md +++ b/_docs/connect-a-data-source/090-mongodb-plugin-for-apache-drill.md @@ -4,18 +4,16 @@ parent: "Connect a Data Source" --- ## Overview -Drill supports MongoDB 3.0, providing a mongodb format plugin to connect to MongoDB using
MongoDB's latest Java driver. You can run queries -to read, but not write, the Mongo data using Drill. Attempting to write data back to Mongo results in an error. You do not need any upfront schema definitions. +Drill supports MongoDB 3.0, providing a mongodb storage plugin to connect to MongoDB using MongoDB's latest Java driver. You can run queries +to read, but not write, Mongo data using Drill. Attempting to write data back to Mongo results in an error. You do not need any upfront schema definitions. -{% include startnote.html %}A local instance of Drill is used in this tutorial for simplicity. {% include endnote.html %} +{% include startnote.html %}In the following examples, you use a local instance of Drill for simplicity. {% include endnote.html %} You can also run Drill and MongoDB together in distributed mode. ### Before You Begin -Before you can query MongoDB with Drill, you must have Drill and MongoDB -installed on your machine. Examples in this tutorial use zip code aggregation data -provided by MongoDB that you download in the following steps: +To query MongoDB with Drill, you install Drill and MongoDB, and then you import zip code aggregation data into MongoDB. 1. [Install Drill]({{ site.baseurl }}/docs/installing-drill-in-embedded-mode), if you do not already have it installed. 2. [Install MongoDB](http://docs.mongodb.org/manual/installation), if you do not already have it installed. @@ -23,20 +21,14 @@ provided by MongoDB that you download in the following steps: ## Configuring MongoDB -Start Drill and configure the MongoDB storage plugin in the Drill Web -UI to connect to Drill. Drill must be running in order to access the Web UI. - -Complete the following steps to configure MongoDB as a data source for Drill: +Drill must be running in order to access the Web UI, where you view and enable storage plugin configurations. Start Drill, and then view and enable the MongoDB storage plugin configuration as described in the following procedure: 1.
[Start the Drill shell]({{site.baseurl}}/docs/starting-drill-on-linux-and-mac-os-x/). The Drill shell needs to be running to access the Drill Web UI. - 2. Open a browser window, and navigate to the Drill Web UI at `http://localhost:8047`. - 3. In the navigation bar, click **Storage**. - 4. Under Disabled Storage Plugins, select **Update** next to the `mongo` storage plugin. - 5. In the Configuration window, verify that `"enabled"` is set to ``"true."`` - - **Example** + 2. In the [Drill Web UI]({{ site.baseurl }}/docs/plugin-configuration-basics/#using-the-drill-web-ui), select the **Storage** tab. + 3. Under Disabled Storage Plugins, click **Update** next to the `mongo` storage plugin configuration. + 4. In the Configuration window, review the default configuration: { "type": "mongo", @@ -49,7 +41,7 @@ Complete the following steps to configure MongoDB as a data source for Drill: ## Querying MongoDB -In the Drill shell, you can issue the `SHOW DATABASES `command to see a list of databases from all +In the [Drill shell]({{site.baseurl}}/docs/starting-drill-on-linux-and-mac-os-x/), you can issue the `SHOW DATABASES` command to see a list of schemas from all Drill data sources, including MongoDB. If you downloaded the zip codes file, you should see `mongo.zipdb` in the results. @@ -66,16 +58,11 @@ you should see `mongo.zipdb` in the results. | INFORMATION_SCHEMA | +--------------------+ -If you want all queries that you submit to run on `mongo.zipdb`, you can issue +If you want all queries that you submit to default to `mongo.zipdb`, you can issue the `USE` command to change schema. ### Example Queries -The following example queries are included for reference. However, you can use -the SQL power of Apache Drill directly on MongoDB. For more information about, -refer to the [SQL -Reference]({{ site.baseurl }}/docs/sql-reference).
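The `USE` command mentioned above can be sketched as follows; the `0: jdbc:drill:zk=local>` prompt matches the embedded-mode shell used elsewhere on this page:

```
0: jdbc:drill:zk=local> USE mongo.zipdb;
```

After changing the schema, the example queries can reference tables such as `zipcodes` without the `mongo.zipdb` prefix.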
- **Example 1: View mongo.zipdb Dataset** 0: jdbc:drill:zk=local> SELECT * FROM zipcodes LIMIT 10; @@ -147,7 +134,5 @@ Reference]({{ site.baseurl }}/docs/sql-reference). ## Using ODBC/JDBC Drivers -You can leverage the power of Apache Drill to query MongoDB through standard -BI tools, such as Tableau and SQuirreL. - -For information about Drill ODBC and JDBC drivers, refer to [Drill Interfaces]({{ site.baseurl }}/docs/odbc-jdbc-interfaces). +You can query MongoDB through standard +BI tools, such as Tableau and SQuirreL. For information about Drill ODBC and JDBC drivers, refer to [Drill Interfaces]({{ site.baseurl }}/docs/odbc-jdbc-interfaces). http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/data-sources-and-file-formats/050-json-data-model.md ---------------------------------------------------------------------- diff --git a/_docs/data-sources-and-file-formats/050-json-data-model.md b/_docs/data-sources-and-file-formats/050-json-data-model.md index 75f47b1..59618f8 100644 --- a/_docs/data-sources-and-file-formats/050-json-data-model.md +++ b/_docs/data-sources-and-file-formats/050-json-data-model.md @@ -12,7 +12,7 @@ Semi-structured JSON data often consists of complex, nested elements having sche Using Drill you can natively query dynamic JSON data sets using SQL. Drill treats a JSON object as a SQL record. One object equals one row in a Drill table. -You can also [query compressed .gz files]({{ site.baseurl }}/docs/drill-default-input-format#querying-compressed-json) having JSON as well as uncompressed .json files. +You can also [query compressed .gz files]({{ site.baseurl }}/docs/querying-plain-text-files/#querying-compressed-files) having JSON as well as uncompressed .json files. In addition to the examples presented later in this section, see ["How to Analyze Highly Dynamic Datasets with Apache Drill"](https://www.mapr.com/blog/how-analyze-highly-dynamic-datasets-apache-drill) for information about how to analyze a JSON data set. 
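As a sketch of the compressed-file query described in the paragraph above (the path and file name are hypothetical, not from this commit):

```
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/mydata.json.gz` LIMIT 2;
```

Drill reads the GZ file the same way it reads an uncompressed `.json` file, provided the `json` extension precedes `.gz` in the file name.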
http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md ---------------------------------------------------------------------- diff --git a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md index c17ac33..07e1e03 100644 --- a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md +++ b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md @@ -196,8 +196,11 @@ times a year in the books that Google scans. The Drill default storage plugins support common file formats. +## Querying Compressed Files -## Query the GZ File Directly +You can query compressed GZ files, such as JSON and CSV, as well as uncompressed files. The file extension specified in the `formats . . . extensions` property of the storage plugin configuration must precede the gz extension in the file name. For example, `proddata.json.gz` or `mydata.csv.gz` are valid file names to use in a query, as shown in the next example. + +### Query the GZ File Directly This example covers how to query the GZ file containing the compressed TSV data. The GZ file name needs to be renamed to specify the type of delimited file, such as CSV or TSV. You add `.tsv` before the `.gz` extension in this example. @@ -214,3 +217,5 @@ This example covers how to query the GZ file containing the compressed TSV data. The 5 rows of output appear. 
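The rename-then-query sequence described above can be sketched as follows; the file name and path are placeholders, and the `mv` command assumes a Linux or Mac OS X shell:

```
$ mv /tmp/mydata.gz /tmp/mydata.tsv.gz
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/mydata.tsv.gz` LIMIT 5;
```

Adding `.tsv` before `.gz` tells Drill which delimited format to apply when it decompresses and reads the file.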
+ + http://git-wip-us.apache.org/repos/asf/drill/blob/1fc4d00c/_docs/tutorials/learn-drill-with-the-mapr-sandbox/005-about-the-mapr-sandbox.md ---------------------------------------------------------------------- diff --git a/_docs/tutorials/learn-drill-with-the-mapr-sandbox/005-about-the-mapr-sandbox.md b/_docs/tutorials/learn-drill-with-the-mapr-sandbox/005-about-the-mapr-sandbox.md index c1b2376..01bede8 100644 --- a/_docs/tutorials/learn-drill-with-the-mapr-sandbox/005-about-the-mapr-sandbox.md +++ b/_docs/tutorials/learn-drill-with-the-mapr-sandbox/005-about-the-mapr-sandbox.md @@ -2,12 +2,11 @@ title: "About the MapR Sandbox" parent: "Learn Drill with the MapR Sandbox" --- -This tutorial uses the MapR Sandbox, which is a Hadoop environment pre- -configured with Apache Drill. MapR includes Apache Drill as part of the Hadoop distribution. The MapR -Sandbox with Apache Drill is a fully functional single-node cluster that can -be used to get an overview on Apache Drill in a Hadoop environment. Business +This tutorial uses the MapR Sandbox, which is a Hadoop environment pre-configured with Drill. MapR includes Drill as part of the Hadoop distribution. The MapR +Sandbox with Drill is a fully functional single-node cluster that can +be used to get an overview of Drill in a Hadoop environment. Business and technical analysts, product managers, and developers can use the sandbox -environment to get a feel for the power and capabilities of Apache Drill by +environment to get a feel for the power and capabilities of Drill by performing various types of queries. Hadoop is not a prerequisite for Drill and users can start ramping