Return-Path: X-Original-To: apmail-drill-commits-archive@www.apache.org Delivered-To: apmail-drill-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 433B217B88 for ; Thu, 26 Feb 2015 01:01:16 +0000 (UTC) Received: (qmail 8338 invoked by uid 500); 26 Feb 2015 01:01:11 -0000 Delivered-To: apmail-drill-commits-archive@drill.apache.org Received: (qmail 8258 invoked by uid 500); 26 Feb 2015 01:01:11 -0000 Mailing-List: contact commits-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: commits@drill.apache.org Delivered-To: mailing list commits@drill.apache.org Received: (qmail 7910 invoked by uid 99); 26 Feb 2015 01:01:10 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 26 Feb 2015 01:01:10 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 41F56E0E79; Thu, 26 Feb 2015 01:01:10 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit From: adi@apache.org To: commits@drill.apache.org Date: Thu, 26 Feb 2015 01:01:21 -0000 Message-Id: In-Reply-To: <3516a4a1c7064c0ba0a08c8a22492e3f@git.apache.org> References: <3516a4a1c7064c0ba0a08c8a22492e3f@git.apache.org> X-Mailer: ASF-Git Admin Mailer Subject: [12/13] drill git commit: DRILL-2315: Confluence conversion plus fixes http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/arch/001-core-mod.md ---------------------------------------------------------------------- diff --git a/_docs/arch/001-core-mod.md b/_docs/arch/001-core-mod.md new file mode 100644 index 0000000..17fa18d --- /dev/null +++ b/_docs/arch/001-core-mod.md @@ -0,0 +1,29 @@ +--- +title: "Core Modules within a Drillbit" +parent: "Architectural Overview" +--- +The following image represents components within each Drillbit: + +![drill query flow]({{ site.baseurl }}/docs/img/DrillbitModules.png) + +The following list describes the key components of a Drillbit: + + * **RPC end point**: Drill exposes a low overhead protobuf-based RPC protocol to communicate with the clients. Additionally, a C++ and Java API layers are also available for the client applications to interact with Drill. Clients can communicate to a specific Drillbit directly or go through a ZooKeeper quorum to discover the available Drillbits before submitting queries. It is recommended that the clients always go through ZooKeeper to shield clients from the intricacies of cluster management, such as the addition or removal of nodes. + + * **SQL parser**: Drill uses Optiq, the open source framework, to parse incoming queries. The output of the parser component is a language agnostic, computer-friendly logical plan that represents the query. + * **Storage plugin interfaces**: Drill serves as a query layer on top of several data sources. Storage plugins in Drill represent the abstractions that Drill uses to interact with the data sources. Storage plugins provide Drill with the following information: + * Metadata available in the source + * Interfaces for Drill to read from and write to data sources + * Location of data and a set of optimization rules to help with efficient and faster execution of Drill queries on a specific data source + + In the context of Hadoop, Drill provides storage plugins for files and +HBase/M7. 
Drill also integrates with Hive as a storage plugin since Hive +provides a metadata abstraction layer on top of files, HBase/M7, and provides +libraries to read data and operate on these sources (Serdes and UDFs). + + When users query files and HBase/M7 with Drill, they can do it directly or go +through Hive if they have metadata defined there. Drill integration with Hive +is only for metadata. Drill does not invoke the Hive execution engine for any +requests. + + * **Distributed cache**: Drill uses a distributed cache to manage metadata (not the data) and configuration information across various nodes. Sample metadata information that is stored in the cache includes query plan fragments, intermediate state of the query execution, and statistics. Drill uses Infinispan as its cache technology. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/arch/002-arch-hilite.md ---------------------------------------------------------------------- diff --git a/_docs/arch/002-arch-hilite.md b/_docs/arch/002-arch-hilite.md new file mode 100644 index 0000000..5ac51bc --- /dev/null +++ b/_docs/arch/002-arch-hilite.md @@ -0,0 +1,10 @@ +--- +title: "Architectural Highlights" +parent: "Architectural Overview" +--- +The goal for Drill is to bring the **SQL Ecosystem** and **Performance** of +the relational systems to **Hadoop scale** data **WITHOUT** compromising on +the **Flexibility** of Hadoop/NoSQL systems. There are several core +architectural elements in Apache Drill that make it a highly flexible and +efficient query engine. + http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/arch/arch-hilite/001-flexibility.md ---------------------------------------------------------------------- diff --git a/_docs/arch/arch-hilite/001-flexibility.md b/_docs/arch/arch-hilite/001-flexibility.md new file mode 100644 index 0000000..0b5c5e3 --- /dev/null +++ b/_docs/arch/arch-hilite/001-flexibility.md @@ -0,0 +1,78 @@ +--- +title: "Flexibility" +parent: "Architectural Highlights" +--- +The following features contribute to Drill's flexible architecture: + +**_Dynamic schema discovery_** + +Drill does not require schema or type specification for the data in order to +start the query execution process. Instead, Drill starts processing the data +in units called record-batches and discovers the schema on the fly during +processing. Self-describing data formats such as Parquet, JSON, AVRO, and +NoSQL databases have schema specified as part of the data itself, which Drill +leverages dynamically at query time. Schema can change over the course of a +Drill query, so all of the Drill operators are designed to reconfigure +themselves when such schema changing events occur. + +**_Flexible data model_** + +Drill is purpose-built from the ground up for complex/multi-structured data +commonly seen in Hadoop/NoSQL applications such as social/mobile, clickstream, +logs, and sensor equipped IOT. From a user point of view, Drill allows access +to nested data attributes, just like SQL columns, and provides intuitive +extensions to easily operate on them. From an architectural point of view, +Drill provides a flexible hierarchical columnar data model that can represent +complex, highly dynamic and evolving data models, and allows for efficient +processing of it without the need to flatten or materialize it at design time +or at execution time. Relational data in Drill is treated as a special or +simplified case of complex/multi-structured data. 
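As a small illustration of the nested-data access described above, a Drill query can address attributes inside a self-describing file directly; the file path and field names in this sketch are hypothetical:

    -- Hypothetical file and fields; Drill discovers the nested structure at read time.
    SELECT t.user_info.device AS device, t.user_info.state AS state
    FROM dfs.`/data/clicks/clicks.json` t
    WHERE t.user_info.state = 'ca';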
+ +**_De-centralized metadata_** + +Unlike other SQL-on-Hadoop technologies or any traditional relational +database, Drill does not have a centralized metadata requirement. In order to +query data through Drill, users do not need to create and manage tables and +views in a metadata repository, or rely on a database administrator group for +such a function. + +Drill metadata is derived from the storage plugins that correspond to data +sources. Drill supports a varied set of storage plugins that provide a +spectrum of metadata ranging from full metadata such as for Hive, partial +metadata such as for HBase, or no central metadata such as for files. + +De-centralized metadata also means that Drill is NOT tied to a single Hive +repository. Users can query multiple Hive repositories at once and then +combine the data with information from HBase tables or with a file in a +distributed file system. + +Users also have the ability to create metadata (tables/views/databases) within +Drill using the SQL DDL syntax. De-centralized metadata is applicable during +metadata creation. Drill allows persisting metadata in one of the underlying +data sources. + +From a client access perspective, Drill metadata is organized just like a +traditional DB (Databases->Tables/Views->Columns). The metadata is accessible +through the ANSI standard INFORMATION_SCHEMA database + +For more information on how to configure and work various data sources with +Drill, refer to [Connect Apache Drill to Data Sources](/drill/docs/connect-to-data-sources). + +**_Extensibility_** + +Drill provides an extensible architecture at all layers, including the storage +plugin, query, query optimization/execution, and client API layers. You can +customize any layer for the specific needs of an organization or you can +extend the layer to a broader array of use cases. + +Drill provides a built in classpath scanning and plugin concept to add +additional storage plugins, functions, and operators with minimal +configuration. + +The following list provides a few examples of Drill’s extensible architectural +capabilities: + +* A high performance Java API to implement custom UDFs/UDAFs +* Ability to go beyond Hadoop by implementing custom storage plugins to other data sources such as Oracle/MySQL or NoSQL stores, such as Mongo or Cassandra +* An API to implement custom operators +* Support for direct execution of strongly specified JSON based logical and physical plans to help with the simplification of testing, and to enable integration of alternative query languages other than SQL. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/arch/arch-hilite/002-performance.md ---------------------------------------------------------------------- diff --git a/_docs/arch/arch-hilite/002-performance.md b/_docs/arch/arch-hilite/002-performance.md new file mode 100644 index 0000000..c6271e0 --- /dev/null +++ b/_docs/arch/arch-hilite/002-performance.md @@ -0,0 +1,55 @@ +--- +title: "Performance" +parent: "Architectural Highlights" +--- +Drill is designed from the ground up for high performance on large datasets. +The following core elements of Drill processing are responsible for Drill's +performance: + +**_Distributed engine_** + +Drill provides a powerful distributed execution engine for processing queries. +Users can submit requests to any node in the cluster. You can simply add new +nodes to the cluster to scale for larger volumes of data, support more users +or to improve performance. 
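For example, a client does not need to know which Drillbit will run its query; connecting sqlline through the ZooKeeper quorum, as sketched below, lets any node in the cluster accept the request:

    # Host names and cluster ID are placeholders; adjust them for your cluster.
    bin/sqlline -u "jdbc:drill:zk=zk1:2181,zk2:2181,zk3:2181/drill/drillbits1" -n admin -p admin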
+ +**_Columnar execution_** + +Drill optimizes for both columnar storage and execution by using an in-memory +data model that is hierarchical and columnar. When working with data stored in +columnar formats such as Parquet, Drill avoids disk access for columns that +are not involved in an analytic query. Drill also provides an execution layer +that performs SQL processing directly on columnar data without row +materialization. The combination of optimizations for columnar storage and +direct columnar execution significantly lowers memory footprints and provides +faster execution of BI/Analytic type of workloads. + +**_Vectorization_** + +Rather than operating on single values from a single table record at one time, +vectorization in Drill allows the CPU to operate on vectors, referred to as a +Record Batches. Record Batches are arrays of values from many different +records. The technical basis for efficiency of vectorized processing is modern +chip technology with deep-pipelined CPU designs. Keeping all pipelines full to +achieve efficiency near peak performance is something impossible to achieve in +traditional database engines, primarily due to code complexity. + +**_Runtime compilation_** + +Runtime compilation is faster compared to the interpreted execution. Drill +generates highly efficient custom code for every single query for every single +operator. Here is a quick overview of the Drill compilation/code generation +process at a glance. + +![drill compiler]({{ site.baseurl }}/docs/img/58.png) + +**Optimistic and pipelined query execution** + +Drill adopts an optimistic execution model to process queries. Drill assumes +that failures are infrequent within the short span of a query and therefore +does not spend time creating boundaries or checkpoints to minimize recovery +time. Failures at node level are handled gracefully. In the instance of a +single query failure, the query is rerun. Drill execution uses a pipeline +model where all tasks are scheduled at once. The query execution happens in- +memory as much as possible to move data through task pipelines, persisting to +disk only if there is memory overflow. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/archive/001-how-to-demo.md ---------------------------------------------------------------------- diff --git a/_docs/archive/001-how-to-demo.md b/_docs/archive/001-how-to-demo.md new file mode 100644 index 0000000..1b46e88 --- /dev/null +++ b/_docs/archive/001-how-to-demo.md @@ -0,0 +1,309 @@ +--- +title: "How to Run the Drill Demo" +parent: "Archived Pages" +--- +# How To Run the Drill Demo +This section describes how to get started by running the Drill demo. + +## Pre-requisites + + * Maven 2 or higher + + On Ubuntu, you can do this as root: + + apt-get install maven2 + + On the Mac, maven is pre-installed. + + Note that installing maven can result in installing java 1.6 and setting that +to your default version. Make sure you check java version before compiling or +running. + + * Java 1.7 + + You will need java 1.7 to compile and run the Drill demo. + + On Ubuntu you can get the right version of Java by doing this as root: + + apt-get install openjdk-7-jdk + sudo update-alternatives --set java $(update-alternatives --list java | grep 7 | head -1) + + On a Mac, go to [Oracle's web- +site](http://www.oracle.com/technetwork/java/javase/downloads/java-se- +jdk-7-download-432154.html) to download and install java 7. 
You will also need +to set JAVA_HOME in order to use the right version of java. + + Drill will not compile correctly using java 6. There is also a subtle problem +that can occur if you have both +java 6 and java 7 with the default version set to 6. In that case, you may be +able to compile, but execution may not work correctly. + + Send email to the dev list if this is a problem for you. + + * Protobuf + + Drill requires Protobuf 2.5. Install this on Ubuntu using: + + apt-get install protobuf-compiler + + On Centos 6.4, OEL or RHEL you will need to compile protobuf-compiler: + + wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2 + tar xfj protobuf-2.5.0.tar.bz2 + pushd protobuf-2.5.0 + ./configure + make + sudo make install + + * git + + On Ubuntu you can install git by doing this as root: + + apt-get install git-all + +On the Mac or Windows, go to [this site](http://git-scm.com/downloads) to +download and install git. + +## Check your installation versions + +Run + + java -version + mvn -version + +Verify that your default java and maven versions are correct, and that maven +runs the right version of java. On my Mac, you see something like this: + + ted:apache-drill-1.0.0-m1$ java -version + java version "1.7.0_11" + Java(TM) SE Runtime Environment (build 1.7.0_11-b21) + Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) + ted:apache-drill-1.0.0-m1$ mvn -version + Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800) + Maven home: /usr/share/maven + Java version: 1.7.0_11, vendor: Oracle Corporation + Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents/Home/jre + Default locale: en_US, platform encoding: UTF-8 + OS name: "mac os x", version: "10.7.5", arch: "x86_64", family: "mac" + ted:apache-drill-1.0.0-m1$ + +## Get the Source + + git clone https://git-wip-us.apache.org/repos/asf/incubator-drill.git + +## Compile the Code + + cd incubator-drill/sandbox/prototype + mvn clean install -DskipTests + rm .classpath + +This takes about a minute on a not-terribly-current MacBook. + +## Run the interactive Drill shell + + ./sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin + +The first time you run this program, you will get reams of output. What is +happening is that the program is running maven in order to build a +(voluminous) class path for the actual program and stores this classpath into +the file called `.classpath`. When you run this program again, it will note +that this file already exists and avoid re-creating it. You should delete this +file every time the dependencies of Drill are changed. If you start getting +"class not found" errors, that is a good hint that `.classpath` is out of date +and needs to be deleted and recreated. + +The `-u` argument to [sqlline](https://github.com/julianhyde/sqlline) is a +JDBC connection string that directs sqlline to connect to drill. The Drill +JDBC driver currently includes enough smarts to run Drill in embedded mode so +this command also effectively starts a local drill bit. The `schema=` part of +the JDBC connection string causes Drill to consider the "parquet-local" +storage engine to be default. Other storage engines can be specified. The list +of supported storage engines can be found in the file +`./sqlparser/src/main/resources/storage-engines.json`. Each storage engine +specifies the format of the data and how to get the data. See the section +below on "Storage Engines" for more detail. 
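For example, after pulling new Drill code or otherwise changing its dependencies, you can force the classpath to be rebuilt on the next run:

    # Remove the cached classpath so sqlline regenerates it on the next start.
    rm .classpath
    ./sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin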
+ +When you run sqlline, you should see something like this after quite a lot of +log messages: + + Connected to: Drill (version 1.0) + Driver: Apache Drill JDBC Driver (version 1.0) + Autocommit status: true + Transaction isolation: TRANSACTION_REPEATABLE_READ + sqlline version ??? by Marc Prud'hommeaux + 0: jdbc:drill:schema=parquet-local> + +**Tip:** To quit sqlline at any time, type "!quit" at the prompt. + + 0: jdbc:drill:schema=parquet-local> !quit + +## Run a Query + +Once you have sqlline running you can now try out some queries: + + select * from "sample-data/region.parquet"; + +You should see a number of debug messages and then something like this: + + +--------------------------------------------------------------------------------------------------------------------------------------------+ + | _MAP | + +--------------------------------------------------------------------------------------------------------------------------------------------+ + | {"R_REGIONKEY":0,"R_NAME":"AFRICA","R_COMMENT":"lar deposits. blithely final packages cajole. regular waters are final requests. regular a | + | {"R_REGIONKEY":1,"R_NAME":"AMERICA","R_COMMENT":"hs use ironic, even requests. s"} | + | {"R_REGIONKEY":2,"R_NAME":"ASIA","R_COMMENT":"ges. thinly even pinto beans ca"} | + | {"R_REGIONKEY":3,"R_NAME":"EUROPE","R_COMMENT":"ly final courts cajole furiously final excuse"} | + | {"R_REGIONKEY":4,"R_NAME":"MIDDLE EAST","R_COMMENT":"uickly special accounts cajole carefully blithely close requests. carefully final asy | + +--------------------------------------------------------------------------------------------------------------------------------------------+ + 5 rows selected (1.103 seconds) + +Drill has no idea what the structure of this +file is in terms of what fields exist, but Drill does know that every record +has a pseudo-field called `_MAP`. This field contains a map of all of the +actual fields to values. When returned via JDBC, this fields is rendered as +JSON since JDBC doesn't really understand maps. + +This can be made more readable by using a query like this: + + select _MAP['R_REGIONKEY'] as region_key, _MAP['R_NAME'] as name, _MAP['R_COMMENT'] as comment + from "sample-data/region.parquet"; + +The output will look something like this: + + +-------------+--------------+---------------------------------------------------------------------------------------------------------------+ + | REGION_KEY | NAME | COMMENT | + +-------------+--------------+---------------------------------------------------------------------------------------------------------------+ + | 0 | AFRICA | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are accordi | + | 1 | AMERICA | hs use ironic, even requests. s | + | 2 | ASIA | ges. thinly even pinto beans ca | + | 3 | EUROPE | ly final courts cajole furiously final excuse | + | 4 | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl | + +-------------+--------------+---------------------------------------------------------------------------------------------------------------+ + +In upcoming versions, Drill will insert the `_MAP[ ... ]` goo and will also +unwrap the contents of `_MAP` in results so that things seem much more like +ordinary SQL. 
The reason that things work this way now is that SQL itself +requires considerable type information for queries to be parsed and that +information doesn't necessarily exist for all kinds of files, especially those +with very flexible schemas. To avoid all these problems, Drill adopts the +convention of the _MAP fields for all kinds of input. + +## A Note Before You Continue + +Drill currently supports a wide variety of queries. It currently also has a +fair number of deficiencies in terms of the number of operators that are +actually supported and exactly which expressions are passed through to the +execution engine in the correct form for execution. + +These problems fall into roughly three categories, + + * missing operators. Many operators have been implemented for only a subset of the types available in Drill. This will cause queries to work for some types of data, but not for others. This is particularly true for operators with many possible type signatures such as comparisons. This lack is being remedied at a fast pace so check back in frequently if you suspect this might be a problem. + + Missing operators will result in error messages like this: + + `UnsupportedOperationException:[ Missing function implementation: compare_to +(BIT-OPTIONAL, BIT-OPTIONAL) ]` + + * missing casts. The SQL parser currently has trouble producing a valid logical plan without sufficient type information. Ironically, this type information is often not necessary to the execution engine because Drill generates the code on the fly based on the types of the data it encounters as data are processed. Currently, the work-around is to cast fields in certain situations to give the parser enough information to proceed. This problem will be remedied soon, but probably not quite as quickly as the missing operators. + + The typical error message that indicates you need an additional cast looks +like + + `Cannot apply '>' to arguments of type ' > '. Supported +form(s): ' > '` + + * weak optimizer. The current optimizer that transforms the logical plan into a physical plan is not the fully-featured cost based optimizer that Optiq normally uses. This is because some of the transformations that are needed for Drill are not yet fully supported by Optiq. In order to allow end-to-end execution of queries, a deterministic peep-hole optimizer has been used instead. This optimizer cannot handle large plan transformations and so some queries cannot be transformed correctly from logical to physical plan. We expect that the necessary changes to the cost-based optimizer will allow it to be used in an upcoming release, but didn't want to delay the current release waiting for that to happen. + +## Try Fancier Queries + +This query does a join between two files: + + SELECT nations.name, regions.name FROM ( + SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name + FROM "sample-data/nation.parquet") nations + join ( + SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name + FROM "sample-data/region.parquet") regions + on nations.regionKey = regions.regionKey + order by nations.name; + +Notice the use of sub-queries to avoid the spread of the `_MAP` idiom. + +This query illustrates how a cast is currently necessary to make the parser +happy: + + SELECT + _MAP['N_REGIONKEY'] as regionKey, + _MAP['N_NAME'] as name + FROM + "sample-data/nation.parquet" + WHERE + cast(_MAP['N_NAME'] as varchar) IN ('MOROCCO', 'MOZAMBIQUE'); + +Here are more queries that you can try. 
+ + // count distinct + SELECT count(distinct _MAP['N_REGIONKEY']) FROM "sample-data/nation.parquet"; + + // aliases + SELECT + _MAP['N_REGIONKEY'] as regionKey, + _MAP['N_NAME'] as name + FROM "sample-data/nation.parquet"; + + // order by + SELECT + _MAP['N_REGIONKEY'] as regionKey, + _MAP['N_NAME'] as name + FROM + "sample-data/nation.parquet" + ORDER BY + _MAP['N_NAME'] DESC; + + // subquery order by + select * from ( + SELECT + _MAP['N_REGIONKEY'] as regionKey, + _MAP['N_NAME'] as name + FROM + "sample-data/nation.parquet" + ) as x + ORDER BY + name DESC; + + // String where + SELECT + _MAP['N_REGIONKEY'] as regionKey, + _MAP['N_NAME'] as name + FROM + "sample-data/nation.parquet" + WHERE + cast(_MAP['N_NAME'] as varchar) > 'M'; + + // INNER Join + Order (parquet) + SELECT n.name, r.name FROM + (SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM "sample-data/nation.parquet")n + join (SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name FROM "sample-data/region.parquet")r + using (regionKey); + + // INNER Join + Order (parquet) + SELECT n.name, r.name FROM + (SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM "sample-data/nation.parquet")n + join (SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name FROM "sample-data/region.parquet")r + on n.regionKey = r.regionKey + order by n.name; + +## Analyze the Execution of Queries + +Drill sends log events to a logback socket appender. This makes it easy to +catch and filter these log events using a tool called Lilith. You can download +[Lilith](http://www.huxhorn.de/) and install it easily. A tutorial can be +[found here](http://ekkescorner.wordpress.com/2009/09/05/osgi-logging-part-8 +-viewing-log-events-lilith/). This is especially important if you find errors +that you want to report back to the mailing list since Lilith will help you +isolate the stack trace of interest. + +By default, Lilith uses a slightly lurid splash page based on a pre-Raphaelite +image of the mythical Lilith. This is easily disabled if the image is not to +your taste (or if your work-mates are not well-versed in Victorian views of +Sumerian mythology). + http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/archive/002-meet-drill.md ---------------------------------------------------------------------- diff --git a/_docs/archive/002-meet-drill.md b/_docs/archive/002-meet-drill.md new file mode 100644 index 0000000..aa9556b --- /dev/null +++ b/_docs/archive/002-meet-drill.md @@ -0,0 +1,41 @@ +--- +title: "What is Apache Drill" +parent: "Archived Pages" +--- +## What is Apache Drill + +Apache Drill by Apache Foundation is the first open source implementation of +the Google's Dremel paper for interactive query processing. Apache Drill +provides low latency ad-hoc queries to many different data sources & nested +data. Drill is designed to scale to 10,000 servers and query petabytes of data +in seconds. + +![drill query flow]({{ site.baseurl }}/docs/img/drill2.png) + +In a nutshell, Few key points about Apache Drill are: + + * Inspired by Google's Dremel + * Supports standard SQL 2003 + * Supports plug-able data sources (HBase, Mongo, HDFS etc) + * Supports nested data (JSON, ProtoBufs, Parquet etc) + * Supports optional schema + * Community driven + +## Where Apache Drill fits in + +Apache Drill is designed as an answer to the Interactive queries problems that +we face while dealing with huge data. 
A typical Drill query takes anywhere from about 100 ms to 3 minutes to execute, compared with batch-oriented engines such as Apache Hadoop MapReduce or Hive/Pig. Below is a diagram to help you relate the execution times:

![drill query flow]({{ site.baseurl }}/docs/img/drill-runtime.png)

## Drill is powerful

Below are a few things that make Apache Drill really powerful:

  * **Speed**: Apache Drill uses an efficient columnar storage format, an optimistic execution engine, and a cache-conscious memory layout. Coordination, query planning, optimization, scheduling, and execution are all distributed across the nodes in the system to maximize parallelization. Apache Drill is blazing fast. Period.
  * **Pluggable data sources**: Apache Drill supports pluggable data sources such as HBase, MongoDB, and HDFS, so Drill keeps working even as your data moves into new data stores.
  * **Nested data**: With support for data sources such as HBase, Cassandra, and MongoDB, Drill allows interactive analysis on all of your data, including nested and schema-less forms. Drill also supports querying nested data formats such as JSON and Parquet.
  * **Flexibility**: Apache Drill provides strongly defined tiers and APIs for straightforward integration with a wide array of technologies.

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/001-plugin-reg.md
----------------------------------------------------------------------
diff --git a/_docs/connect/001-plugin-reg.md b/_docs/connect/001-plugin-reg.md
new file mode 100644
index 0000000..6e1e679
--- /dev/null
+++ b/_docs/connect/001-plugin-reg.md
@@ -0,0 +1,35 @@
---
title: "Storage Plugin Registration"
parent: "Connect to Data Sources"
---
You can connect Drill to a file system, Hive, or HBase data source. To connect Drill to a data source, you must register the data source as a storage plugin instance in the Drill Web UI. You register an instance of a data source as a `file`, `hive`, or `hbase` storage plugin type, and you can register multiple storage plugin instances for each storage plugin type.

Each node with a Drillbit installed serves the Drill Web UI, which you can open in a browser at `http://localhost:8047/`. The Drill Web UI includes `cp`, `dfs`, `hive`, and `hbase` storage plugin instances by default, though the `hive` and `hbase` instances are disabled. You can update the `hive` and `hbase` instances with configuration details and then enable them.

The `cp` instance points to a JAR file in Drill's classpath that contains sample data that you can query. By default, the `dfs` instance points to the local file system on your machine, but you can configure this instance to point to any distributed file system, such as a Hadoop or S3 file system.

When you add or update storage plugin instances on one Drill node in a Drill cluster, Drill broadcasts the information to all of the other Drill nodes so that they all have identical storage plugin configurations. You do not need to restart any of the Drillbits when you add or update a storage plugin instance.

Each storage plugin instance that you register with Drill must have a distinct name. For example, if you register two storage plugin instances for a Hadoop file system, you might name one storage plugin instance `hdfstest` and the other instance `hdfsproduction`.
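The instance name you choose then becomes the schema prefix in your queries; in this sketch the table path is only a placeholder:

    -- Hypothetical path; `hdfstest` is the instance name from the example above.
    SELECT * FROM hdfstest.`/user/max/donuts/donuts.json` LIMIT 10;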
+ +The following example shows an HDFS data source registered in the Drill Web UI +as a storage plugin instance of plugin type "`file"`: + +![drill query flow]({{ site.baseurl }}/docs/img/StoragePluginConfig.png) \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/002-workspaces.md ---------------------------------------------------------------------- diff --git a/_docs/connect/002-workspaces.md b/_docs/connect/002-workspaces.md new file mode 100644 index 0000000..745d61b --- /dev/null +++ b/_docs/connect/002-workspaces.md @@ -0,0 +1,74 @@ +--- +title: "Workspaces" +parent: "Storage Plugin Registration" +--- +When you register an instance of a file system data source, you can configure +one or more workspaces for the instance. A workspace is a directory within the +file system that you define. Drill searches the workspace to locate data when +you run a query. + +Each workspace that you register defines a schema that you can connect to and +query. Configuring workspaces is useful when you want to run multiple queries +on files or tables in a specific directory. You cannot create workspaces for +`hive` and `hbase` instances, though Hive databases show up as workspaces in +Drill. + +The following example shows an instance of a file type storage plugin with a +workspace named `json` configured to point Drill to the +`/users/max/drill/json/` directory in the local file system `(dfs)`: + + { + "type" : "file", + "enabled" : true, + "connection" : "file:///", + "workspaces" : { + "json" : { + "location" : "/users/max/drill/json/", + "writable" : false, + "storageformat" : json + } + }, + +**Note:** The `connection` parameter in the configuration above is "`file:///`", connecting Drill to the local file system (`dfs`). To connect to a Hadoop or MapR file system the `connection` parameter would be "`hdfs:///" `or` "maprfs:///", `respectively. + +To query a file in the example `json` workspace, you can issue the `USE` +command to tell Drill to use the `json` workspace configured in the `dfs` +instance for each query that you issue: + +**Example** + + USE dfs.json; + SELECT * FROM dfs.json.`donuts.json` WHERE type='frosted' + +If the `json `workspace did not exist, the query would have to include the +full path to the `donuts.json` file: + + SELECT * FROM dfs.`/users/max/drill/json/donuts.json` WHERE type='frosted'; + +Using a workspace alleviates the need to repeatedly enter the directory path +in subsequent queries on the directory. + +### Default Workspaces + +Each `file` and `hive` instance includes a `default` workspace. The `default` +workspace points to the file system or to the Hive metastore. When you query +files and tables in the` file` or `hive default` workspaces, you can omit the +workspace name from the query. + +For example, you can issue a query on a Hive table in the `default workspace` +using either of the following formats and get the the same results: + +**Example** + + SELECT * FROM hive.customers LIMIT 10; + SELECT * FROM hive.`default`.customers LIMIT 10; + +**Note:** Default is a reserved word. You must enclose reserved words in back ticks. + +Because HBase instances do not have workspaces, you can use the following +format to query a table in HBase: + + SELECT * FROM hbase.customers LIMIT 10; + +After you register a data source as a storage plugin instance with Drill, and +optionally configure workspaces, you can query the data source. 
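As a quick sanity check after adding a workspace, you can confirm that it shows up as a schema and then query it without the full path; this sketch assumes the `json` workspace defined earlier:

    SHOW DATABASES;
    USE dfs.json;
    SELECT * FROM `donuts.json` WHERE type='frosted';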
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/003-reg-fs.md ---------------------------------------------------------------------- diff --git a/_docs/connect/003-reg-fs.md b/_docs/connect/003-reg-fs.md new file mode 100644 index 0000000..ee385cd --- /dev/null +++ b/_docs/connect/003-reg-fs.md @@ -0,0 +1,64 @@ +--- +title: "Registering a File System" +parent: "Storage Plugin Registration" +--- +You can register a storage plugin instance that connects Drill to a local file +system or a distributed file system registered in `core-site.xml`, such as S3 +or HDFS. When you register a storage plugin instance for a file system, +provide a unique name for the instance, and identify the type as “`file`”. By +default, Drill includes an instance named `dfs `that points to the local file +system on your machine. You can update this configuration to point to a +distributed file system or you can create a new instance to point to a +distributed file system. + +To register a local or a distributed file system with Apache Drill, complete +the following steps: + + 1. Navigate to `[http://localhost:8047](http://localhost:8047/)`, and select the **Storage** tab. + 2. In the New Storage Plugin window, enter a unique name and then click **Create**. + 3. In the Configuration window, provide the following configuration information for the type of file system that you are configuring as a data source. + 1. Local file system example: + + { + "type": "file", + "enabled": true, + "connection": "file:///", + "workspaces": { + "root": { + "location": "/user/max/donuts", + "writable": false, + "storageformat": null + } + }, + "formats" : { + "json" : { + "type" : "json" + } + } + } + 2. Distributed file system example: + + { + "type" : "file", + "enabled" : true, + "connection" : "hdfs://10.10.30.156:8020/", + "workspaces" : { + "root : { + "location" : "/user/root/drill", + "writable" : true, + "storageformat" : "null" + } + }, + "formats" : { + "json" : { + "type" : "json" + } + } + } + + To connect to a Hadoop file system, you must include the IP address of the +name node and the port number. + 4. Click **Enable**. + +Once you have configured a storage plugin instance for the file system, you +can issue Drill queries against it. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/004-reg-hbase.md ---------------------------------------------------------------------- diff --git a/_docs/connect/004-reg-hbase.md b/_docs/connect/004-reg-hbase.md new file mode 100644 index 0000000..0efd435 --- /dev/null +++ b/_docs/connect/004-reg-hbase.md @@ -0,0 +1,32 @@ +--- +title: "Registering HBase" +parent: "Storage Plugin Registration" +--- +Register a storage plugin instance and specify a zookeeper quorum to connect +Drill to an HBase data source. When you register a storage plugin instance for +an HBase data source, provide a unique name for the instance, and identify the +type as “hbase” in the Drill Web UI. + +Currently, Drill only works with HBase version 0.94. + +To register HBase with Drill, complete the following steps: + + 1. Navigate to [http://localhost:8047](http://localhost:8047/), and select the **Storage** tab + 2. In the disabled storage plugins section, click **Update** next to the `hbase` instance. + 3. In the Configuration window, specify the Zookeeper quorum and port. 
+ + **Example** + + { + "type": "hbase", + "config": { + "hbase.zookeeper.quorum": " or ", + "hbase.zookeeper.property.clientPort": "2181" + }, + "enabled": false + } + + 4. Click **Enable**. + +Once you have configured a storage plugin instance for the HBase, you can +issue Drill queries against it. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/005-reg-hive.md ---------------------------------------------------------------------- diff --git a/_docs/connect/005-reg-hive.md b/_docs/connect/005-reg-hive.md new file mode 100644 index 0000000..564bebc --- /dev/null +++ b/_docs/connect/005-reg-hive.md @@ -0,0 +1,83 @@ +--- +title: "Registering Hive" +parent: "Storage Plugin Registration" +--- +You can register a storage plugin instance that connects Drill to a Hive data +source that has a remote or embedded metastore service. When you register a +storage plugin instance for a Hive data source, provide a unique name for the +instance, and identify the type as “`hive`”. You must also provide the +metastore connection information. + +Currently, Drill only works with Hive version 0.12. To access Hive tables +using custom SerDes or InputFormat/OutputFormat, all nodes running Drillbits +must have the SerDes or InputFormat/OutputFormat `JAR` files in the +`/jars/3rdparty` folder. + +## Hive Remote Metastore + +In this configuration, the Hive metastore runs as a separate service outside +of Hive. Drill communicates with the Hive metastore through Thrift. The +metastore service communicates with the Hive database over JDBC. Point Drill +to the Hive metastore service address, and provide the connection parameters +in the Drill Web UI to configure a connection to Drill. + +**Note:** Verify that the Hive metastore service is running before you register the Hive metastore. + +To register a remote Hive metastore with Drill, complete the following steps: + + 1. Issue the following command to start the Hive metastore service on the system specified in the `hive.metastore.uris`: + + hive --service metastore + 2. Navigate to [http://localhost:8047](http://localhost:8047/), and select the **Storage** tab. + 3. In the disabled storage plugins section, click **Update** next to the `hive` instance. + 4. In the configuration window, add the `Thrift URI` and port to `hive.metastore.uris`. + + **Example** + + { + "type": "hive", + "enabled": true, + "configProps": { + "hive.metastore.uris": "thrift://:", + "hive.metastore.sasl.enabled": "false" + } + } + 5. Click **Enable**. + 6. Verify that `HADOOP_CLASSPATH` is set in `drill-env.sh`. If you need to set the classpath, add the following line to `drill-env.sh`. + +Once you have configured a storage plugin instance for a Hive data source, you +can [query Hive tables](/drill/docs/querying-hive/). + +## Hive Embedded Metastore + +In this configuration, the Hive metastore is embedded within the Drill +process. Provide the metastore database configuration settings in the Drill +Web UI. Before you register Hive, verify that the driver you use to connect to +the Hive metastore is in the Drill classpath located in `//lib/.` If the driver is not there, copy the driver to `//lib` on the Drill node. For more information about +storage types and configurations, refer to [AdminManual +MetastoreAdmin](/confluence/display/Hive/AdminManual+MetastoreAdmin). + +To register an embedded Hive metastore with Drill, complete the following +steps: + + 1. 
Navigate to `[http://localhost:8047](http://localhost:8047/)`, and select the **Storage** tab + 2. In the disabled storage plugins section, click **Update** next to `hive` instance. + 3. In the configuration window, add the database configuration settings. + + **Example** + + { + "type": "hive", + "enabled": true, + "configProps": { + "javax.jdo.option.ConnectionURL": "jdbc::///;create=true", + "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh", + "fs.default.name": "file:///", + } + } + 4. Click** Enable.** + 5. Verify that `HADOOP_CLASSPATH` is set in `drill-env.sh`. If you need to set the classpath, add the following line to `drill-env.sh`. + + export HADOOP_CLASSPATH=//hadoop/hadoop-0.20.2 \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/006-default-frmt.md ---------------------------------------------------------------------- diff --git a/_docs/connect/006-default-frmt.md b/_docs/connect/006-default-frmt.md new file mode 100644 index 0000000..7dc55d5 --- /dev/null +++ b/_docs/connect/006-default-frmt.md @@ -0,0 +1,60 @@ +--- +title: "Drill Default Input Format" +parent: "Storage Plugin Registration" +--- +You can define a default input format to tell Drill what file type exists in a +workspace within a file system. Drill determines the file type based on file +extensions and magic numbers when searching a workspace. + +Magic numbers are file signatures that Drill uses to identify Parquet files. +If Drill cannot identify the file type based on file extensions or magic +numbers, the query fails. Defining a default input format can prevent queries +from failing in situations where Drill cannot determine the file type. + +If you incorrectly define the file type in a workspace and Drill cannot +determine the file type, the query fails. For example, if the directory for +which you have defined a workspace contains JSON files and you defined the +default input format as CSV, the query fails against the workspace. + +You can define one default input format per workspace. If you do not define a +default input format, and Drill cannot detect the file format, the query +fails. You can define a default input format for any of the file types that +Drill supports. Currently, Drill supports the following types: + + * CSV + * TSV + * PSV + * Parquet + * JSON + +## Defining a Default Input Format + +You define the default input format for a file system workspace through the +Drill Web UI. You must have a [defined workspace](/drill/docs/workspaces) before you can define a +default input format. + +To define a default input format for a workspace, complete the following +steps: + + 1. Navigate to the Drill Web UI at `:8047`. The Drillbit process must be running on the node before you connect to the Drill Web UI. + 2. Select **Storage** in the toolbar. + 3. Click **Update** next to the file system for which you want to define a default input format for a workspace. + 4. In the Configuration area, locate the workspace for which you would like to define the default input format, and change the `defaultInputFormat` attribute to any of the supported file types. 
+ + **Example** + + { + "type": "file", + "enabled": true, + "connection": "hdfs:///", + "workspaces": { + "root": { + "location": "/drill/testdata", + "writable": false, + "defaultInputFormat": csv + }, + "local" : { + "location" : "/max/proddata", + "writable" : true, + "defaultInputFormat" : "json" + } \ No newline at end of file http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/007-mongo-plugin.md ---------------------------------------------------------------------- diff --git a/_docs/connect/007-mongo-plugin.md b/_docs/connect/007-mongo-plugin.md new file mode 100644 index 0000000..fd5dba8 --- /dev/null +++ b/_docs/connect/007-mongo-plugin.md @@ -0,0 +1,167 @@ +--- +title: "MongoDB Plugin for Apache Drill" +parent: "Connect to Data Sources" +--- +## Overview + +You can leverage the power of Apache Drill to query data without any upfront +schema definitions. Drill enables you to create an architecture that works +with nested and dynamic schemas, making it the perfect SQL query tool to use +on NoSQL databases, such as MongoDB. + +As of Apache Drill 0.6, you can configure MongoDB as a Drill data source. +Drill provides a mongodb format plugin to connect to MongoDB, and run queries +on the data using ANSI SQL. + +This tutorial assumes that you have Drill installed locally (embedded mode), +as well as MongoDB. Examples in this tutorial use zip code aggregation data +provided by MongoDB. Before You Begin provides links to download tools and data +used throughout the tutorial. + +**Note:** A local instance of Drill is used in this tutorial for simplicity. You can also run Drill and MongoDB together in distributed mode. + +### Before You Begin + +Before you can query MongoDB with Drill, you must have Drill and MongoDB +installed on your machine. You may also want to import the MongoDB zip code +data to run the example queries on your machine. + + 1. [Install Drill](/drill/docs/installing-drill-in-embedded-mode), if you do not already have it installed on your machine. + 2. [Install MongoDB](http://docs.mongodb.org/manual/installation), if you do not already have it installed on your machine. + 3. [Import the MongoDB zip code sample data set](http://docs.mongodb.org/manual/tutorial/aggregation-zip-code-data-set). You can use Mongo Import to get the data. + +## Configuring MongoDB + +Start Drill and configure the MongoDB storage plugin instance in the Drill Web +UI to connect to Drill. Drill must be running in order to access the Web UI. + +Complete the following steps to configure MongoDB as a data source for Drill: + + 1. Navigate to `/drill-,` and enter the following command to invoke SQLLine and start Drill: + + bin/sqlline -u jdbc:drill:zk=local -n admin -p admin + When Drill starts, the following prompt appears: `0: jdbc:drill:zk=local>` + + Do not enter any commands. You will return to the command prompt after +completing the configuration in the Drill Web UI. + 2. Open a browser window, and navigate to the Drill Web UI at `http://localhost:8047`. + 3. In the navigation bar, click **Storage**. + 4. Under Disabled Storage Plugins, select **Update** next to the `mongo` instance if the instance exists. If the instance does not exist, create an instance for MongoDB. + 5. In the Configuration window, verify that `"enabled"` is set to ``"true."`` + + **Example** + + { + "type": "mongo", + "connection": "mongodb://localhost:27017/", + "enabled": true + } + + **Note:** 27017 is the default port for `mongodb` instances. + 6. 
Click **Enable** to enable the instance, and save the configuration. + 7. Navigate back to the Drill command line so you can query MongoDB. + +## Querying MongoDB + +You can issue the `SHOW DATABASES `command to see a list of databases from all +Drill data sources, including MongoDB. If you downloaded the zip codes file, +you should see `mongo.zipdb` in the results. + + 0: jdbc:drill:zk=local> SHOW DATABASES; + +-------------+ + | SCHEMA_NAME | + +-------------+ + | dfs.default | + | dfs.root | + | dfs.tmp | + | sys | + | mongo.zipdb | + | cp.default | + | INFORMATION_SCHEMA | + +-------------+ + +If you want all queries that you submit to run on `mongo.zipdb`, you can issue +the `USE` command to change schema. + +### Example Queries + +The following example queries are included for reference. However, you can use +the SQL power of Apache Drill directly on MongoDB. For more information about, +refer to the [SQL +Reference](/drill/docs/sql-reference). + +**Example 1: View mongo.zipdb Dataset** + + 0: jdbc:drill:zk=local> SELECT * FROM zipcodes LIMIT 10; + +------------+ + | * | + +------------+ + | { "city" : "AGAWAM" , "loc" : [ -72.622739 , 42.070206] , "pop" : 15338 , "state" : "MA"} | + | { "city" : "CUSHMAN" , "loc" : [ -72.51565 , 42.377017] , "pop" : 36963 , "state" : "MA"} | + | { "city" : "BARRE" , "loc" : [ -72.108354 , 42.409698] , "pop" : 4546 , "state" : "MA"} | + | { "city" : "BELCHERTOWN" , "loc" : [ -72.410953 , 42.275103] , "pop" : 10579 , "state" : "MA"} | + | { "city" : "BLANDFORD" , "loc" : [ -72.936114 , 42.182949] , "pop" : 1240 , "state" : "MA"} | + | { "city" : "BRIMFIELD" , "loc" : [ -72.188455 , 42.116543] , "pop" : 3706 , "state" : "MA"} | + | { "city" : "CHESTER" , "loc" : [ -72.988761 , 42.279421] , "pop" : 1688 , "state" : "MA"} | + | { "city" : "CHESTERFIELD" , "loc" : [ -72.833309 , 42.38167] , "pop" : 177 , "state" : "MA"} | + | { "city" : "CHICOPEE" , "loc" : [ -72.607962 , 42.162046] , "pop" : 23396 , "state" : "MA"} | + | { "city" : "CHICOPEE" , "loc" : [ -72.576142 , 42.176443] , "pop" : 31495 , "state" : "MA"} | + +**Example 2: Aggregation** + + 0: jdbc:drill:zk=local> select state,city,avg(pop) + +------------+------------+------------+ + | state | city | EXPR$2 | + +------------+------------+------------+ + | MA | AGAWAM | 15338.0 | + | MA | CUSHMAN | 36963.0 | + | MA | BARRE | 4546.0 | + | MA | BELCHERTOWN | 10579.0 | + | MA | BLANDFORD | 1240.0 | + | MA | BRIMFIELD | 3706.0 | + | MA | CHESTER | 1688.0 | + | MA | CHESTERFIELD | 177.0 | + | MA | CHICOPEE | 27445.5 | + | MA | WESTOVER AFB | 1764.0 | + +------------+------------+------------+ + +**Example 3: Nested Data Column Array** + + 0: jdbc:drill:zk=local> SELECT loc FROM zipcodes LIMIT 10; + +------------------------+ + | loc | + +------------------------+ + | [-72.622739,42.070206] | + | [-72.51565,42.377017] | + | [-72.108354,42.409698] | + | [-72.410953,42.275103] | + | [-72.936114,42.182949] | + | [-72.188455,42.116543] | + | [-72.988761,42.279421] | + | [-72.833309,42.38167] | + | [-72.607962,42.162046] | + | [-72.576142,42.176443] | + +------------------------+ + + 0: jdbc:drill:zk=local> SELECT loc[0] FROM zipcodes LIMIT 10; + +------------+ + | EXPR$0 | + +------------+ + | -72.622739 | + | -72.51565 | + | -72.108354 | + | -72.410953 | + | -72.936114 | + | -72.188455 | + | -72.988761 | + | -72.833309 | + | -72.607962 | + | -72.576142 | + +------------+ + +## Using ODBC/JDBC Drivers + +You can leverage the power of Apache Drill to query MongoDB through standard +BI tools, such as Tableau 
and SQuirreL. + +For information about Drill ODBC and JDBC drivers, refer to [Drill Interfaces](/drill/docs/odbc-jdbc-interfaces). http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/connect/008-mapr-db-plugin.md ---------------------------------------------------------------------- diff --git a/_docs/connect/008-mapr-db-plugin.md b/_docs/connect/008-mapr-db-plugin.md new file mode 100644 index 0000000..f923bce --- /dev/null +++ b/_docs/connect/008-mapr-db-plugin.md @@ -0,0 +1,31 @@ +--- +title: "MapR-DB Plugin for Apache Drill" +parent: "Connect to Data Sources" +--- +Drill includes a `maprdb` format plugin for MapR-DB that is defined within the +default `dfs` storage plugin instance when you install Drill from the `mapr-drill` package on a MapR node. The `maprdb` format plugin improves the +estimated number of rows that Drill uses to plan a query. It also enables you +to query tables like you would query files in a file system because MapR-DB +and MapR-FS share the same namespace. + +You can query tables stored across multiple directories. You do not need to +create a table mapping to a directory before you query a table in the +directory. You can select from any table in any directory the same way you +would select from files in MapR-FS, using the same syntax. + +Instead of including the name of a file, you include the table name in the +query. + +**Example** + + SELECT * FROM mfs.`/users/max/mytable`; + +Drill stores the `maprdb` format plugin in the `dfs` storage plugin instance, +which you can view in the Drill Web UI. You can access the Web UI at +[http://localhost:8047/storage](http://localhost:8047/storage). Click **Update** next to the `dfs` instance +in the Web UI to view the configuration for the `dfs` instance. + +The following image shows a portion of the configuration with the `maprdb` +format plugin for the `dfs` instance: + +![drill query flow]({{ site.baseurl }}/docs/img/18.png) http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/contribute/001-guidelines.md ---------------------------------------------------------------------- diff --git a/_docs/contribute/001-guidelines.md b/_docs/contribute/001-guidelines.md new file mode 100644 index 0000000..686d972 --- /dev/null +++ b/_docs/contribute/001-guidelines.md @@ -0,0 +1,229 @@ +--- +title: "Apache Drill Contribution Guidelines" +parent: "Contribute to Drill" +--- +## How to Contribute to Apache Drill + +Disclaimer: These contribution guidelines are largely based on Apache Hive +contribution guidelines. + +This page describes the mechanics of _how_ to contribute software to Apache +Drill. For ideas about _what_ you might contribute, please see open tickets in +[Jira](https://issues.apache.org/jira/browse/DRILL). + +These guidelines include the following topics: + +* Getting the source code + * Making Changes + * Coding Convention + * Formatter configuration + * Understanding Maven + * Creating a patch + * Applying a patch + * Where is a good place to start contributing? + * Contributing your work + * JIRA Guidelines + * See Also + +### Getting the source code + +First, you need the Drill source code. + +Get the source code on your local drive using [Git](https://git-wip- +us.apache.org/repos/asf/incubator-drill.git). 
Most development is done on +"master": + + git clone https://git-wip-us.apache.org/repos/asf/drill.git + +### Making Changes + +Before you start, send a message to the [Drill developer mailing list](http +://mail-archives.apache.org/mod_mbox/incubator-drill-dev/), or file a bug +report in [JIRA](https://issues.apache.org/jira/browse/DRILL). Describe your +proposed changes and check that they fit in with what others are doing and +have planned for the project. Be patient, it may take folks a while to +understand your requirements. + +Modify the source code and add some features using your favorite IDE. + +#### Coding Convention + +Please take care about the following points + + * All public classes and methods should have informative [Javadoc comments](http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html). + * Do not use @author tags. + * Code should be formatted according to [Sun's conventions](http://www.oracle.com/technetwork/java/codeconv-138413.html), with one exception: + * Indent two (2) spaces per level, not four (4). + * Line length limit is 120 chars, instead of 80 chars. + * Contributions should not introduce new Checkstyle violations. + * Contributions should pass existing unit tests. + * New unit tests should be provided to demonstrate bugs and fixes. [JUnit](http://www.junit.org) 4.1 is our test framework: + * You must implement a class that contain test methods annotated with JUnit's 4.x @Test annotation and whose class name ends with `Test`. + * Define methods within your class whose names begin with `test`, and call JUnit's many assert methods to verify conditions; these methods will be executed when you run `mvn clean test`. + +#### Formatter configuration + +Setting up IDE formatters is recommended and can be done by importing the +following settings into your browser: + +IntelliJ IDEA formatter: [settings +jar](/confluence/download/attachments/30757399/idea- +settings.jar?version=1&modificationDate=1363022308000&api=v2) + +Eclipse: [formatter xml from HBase](https://issues.apache.org/jira/secure/atta +chment/12474245/eclipse_formatter_apache.xml) + +#### Understanding Maven + +Drill is built by Maven, a Java build tool. + + * Good Maven tutorial: + +To build Drill, run + + mvn clean install + + +#### Creating a patch + +Check to see what files you have modified: + + git status + +Add any new files with: + + git add .../MyNewClass.java + git add .../TestMyNewClass.java + git add .../XXXXXX.q + git add .../XXXXXX.q.out + +In order to create a patch, type (from the base directory of drill): + + git format-patch origin/master --stdout > DRILL-1234.1.patch.txt + +This will report all modifications done on Drill sources on your local disk +and save them into the _DRILL-1234.1.patch.txt_ file. Read the patch file. +Make sure it includes ONLY the modifications required to fix a single issue. + +Please do not: + + * reformat code unrelated to the bug being fixed: formatting changes should be separate patches/commits. + * comment out code that is now obsolete: just remove it. + * insert comments around each change, marking the change: folks can use subversion to figure out what's changed and by whom. + * make things public which are not required by end users. + +Please do: + + * try to adhere to the coding style of files you edit; + * comment code whose function or rationale is not obvious; + * update documentation (e.g., _package.html_ files, this wiki, etc.) 
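Before attaching a patch, it is worth one last pass through the checks above; this is only a sketch, and DRILL-1234 is the same placeholder issue number used earlier:

    git status                      # confirm only the intended files are modified
    mvn clean install               # existing unit tests should still pass
    git format-patch origin/master --stdout > DRILL-1234.1.patch.txt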

#### Updating a patch

For patch updates, our convention is to number them like DRILL-1856.1.patch.txt, DRILL-1856.2.patch.txt, and so on. Then click the "Submit Patch" button again when a new one is uploaded; this ensures it gets back into the review queue. Appending '.txt' to the patch file name makes it easy to quickly view the contents of the patch in a web browser.

#### Applying a patch

To apply a patch that you either generated or found in JIRA, you can issue:

    git am < cool_patch.patch

If you just want to check whether the patch applies, you can run `patch` with the `--dry-run` option.

#### Review Process

  * Use Hadoop's [code review checklist](http://wiki.apache.org/hadoop/CodeReviewChecklist) as a rough guide when doing reviews.
  * In JIRA, use **Attach File** to indicate that you've submitted a patch for that issue.
  * Create a Review Request in [Review Board](https://reviews.apache.org/r/). The review request's name should start with the JIRA issue number (e.g. DRILL-XX) and should be assigned to the "drill-git" group.
  * If a committer requests changes, set the issue status to 'Resume Progress'; once you're ready, submit an updated patch with the necessary fixes and then request another round of review with 'Submit Patch' again.
  * Once your patch is accepted, be sure to upload a final version which grants rights to the ASF.

### Where is a good place to start contributing?

After getting the source code, building it, and running a few simple queries, one of the simplest places to start is to implement a DrillFunc. DrillFuncs are the way Drill expresses all scalar functions (UDF or system). First, file a JIRA for a DrillFunc that Drill does not yet have but should (referencing the capabilities of something like Postgres or SQL Server). Then try to implement one. A hedged sketch of a trivial DrillFunc appears further down this page.

One example DrillFunc:

[https://github.com/apache/incubator-drill/blob/103072a619741d5e228fdb181501ec2f82e111a3/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ComparisonFunctions.java](https://github.com/apache/incubator-drill/blob/103072a619741d5e228fdb181501ec2f82e111a3/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ComparisonFunctions.java)

You can also visit the JIRA issues and implement one of those. A list of functions which need to be implemented can be found [here](https://docs.google.com/spreadsheet/ccc?key=0AgAGbQ6asvQ-dDRrUUxVSVlMVXRtanRoWk9JcHgteUE&usp=sharing#gid=0) (WIP).

More contribution ideas are located on the [Contribution Ideas](/drill/docs/apache-drill-contribution-ideas) page.

### Contributing your work

Finally, patches should be _attached_ to an issue report in [JIRA](http://issues.apache.org/jira/browse/DRILL) via the **Attach File** link on the issue's JIRA. Please add a comment that asks for a code review. Please note that the attachment should be granted license to the ASF for inclusion in ASF works (as per the [Apache License](http://www.apache.org/licenses/LICENSE-2.0)).

Folks should run `mvn clean install` before submitting a patch. Tests should all pass. If your patch involves performance optimizations, they should be validated by benchmarks that demonstrate an improvement.
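As a follow-up to the DrillFunc pointer in "Where is a good place to start contributing?" above, here is a rough, hedged sketch of what a trivial scalar function can look like. It is modeled loosely on the ComparisonFunctions.java example linked above; the function name `double_it`, the class name, and the exact annotation and holder names are assumptions, and the `DrillSimpleFunc` interface has changed between Drill versions, so check the current source tree before copying it:

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.Float8Holder;

    // Hypothetical scalar function double_it(x) that doubles a FLOAT8 value.
    @FunctionTemplate(name = "double_it", scope = FunctionScope.SIMPLE,
        nulls = NullHandling.NULL_IF_NULL)
    public class DoubleItFunction implements DrillSimpleFunc {

      @Param  Float8Holder in;   // input value read from the incoming batch
      @Output Float8Holder out;  // result handed back to the query

      public void setup() {
        // No one-time setup is needed for this function.
      }

      public void eval() {
        out.value = in.value * 2.0;
      }
    }

Once registered and built into Drill, such a function would be callable from SQL like any other scalar function, for example `SELECT double_it(some_column) FROM ...`.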
+ +If your patch creates an incompatibility with the latest major release, then +you must set the **Incompatible change** flag on the issue's JIRA 'and' fill +in the **Release Note** field with an explanation of the impact of the +incompatibility and the necessary steps users must take. + +If your patch implements a major feature or improvement, then you must fill in +the **Release Note** field on the issue's JIRA with an explanation of the +feature that will be comprehensible by the end user. + +A committer should evaluate the patch within a few days and either: commit it; +or reject it with an explanation. + +Please be patient. Committers are busy people too. If no one responds to your +patch after a few days, please make friendly reminders. Please incorporate +other's suggestions into your patch if you think they're reasonable. Finally, +remember that even a patch that is not committed is useful to the community. + +Should your patch receive a "-1" select the **Resume Progress** on the issue's +JIRA, upload a new patch with necessary fixes, and then select the **Submit +Patch** link again. + +Committers: for non-trivial changes, it is best to get another committer to +review your patches before commit. Use **Submit Patch** link like other +contributors, and then wait for a "+1" from another committer before +committing. Please also try to frequently review things in the patch queue. + +### JIRA Guidelines + +Please comment on issues in JIRA, making their concerns known. Please also +vote for issues that are a high priority for you. + +Please refrain from editing descriptions and comments if possible, as edits +spam the mailing list and clutter JIRA's "All" display, which is otherwise +very useful. Instead, preview descriptions and comments using the preview +button (on the right) before posting them. Keep descriptions brief and save +more elaborate proposals for comments, since descriptions are included in +JIRA's automatically sent messages. If you change your mind, note this in a +new comment, rather than editing an older comment. The issue should preserve +this history of the discussion. + +### See Also + + * [Apache contributor documentation](http://www.apache.org/dev/contributors.html) + * [Apache voting documentation](http://www.apache.org/foundation/voting.html) + http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/contribute/002-ideas.md ---------------------------------------------------------------------- diff --git a/_docs/contribute/002-ideas.md b/_docs/contribute/002-ideas.md new file mode 100644 index 0000000..2270112 --- /dev/null +++ b/_docs/contribute/002-ideas.md @@ -0,0 +1,158 @@ +--- +title: "Apache Drill Contribution Ideas" +parent: "Contribute to Drill" +--- + * Fixing JIRAs + * SQL functions + * Support for new file format readers/writers + * Support for new data sources + * New query language parsers + * Application interfaces + * BI Tool testing + * General CLI improvements + * Eco system integrations + * MapReduce + * Hive views + * YARN + * Spark + * Hue + * Phoenix + +## Fixing JIRAs + +This is a good place to begin if you are new to Drill. Feel free to pick +issues from the Drill JIRA list. When you pick an issue, assign it to +yourself, inform the team, and start fixing it. + +For any questions, seek help from the team by sending email to [drill- +dev@incubator.apache.org](mailto:drill-dev@incubator.apache.org). 

[https://issues.apache.org/jira/browse/DRILL/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel](https://issues.apache.org/jira/browse/DRILL/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel)

## SQL functions

One of the next simplest places to start is to implement a DrillFunc.
DrillFuncs are the way Drill expresses all scalar functions (UDF or system). First, file a JIRA for a DrillFunc that Drill does not yet have but should (referencing the capabilities of something like Postgres
or SQL Server or your own use case). Then try to implement one.

One example DrillFunc:
[https://github.com/apache/incubator-drill/blob/103072a619741d5e228fdb181501ec2f82e111a3/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ComparisonFunctions.java](https://github.com/apache/incubator-drill/blob/103072a619741d5e228fdb181501ec2f82e111a3/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/ComparisonFunctions.java)

**Additional ideas on functions that can be added to SQL support**

  * Madlib integration
  * Machine learning functions
  * Approximate aggregate functions (such as what is available in BlinkDB)

## Support for new file format readers/writers

Currently Drill supports the text, JSON, and Parquet file formats natively when interacting with the file system. More readers/writers can be introduced by implementing custom storage plugins. Example formats include:

  * AVRO
  * Sequence
  * RC
  * ORC
  * Protobuf
  * XML
  * Thrift
  * ....

## Support for new data sources

Implement custom storage plugins for the following non-Hadoop data sources:

  * NoSQL databases (such as Mongo, Cassandra, Couch, etc.)
  * Search engines (such as Solr, Lucidworks, Elasticsearch, etc.)
  * SQL databases (MySQL, Postgres, etc.)
  * Generic JDBC/ODBC data sources
  * HTTP URL
  * ....

## New query language parsers

Drill exposes strongly typed JSON APIs for logical and physical plans (plan syntax at [https://docs.google.com/a/maprtech.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit#heading=h.n9gdb1ek71hf](https://docs.google.com/a/maprtech.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit#heading=h.n9gdb1ek71hf)). Drill provides a SQL language parser today, but any language parser that can generate logical/physical plans can use Drill's power on the backend as the distributed low latency query execution engine, along with its support for self-describing data and complex/multi-structured data.

  * Pig parser: Use Pig as the language to query data from Drill. Great for existing Pig users.
  * Hive parser: Use HiveQL as the language to query data from Drill. Great for existing Hive users.

## Application interfaces

Drill currently provides JDBC/ODBC drivers for applications to interact with, along with a basic REST API and a C++ API. The following list provides a few possible application interface opportunities:

  * Enhancements to the REST APIs
  * Expose Drill tables/views as REST APIs
  * Language drivers for Drill (Python, etc.)
  * Thrift support
  * ....

### BI Tool testing

Drill provides JDBC/ODBC drivers to connect to BI tools. We need to make sure Drill works with all major BI tools. Doing a quick sanity test with your favorite BI tool is a good way to learn Drill and also to uncover issues along the way. A hedged JDBC sketch appears further down this page.

## General CLI improvements

Currently Drill uses SQLLine as the CLI. The goal of this effort is to improve the CLI experience by adding functionality such as executing statements from a file, outputting results to a file, displaying version information, and so on.

## Eco system integrations

### MapReduce

Allow using the result set from Drill queries as input to Hadoop/MapReduce jobs.

### Hive views

Query data from existing Hive views using Drill queries. Drill needs to parse the HiveQL and translate it appropriately (into Drill's SQL or logical/physical plans) to execute the requests.
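As background for the application-interface ideas above: the JDBC driver that Drill already ships is the usual starting point both for BI tool testing and for building new language drivers. The following is a minimal, hedged sketch of a JDBC client; the connection URL, the local ZooKeeper address, and the `employee.json` sample on the `cp` (classpath) storage plugin are assumptions that may need adjusting for a given installation:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillJdbcExample {
      public static void main(String[] args) throws Exception {
        // Assumed URL: a Drillbit registered in a local ZooKeeper quorum.
        // A direct connection such as jdbc:drill:drillbit=localhost may also work.
        String url = "jdbc:drill:zk=localhost:2181";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT full_name FROM cp.`employee.json` LIMIT 5")) {
          while (rs.next()) {
            System.out.println(rs.getString("full_name"));
          }
        }
      }
    }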

### YARN

[DRILL-1170](https://issues.apache.org/jira/browse/DRILL-1170)

### Spark

Provide the ability to invoke Drill queries from Apache Spark programs. This lets Spark developers and users leverage the richness of Drill's query layer, both for data source access and as a low latency execution engine.

### Hue

Hue is a GUI for users to interact with various Hadoop ecosystem components (such as Hive, Oozie, Pig, HBase, Impala, and so on). The goal of this project is to expose Drill as an application inside Hue so users can explore Drill metadata and run SQL queries.

### Phoenix

Phoenix provides a low latency query layer on HBase for operational applications. The goal of this effort is to explore opportunities for integrating Phoenix with Drill.

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/datasets/001-aol.md
----------------------------------------------------------------------
diff --git a/_docs/datasets/001-aol.md b/_docs/datasets/001-aol.md
new file mode 100644
index 0000000..472f52f
--- /dev/null
+++ b/_docs/datasets/001-aol.md
@@ -0,0 +1,47 @@
---
title: "AOL Search"
parent: "Sample Datasets"
---
## Quick Stats

The [AOL Search dataset](http://en.wikipedia.org/wiki/AOL_search_data_leak) is a collection of real query log data based on real users.

## The Data Source

The dataset consists of 20M Web queries from 650k users over a period of three months, 440MB in total and available [for download](http://zola.di.unipi.it/smalltext/datasets.html). The format used in the dataset is:

    AnonID, Query, QueryTime, ItemRank, ClickURL

... with:

  * AnonID, an anonymous user ID number.
  * Query, the query issued by the user, case shifted with most punctuation removed.
  * QueryTime, the time at which the query was submitted for search.
  * ItemRank, if the user clicked on a search result, the rank of the item on which they clicked is listed.
  * ClickURL, if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.

Each line in the data represents one of two types of events:

  * A query that was NOT followed by the user clicking on a result item.
  * A click-through on an item in the result list returned from a query.

In the first case (query only) there is data in only the first three columns; in the second case (click-through), there is data in all five columns. For click-through events, the query that preceded the click-through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events.

## The Queries

Interesting queries include, for example:

  * Users querying for topic X
  * Users that click on the first (second, third) ranked item
  * TOP 10 domains searched
  * TOP 10 domains clicked

A hedged sketch of the last of these queries appears after the Enron quick stats below.

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/datasets/002-enron.md
----------------------------------------------------------------------
diff --git a/_docs/datasets/002-enron.md b/_docs/datasets/002-enron.md
new file mode 100644
index 0000000..9883382
--- /dev/null
+++ b/_docs/datasets/002-enron.md
@@ -0,0 +1,19 @@
---
title: "Enron Emails"
parent: "Sample Datasets"
---
## Quick Stats

The [Enron Email dataset](http://www.cs.cmu.edu/~enron/) contains data from about 150 users, mostly senior management of Enron.
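To give a flavor of the "interesting queries" listed for the AOL dataset above, here is a hedged sketch of the "TOP 10 domains clicked" idea, issued through Drill's JDBC driver. The ZooKeeper address, the file path, and the assumption that the log has been extracted as tab-separated text readable through the `dfs` plugin (so that Drill exposes the fields via the `columns` array) are all hypothetical and would need adapting:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class AolTopClickedDomains {
      public static void main(String[] args) throws Exception {
        // Hypothetical: a Drillbit reachable through a local ZooKeeper quorum.
        String url = "jdbc:drill:zk=localhost:2181";

        // columns[4] is ClickURL in the AnonID, Query, QueryTime, ItemRank, ClickURL
        // layout described above; click-through rows are those where it is non-empty.
        String sql =
            "SELECT columns[4] AS domain, COUNT(*) AS clicks "
          + "FROM dfs.`/data/aol/user-ct-test-collection-01.txt` "   // hypothetical path
          + "WHERE columns[4] <> '' "
          + "GROUP BY columns[4] "
          + "ORDER BY clicks DESC "
          + "LIMIT 10";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
          while (rs.next()) {
            System.out.println(rs.getString("domain") + "\t" + rs.getLong("clicks"));
          }
        }
      }
    }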

## The Data Source

Totalling some 500,000 messages, the [raw data](http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz) (2009 version of the dataset; ~423MB) is available for download, as well as a [MySQL dump](ftp://ftp.isi.edu/sims/philpot/data/enron-mysqldump.sql.gz) (~177MB).

## The Queries

For example queries, see the [Query Dataset for Email Search](https://dbappserv.cis.upenn.edu/spell/).
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/datasets/003-wikipedia.md
----------------------------------------------------------------------
diff --git a/_docs/datasets/003-wikipedia.md b/_docs/datasets/003-wikipedia.md
new file mode 100644
index 0000000..93da4bf
--- /dev/null
+++ b/_docs/datasets/003-wikipedia.md
@@ -0,0 +1,105 @@
---
title: "Wikipedia Edit History"
parent: "Sample Datasets"
---
## Quick Stats

The Wikipedia Edit History is a public dump of the website made available by the Wikimedia Foundation. You can find details [here](http://en.wikipedia.org/wiki/Wikipedia:Database_download). The dumps are made available as SQL or XML dumps. You can find the entire schema drawn together in this great [diagram](http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png).

## Approach

The _main_ distribution files are:

  * Current Pages: As of January 2013 this SQL dump was 9.0GB in its compressed format.
  * Complete Archive: This is what we actually want, but at a size of multiple terabytes it clearly exceeds the storage available at home.

To get some real historic data, it is recommended to download a _Special Export_ using this [link](http://en.wikipedia.org/w/index.php?title=Special:Export). Using this tool you can generate a category-specific XML dump and configure various export options. There are some limits, such as a maximum of 1000 revisions per export, but otherwise this should work out just fine.

![drill query flow]({{ site.baseurl }}/docs/img/Overview.png)

The entities used in the query use cases.

## Use Cases

### Select Change Volume Based on Time

**Query**

    select rev.::parent.title, rev.::parent.id, sum(rev.text.bytes)
    from mediawiki.page.revision as rev
    where rev.timestamp.between(?, ?)
    group by rev.::parent;

_Explanation_: This is my attempt at mixing records and structures. The `from` statement refers to `mediawiki` as a record type / row, but also mixes in structural information, i.e. `page.revision`, internal to the record. The query now uses `page.revision` as the base for all other statements, in this case the `select`, `where` and the `group by`. The `where` statement again uses a JSON-like expression to state that the timestamp must be between two values; parameters are written as question marks, similar to JDBC. The `group by` statement instructs the query to aggregate results based on the parent of a `revision`, in this case a `page`. The `::parent` syntax is borrowed from XPath. As we are aggregating on `page`, it is safe to select the `title` and `id` from that element in the `select`. We also use an aggregation function to add up the number of bytes changed in the given time frame; this should be self-explanatory.

_Discussion_:

  * I am not very satisfied with the `::` syntax, as it is _ugly_. We probably won't need that many axis specifiers; e.g.
we don't need any attribute specifiers, but for now I could not think of anything better.
  * Using an `as` expression in the `from` statement is optional; you would simply have to replace all references to `rev` with `revision`.
  * I am not sure if this is desired, but you cannot see at first glance where the _hierarchical_ stuff starts. This may be confusing to an RDBMS purist; at least it was for me at the beginning. But now I think this strikes the right mix between verbosity and elegance.
  * I assume we would need some good indexing, but this should be achievable. We would need to translate the relative index `rev.timestamp` to a record-absolute index `$.mediawiki.page.revision.timestamp`. What is unclear to me now is whether the index would point to the record, or to some kind of record substructure.

### Select Change Volume Aggregated on Time

**Query**

    select rev.::parent.title, rev.::parent.id, sum(rev.text.bytes), rev.timestamp.monthYear()
    from mediawiki.page.revision as rev
    where rev.timestamp.between(?, ?)
    group by rev.::parent, rev.timestamp.monthYear()
    order by rev.::parent.id, rev.timestamp.monthYear();

_Explanation_: This is a refinement of the previous query. In this case we are again returning a flat list, but are using an additional scalar result and `group` statement. In the previous example we were returning one result per found page; now we are returning one result per page and month of changes. `Order by` is nothing special in this case.

_Discussion_:

  * I always considered MySQL's implicit group by statements confusing, as I prefer fail-fast mechanisms. Hence I would opt for explicit `group by` operators.
  * I would not introduce implicit nodes into the records, i.e. if you want some attribute of a timestamp, call a function rather than expect an automatically added element. So we want `rev.timestamp.monthYear()` and not `rev.timestamp.monthYear`. This may be quite confusing, especially if we have heterogeneous record structures. We might even go ahead and support namespaces for custom, experimental features like `rev.timestamp.custom.maya:doomsDay()`.

### Select Change Volume Based on Contributor

**Query**

    select ctbr.username, ctbr.ip, ctbr.userid, sum(ctbr.::parent.bytes) as bytesContributed
    from mediawiki.page..contributor as ctbr
    group by ctbr.canonize()
    order by bytesContributed;

_Explanation_: This query looks quite similar to the previous queries, but I added this one nonetheless, as it hints at an aggregation which may spawn multiple records. The previous examples were based on pages, which are unique to a record, whereas the contributor may appear many times in many different records.

_Discussion_:

  * I have added the `..` operator in this example. Besides being syntactic sugar, it also allows us to search for `revision` and `upload`, which are both children of `page` and may both have a `contributor`. The more RDBMS-like alternative would be a `union`, but this was not natural enough.
  * I am sure the `ctbr.canonize()` will cause lots of discussion :-). The thing is that a contributor may repeat itself in many different records, and we don't really have an id. If you look at the wikimedia XSD, all three attributes are optional, and the data says the same, so we cannot simply say `ctbr.userid`. Hence the canonize function should create a scalar value containing all available information of the node in a canonical form.
  * Last but not least, I always hated that MySQL would not be able to reuse column definitions from the `select` statement in the `order` statement. So I added to my wishlist that the `bytesContributed` definition be reusable.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/design/001-plan.md
----------------------------------------------------------------------
diff --git a/_docs/design/001-plan.md b/_docs/design/001-plan.md
new file mode 100644
index 0000000..67e2290
--- /dev/null
+++ b/_docs/design/001-plan.md
@@ -0,0 +1,25 @@
---
title: "Drill Plan Syntax"
parent: "Design Docs"
---
### What's the plan?

This section is about the end-to-end plan flow for Drill. The incoming query to Drill can be a SQL 2003 query/DrQL or MongoQL. The query is converted to a _Logical Plan_, which is Drill's internal, language-agnostic representation of the query. Drill then applies its optimization rules to the Logical Plan to optimize it for best performance and produces a _Physical Plan_. The Physical Plan is the actual plan that Drill then executes for the final data processing. Below is a diagram to illustrate the flow:

![drill query flow]({{ site.baseurl }}/docs/img/slide-15-638.png)

**The Logical Plan** describes the abstract data flow of a language-independent query; that is, it is a representation of the input query that does not depend on the actual input query language. It generally tries to work with primitive operations without a focus on optimization. This makes it more verbose than traditional query languages, which allows a substantial level of flexibility in defining higher-level query language features. It is forwarded to the optimizer to obtain a physical plan.

**The Physical Plan** is often called the execution plan, since it is the input to the execution engine. It is a description of the physical operations the execution engine will undertake to get the desired result. It is the output of the query planner and is a transformation of the logical plan after applying the optimization rules.

Typically, the physical and execution plans are represented using the same JSON format as the logical plan.

**Detailed document**: Here is a document that explains the Drill logical and physical plans in full detail: [Drill detailed plan syntax document](https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit).

http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/design/002-rpc.md
----------------------------------------------------------------------
diff --git a/_docs/design/002-rpc.md b/_docs/design/002-rpc.md
new file mode 100644
index 0000000..05cb1d6
--- /dev/null
+++ b/_docs/design/002-rpc.md
@@ -0,0 +1,19 @@
---
title: "RPC Overview"
parent: "Design Docs"
---
Drill leverages the Netty 4 project as an RPC underlayment. From there, we built a simple protobuf-based communication layer optimized to minimize the requirement for on-heap data transformations. Both client and server use the CompleteRpcMessage protobuf envelope to communicate requests, responses, and errors. The communication model is that each endpoint sends a stream of CompleteRpcMessages to its peer. The CompleteRpcMessage is prefixed by a protobuf-encoded length.

CompleteRpcMessage is broken into three key components: RpcHeader, Protobuf Body (bytes), and RawBody (bytes).

RpcHeader has the following fields:

Drillbits communicate through the BitCom intermediary. BitCom manages...
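The length-prefixed protobuf framing described above is a common Netty 4 pattern. The sketch below shows only that generic framing idea using Netty's stock codecs; it is not Drill's actual pipeline, which wires up its own handlers around the CompleteRpcMessage envelope:

    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.handler.codec.protobuf.ProtobufEncoder;
    import io.netty.handler.codec.protobuf.ProtobufVarint32FrameDecoder;
    import io.netty.handler.codec.protobuf.ProtobufVarint32LengthFieldPrepender;

    // Illustrative Netty 4 pipeline for messages prefixed with a protobuf varint length.
    public class LengthPrefixedProtobufInitializer extends ChannelInitializer<SocketChannel> {

      @Override
      protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          // Inbound: split the byte stream into frames using the varint length prefix.
          // A protobuf decoder for the envelope type would follow here, e.g.
          // new ProtobufDecoder(CompleteRpcMessage.getDefaultInstance()).
          .addLast(new ProtobufVarint32FrameDecoder())
          // Outbound: prepend the varint length and serialize outgoing protobuf messages.
          .addLast(new ProtobufVarint32LengthFieldPrepender())
          .addLast(new ProtobufEncoder());
      }
    }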
+ http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/design/003-query-stages.md ---------------------------------------------------------------------- diff --git a/_docs/design/003-query-stages.md b/_docs/design/003-query-stages.md new file mode 100644 index 0000000..5c54249 --- /dev/null +++ b/_docs/design/003-query-stages.md @@ -0,0 +1,42 @@ +--- +title: "Query Stages" +parent: "Design Docs" +--- +## Overview + +Apache Drill is a system for interactive analysis of large-scale datasets. It +was designed to allow users to query across multiple large big data systems +using traditional query technologies such as SQL. It is built as a flexible +framework to support a wide variety of data operations, query languages and +storage engines. + +## Query Parsing + +A Drillbit is capable of parsing a provided query into a logical plan. In +theory, Drill is capable of parsing a large range of query languages. At +launch, this will likely be restricted to an enhanced SQL2003 language. + +## Physical Planning + +Once a query is parsed into a logical plan, a Drillbit will then translate the +plan into a physical plan. The physical plan will then be optimized for +performance. Since plan optimization can be computationally intensive, a +distributed in-memory cache will provide LRU retrieval of previously generated +optimized plans to speed query execution. + +## Execution Planning + +Once a physical plan is generated, the physical plan is then rendered into a +set of detailed executional plan fragments (EPFs). This rendering is based on +available resources, cluster load, query priority and detailed information +about data distribution. In the case of large clusters, a subset of nodes will +be responsible for rendering the EPFs. Shared state will be managed through +the use of a distributed in-memory cache. + +## Execution Operation + +Query execution starts with each Drillbit being provided with one or more EPFs +associated with query execution. A portion of these EPFs may be identified as +initial EPFs and thus they are executed immediately. Other EPFs are executed +as data flows into them. + http://git-wip-us.apache.org/repos/asf/drill/blob/d959a210/_docs/design/004-research.md ---------------------------------------------------------------------- diff --git a/_docs/design/004-research.md b/_docs/design/004-research.md new file mode 100644 index 0000000..77be828 --- /dev/null +++ b/_docs/design/004-research.md @@ -0,0 +1,48 @@ +--- +title: "Useful Research" +parent: "Design Docs" +--- +## Drill itself + + * Apache Proposal: + * Mailing List Archive: + * DrQL ANTLR grammar: + * Apache Drill, Architecture outlines: + +## Background info + + * Dremel Paper: + * Dremel Presentation: + * Query Language: + * Protobuf: + * Dryad: + * SQLServer Query Plan: + * CStore: + * Vertica (commercial evolution of C-Store): + * + * + * + * Hive Architecture: + * Fast Response in an unreliable world: + * Column-Oriented Database Systems: (SLIDES: ) + +## OpenDremel + + * OpenDremel site: + * Design Proposal for Drill: + +## Dazo (second generation OpenDremel) + + * Dazo repos: + * ZeroVM (multi-tenant executor): + * ZeroVM elaboration: + +## Rob Grzywinski Dremel adventures + + * + +## Code generation / Physical plan generation + + * (SLIDES: ) + * (SLIDES: ) +