From: tshiran@apache.org
To: commits@drill.apache.org
Date: Mon, 18 May 2015 23:36:30 -0000
Message-Id: <7b9f5352b7d647f29e84c7791e4f546b@git.apache.org>
In-Reply-To: <408f81be21da42ada717f5460685b641@git.apache.org>
References: <408f81be21da42ada717f5460685b641@git.apache.org>
Subject: [07/31] drill git commit: add perf tune structure and links, drillbit_queries.json, remove videos

http://git-wip-us.apache.org/repos/asf/drill/blob/cd0a0e96/_docs/getting-started/020-why-drill.md
----------------------------------------------------------------------
diff --git a/_docs/getting-started/020-why-drill.md b/_docs/getting-started/020-why-drill.md
index d00d882..2e57b88 100644
--- a/_docs/getting-started/020-why-drill.md
+++ b/_docs/getting-started/020-why-drill.md
@@ -7,27 +7,25 @@ parent: "Getting Started"


 ### 1. Get started in minutes
-It only takes a couple of minutes to start working with Drill. Untar it on your Mac or Windows laptop and run a query on a local file. No need to set up any infrastructure. No need to define schemas. Just point at the data and drill!
+It takes a couple of minutes to start working with Drill. Untar the Drill software on your Mac or Windows laptop and run a query on a local file. No need to set up any infrastructure or to define schemas. Just point to the data, such as data in a file, directory, HBase table, and drill.

     $ tar -xvf apache-drill-.tar.gz
-    $ /bin/sqlline -u jdbc:drill:zk=local
-    0: jdbc:drill:zk=local> USE cp;
-    0: jdbc:drill:zk=local> SELECT * FROM employee.json limit 5;
-    +-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+-----
-    | employee_id | full_name | first_name | last_name | position_id | position_title | store_id | department_id | birt
-    +-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+------+
-    | 1 | Sheri Nowmer | Sheri | Nowmer | 1 | President | 0 | 1 | 19
-    | 2 | Derrick Whelply | Derrick | Whelply | 2 | VP Country Manager | 0 | 1 |
-    | 4 | Michael Spence | Michael | Spence | 2 | VP Country Manager | 0 | 1 |
-    | 5 | Maya Gutierrez | Maya | Gutierrez | 2 | VP Country Manager | 0 | 1 |
-    | 6 | Roberta Damstra | Roberta | Damstra | 3 | VP Information Systems | 0 | 2 |
-    +-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+-----
+    $ /bin/drill-embedded
+    0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` LIMIT 5;
+    +--------------+----------------------------+---------------------+---------------+--------------+----------------------------+-----------+----------------+-------------+------------------------+----------+----------------+----------------------+-----------------+---------+-----------------------+
+    | employee_id | full_name | first_name | last_name | position_id | position_title | store_id | department_id | birth_date | hire_date | salary | supervisor_id | education_level | marital_status | gender | management_role |
+    +--------------+----------------------------+---------------------+---------------+--------------+----------------------------+-----------+----------------+-------------+------------------------+----------+----------------+----------------------+-----------------+---------+-----------------------+
+    | 1 | Sheri Nowmer | Sheri | Nowmer | 1 | President | 0 | 1 | 1961-08-26 | 1994-12-01 00:00:00.0 | 80000.0 | 0 | Graduate Degree | S | F | Senior Management |
+    | 2 | Derrick Whelply | Derrick | Whelply | 2 | VP Country Manager | 0 | 1 | 1915-07-03 | 1994-12-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | M | M | Senior Management |
+    | 4 | Michael Spence | Michael | Spence | 2 | VP Country Manager | 0 | 1 | 1969-06-20 | 1998-01-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | S | M | Senior Management |
+    | 5 | Maya Gutierrez | Maya | Gutierrez | 2 | VP Country Manager | 0 | 1 | 1951-05-10 | 1998-01-01 00:00:00.0 | 35000.0 | 1 | Bachelors Degree | M | F | Senior Management |
+

 ## 2. Schema-free JSON model
-Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. Instead of spending weeks or months defining schemas, transforming data (ETL) and maintaining those schemas, simply point Drill at your data (file, directory, HBase table, etc.) and run your queries. Drill automatically understands the structure of the data. Drill's self-service approach reduces the burden on IT and increases the productivity and agility of analysts and developers.
+Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. No need to define and maintain schemas or transform data (ETL). Drill automatically understands the structure of the data.

 ## 3. Query complex, semi-structured data in-situ
-Drill's schema-free JSON model allows you to query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to access nested elements and arrays:
+Using Drill's schema-free JSON model, you can query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to access nested elements and arrays:

     SELECT * FROM (SELECT t.trans_id,
                           t.trans_info.prod_id[0] AS prod_id,
@@ -56,7 +54,7 @@ Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" l
     ORDER BY o.o_orderpriority;

 ## 5. Leverage standard BI tools
-Drill works with standard BI tools. You can keep using the tools you love, such as Tableau, MicroStrategy, QlikView and Excel. No need to introduce yet another visualization or dashboard tool. Combine a self-service BI tool with the only self-service SQL engine to enable true self-service data exploration.
+Drill works with standard BI tools. You can use your existing tools, such as Tableau, MicroStrategy, QlikView and Excel.

 ## 6. Interactive queries on Hive tables
 Apache Drill lets you leverage your investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables associated with different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files. Here's a simple query in Drill on a Hive table:
@@ -68,7 +66,7 @@ Apache Drill lets you leverage your investments in Hive. You can run interactive


 ## 7. Access multiple data sources
-Drill is designed with extensibility in mind. It provides out-of-the-box connectivity to file systems (local or distributed file systems such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions. Here's a query that combines data from a Hive table, an HBase table (view) and a JSON file:
+Drill is extensible. You can connect Drill out-of-the-box to file systems (local or distributed, such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions. Here's a query that combines data from a Hive table, an HBase table (view) and a JSON file:

     SELECT custview.membership, sum(orders.order_total) AS sales
     FROM hive.orders, custview, dfs.`clicks/clicks.json` c
@@ -81,7 +79,7 @@ Drill exposes a simple and high-performance Java API to build custom functions (


 ## 9. High performance
-Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is able to deliver its unparalleled flexibility (schema-free JSON model) without compromising performance. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down (the ability to push down query fragments into the back-end data sources). Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.
+Drill is designed from the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is flexible (schema-free JSON model) and performant. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down, which is the capability to push down query fragments into the back-end data sources. Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.

 ## 10. Scales from a single laptop to a 1000-node cluster
-Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, simply deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.
+Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.

http://git-wip-us.apache.org/repos/asf/drill/blob/cd0a0e96/_docs/performance-tuning/010-performance-tuning-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/010-performance-tuning-introduction.md b/_docs/performance-tuning/010-performance-tuning-introduction.md
new file mode 100644
index 0000000..1dffb56
--- /dev/null
+++ b/_docs/performance-tuning/010-performance-tuning-introduction.md
@@ -0,0 +1,17 @@
+---
+title: "Performance Tuning Introduction"
+parent: "Performance Tuning"
+---
+You can apply performance tuning measures to improve how efficiently Drill queries data. To significantly improve performance in Drill, you must have knowledge about the underlying data and data sources, as well as familiarity with how Drill executes queries.
+
+You can analyze query plans and profiles to identify the source of performance issues in Drill. Once you have isolated the source of an issue, you can apply the following tuning techniques to improve query performance:
+
+* Modify query planning options
+* Modify broadcast join options
+* Switch between 1 or 2 phase aggregation
+* Enable/disable hash-based memory-constrained operators
+* Enable query queuing
+* Control parallelization
+* Organize data for partition pruning
+* Change storage formats
+* Disable Logging (See Logging and Debugging)
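
Note (not part of the commit): several of the tuning techniques listed in the new performance-tuning introduction correspond to Drill planner and execution options. The fragment below is a minimal sketch of how they might be set from a SQL session. It assumes the option names shown are available in your Drill release. Every value is illustrative, not a recommendation, and the mapping of bullets to options is an interpretation rather than something the commit states.

```sql
-- Illustrative values only; confirm each option exists in sys.options
-- on your Drill version before relying on it.

-- Modify broadcast join options
ALTER SESSION SET `planner.enable_broadcast_join` = true;
ALTER SESSION SET `planner.broadcast_threshold` = 1000000;

-- Switch between 1 or 2 phase aggregation
ALTER SESSION SET `planner.enable_multiphase_agg` = false;

-- Enable/disable hash-based memory-constrained operators
ALTER SESSION SET `planner.enable_hashagg` = false;
ALTER SESSION SET `planner.enable_hashjoin` = false;

-- Control parallelization
ALTER SESSION SET `planner.width.max_per_node` = 2;

-- Enable query queuing (a system-wide setting)
ALTER SYSTEM SET `exec.queue.enable` = true;

-- Inspect the planner options currently in effect
SELECT * FROM sys.options WHERE name LIKE 'planner.%';
```

Session-level settings apply only to the current connection, which makes them a low-risk way to compare query plans and profiles before and after a change.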