drill-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From a..@apache.org
Subject [1/2] drill git commit: Added link to my talk at Portland Big Data User Group.
Date Mon, 01 Dec 2014 01:42:47 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages a73cdec7f -> 7afc003a1


http://git-wip-us.apache.org/repos/asf/drill/blob/7afc003a/why.html
----------------------------------------------------------------------
diff --git a/why.html b/why.html
index 654c755..9bb1232 100755
--- a/why.html
+++ b/why.html
@@ -1,78 +1,78 @@
----
-layout: page
-title: Why Drill
-description: Top 10 Reasons to Use Drill
----
-<h2>1. Get started in minutes</h2>
-<p>It only takes a couple minutes to start working with Drill. Untar it on your Mac
or Windows laptop and run a query on a local file. No need to set up any infrastructure. No
need to define schemas. Just point at the data and drill!</p>
-{% highlight bash %}
-$ tar -xvf apache-drill-0.6.0-incubating.tar.gz
-$ apache-drill-0.6.0-incubating/bin/sqlline -u jdbc:drill:zk=local
-0: jdbc:drill:zk=local> SELECT * FROM dfs.root.`path/to/employee.json` limit 5;
-+-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+-----
-| employee_id | full_name        | first_name | last_name  | position_id | position_title
      |  store_id  | department_id | birt 
-+-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+------+
-| 1           | Sheri Nowmer     | Sheri      | Nowmer     | 1           | President    
       | 0          | 1             | 19   
-| 2           | Derrick Whelply  | Derrick    | Whelply    | 2           | VP Country Manager
  | 0          | 1             |
-| 4           | Michael Spence   | Michael    | Spence     | 2           | VP Country Manager
  | 0          | 1             |
-| 5           | Maya Gutierrez   | Maya       | Gutierrez  | 2           | VP Country Manager
  | 0          | 1             |
-| 6           | Roberta Damstra  | Roberta    | Damstra    | 3           | VP Information
Systems | 0        | 2             |
-+-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+-----
-{% endhighlight %}
-<h2>2. Schema-free JSON model</h2>
-<p>Drill is the world's first and only distributed SQL engine that doesn't require
schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. Instead of
spending weeks or months defining schemas, transforming data (ETL) and maintaining those schemas,
simply point Drill at your data (file, directory, HBase table, etc.) and run your queries.
Drill automatically understands the structure of the data. Drill's self-service approach reduces
the burden on IT and increases the productivity and agility of analysts and developers.</p>
-
-<h2>3. Query complex, semi-structured data in-situ</h2><p>Drill's schema-free
JSON model allows you to query complex, semi-structured data in situ. No need to flatten or
transform the data prior to or during query execution. Drill also provides intuitive extensions
to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to
access nested elements and arrays:</p>
-{% highlight sql %}
-SELECT * FROM (SELECT t.trans_id,
-                      t.trans_info.prod_id[0] AS prod_id,
-                      t.trans_info.purch_flag AS purchased
-               FROM `clicks/clicks.json` t) sq
-WHERE sq.prod_id BETWEEN 700 AND 750 AND
-      sq.purchased = 'true'
-ORDER BY sq.prod_id;
-{% endhighlight %}
-
-<h2>4. Real SQL - not "SQL-like"</h2>
-<p>Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language
or struggle with a semi-functional BI tool. Drill supports many data types including DATE,
INTERVAL, TIMESTAMP, VARCHAR and DECIMAL, as well as complex query constructs such as correlated
sub-queries and joins in WHERE clauses. Here is an example of a TPC-H standard query that
runs in Drill "as is":</p>
-{% highlight sql %}
-# TPC-H query 4
-SELECT  o.o_orderpriority, count(*) AS order_count
-FROM orders o
-WHERE o.o_orderdate >= date '1996-10-01'
-      AND o.o_orderdate < date '1996-10-01' + interval '3' month
-      AND EXISTS(
-                 SELECT * FROM lineitem l 
-                 WHERE l.l_orderkey = o.o_orderkey
-                 AND l.l_commitdate < l.l_receiptdate
-                 )
-      GROUP BY o.o_orderpriority
-      ORDER BY o.o_orderpriority;
-{% endhighlight %}
-
-<h2>5. Leverage standard BI tools</h2>
-<p>Drill works with standard BI tools. You can keep using the tools you love, such
as Tableau, MicroStrategy, QlikView and Excel. No need to introduce yet another visualization
or dashboard tool. Combine a self-service BI tool with the only self-service SQL engine to
enable true self-service data exploration.</p>
-
-<h2>6. Interactive queries on Hive tables</h2><p>Apache Drill lets you
leverage your investments in Hive. You can run interactive queries with Drill on your Hive
tables and access all Hive input/output formats (including custom SerDes). You can join tables
associated with different Hive metastores, and you can join a Hive table with an HBase table
or a directory of log files. Here's a simple query in Drill on a Hive table:</p>
-{% highlight sql %}
-SELECT `month`, state, sum(order_total) AS sales
-FROM hive.orders 
-GROUP BY `month`, state
-ORDER BY 3 DESC LIMIT 5;
-{% endhighlight %}
-
-<h2>7. Access multiple data sources</h2><p>Drill is designed with extensibility
in mind. It provides out-of-the-box connectivity to file systems (local or distributed file
systems such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin
to make Drill work with any other data source. Drill can combine data from multiple data sources
on the fly in a single query, with no centralized metadata definitions. Here's a query that
combines data from a Hive table, an HBase table (view) and a JSON file:</p>
-{% highlight sql %}
-SELECT custview.membership, sum(orders.order_total) AS sales
-FROM hive.orders, custview, dfs.`clicks/clicks.json` c 
-WHERE orders.cust_id = custview.cust_id AND orders.cust_id = c.user_info.cust_id 
-GROUP BY custview.membership
-ORDER BY 2;
-{% endhighlight %}
-
-<h2>8. User-Defined Functions (UDFs)</h2><p>Drill exposes a simple and
high-performance Java API to build custom functions (UDFs and UDAFs) so that you can add your
own business logic. If you have already built UDFs in Hive, you can reuse them with Drill
with no modifications. Refer to <a href="https://cwiki.apache.org/confluence/display/DRILL/Develop+Custom+Functions">Developing
Custom Functions</a> for more information.
-</p>
-
-<h2>9. High performance</h2><p>Drill is designed fround the ground up for
high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce,
Tez or Spark. As a result, Drill is able to deliver its unparalleled flexibility (schema-free
JSON model) without compromising performance. Drill's optimizer leverages rule- and cost-based
techniques, as well as data locality and operator push-down (the ability to push down query
fragments into the back-end data sources). Drill also provides a columnar and vectorized execution
engine, resulting in higher memory and CPU efficiency.</p>
-
-<h2>10. Scales from a single laptop to a 1000-node cluster</h2><p>Drill
is available as a simple download you can run on your laptop. When you're ready to analyze
larger datasets, simply deploy Drill on your Hadoop cluster (up to 1000 commodity servers).
Drill leverages the aggregate memory in the cluster to execute queries using an optimistic
pipelined model, and automatically spills to disk when the working set doesn't fit in memory.</p>
+---
+layout: page
+title: Why Drill
+description: Top 10 Reasons to Use Drill
+---
+<h2>1. Get started in minutes</h2>
+<p>It only takes a couple minutes to start working with Drill. Untar it on your Mac
or Windows laptop and run a query on a local file. No need to set up any infrastructure. No
need to define schemas. Just point at the data and drill!</p>
+{% highlight bash %}
+$ tar -xvf apache-drill-0.6.0-incubating.tar.gz
+$ apache-drill-0.6.0-incubating/bin/sqlline -u jdbc:drill:zk=local
+0: jdbc:drill:zk=local> SELECT * FROM dfs.root.`path/to/employee.json` limit 5;
++-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+-----
+| employee_id | full_name        | first_name | last_name  | position_id | position_title
      |  store_id  | department_id | birt 
++-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+------+
+| 1           | Sheri Nowmer     | Sheri      | Nowmer     | 1           | President    
       | 0          | 1             | 19   
+| 2           | Derrick Whelply  | Derrick    | Whelply    | 2           | VP Country Manager
  | 0          | 1             |
+| 4           | Michael Spence   | Michael    | Spence     | 2           | VP Country Manager
  | 0          | 1             |
+| 5           | Maya Gutierrez   | Maya       | Gutierrez  | 2           | VP Country Manager
  | 0          | 1             |
+| 6           | Roberta Damstra  | Roberta    | Damstra    | 3           | VP Information
Systems | 0        | 2             |
++-------------+------------------+------------+------------+-------------+----------------------+------------+---------------+-----
+{% endhighlight %}
+<h2>2. Schema-free JSON model</h2>
+<p>Drill is the world's first and only distributed SQL engine that doesn't require
schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. Instead of
spending weeks or months defining schemas, transforming data (ETL) and maintaining those schemas,
simply point Drill at your data (file, directory, HBase table, etc.) and run your queries.
Drill automatically understands the structure of the data. Drill's self-service approach reduces
the burden on IT and increases the productivity and agility of analysts and developers.</p>
+
+<h2>3. Query complex, semi-structured data in-situ</h2><p>Drill's schema-free
JSON model allows you to query complex, semi-structured data in situ. No need to flatten or
transform the data prior to or during query execution. Drill also provides intuitive extensions
to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to
access nested elements and arrays:</p>
+{% highlight sql %}
+SELECT * FROM (SELECT t.trans_id,
+                      t.trans_info.prod_id[0] AS prod_id,
+                      t.trans_info.purch_flag AS purchased
+               FROM `clicks/clicks.json` t) sq
+WHERE sq.prod_id BETWEEN 700 AND 750 AND
+      sq.purchased = 'true'
+ORDER BY sq.prod_id;
+{% endhighlight %}
+
+<h2>4. Real SQL - not "SQL-like"</h2>
+<p>Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language
or struggle with a semi-functional BI tool. Drill supports many data types including DATE,
INTERVAL, TIMESTAMP, VARCHAR and DECIMAL, as well as complex query constructs such as correlated
sub-queries and joins in WHERE clauses. Here is an example of a TPC-H standard query that
runs in Drill "as is":</p>
+{% highlight sql %}
+# TPC-H query 4
+SELECT  o.o_orderpriority, count(*) AS order_count
+FROM orders o
+WHERE o.o_orderdate >= date '1996-10-01'
+      AND o.o_orderdate < date '1996-10-01' + interval '3' month
+      AND EXISTS(
+                 SELECT * FROM lineitem l 
+                 WHERE l.l_orderkey = o.o_orderkey
+                 AND l.l_commitdate < l.l_receiptdate
+                 )
+      GROUP BY o.o_orderpriority
+      ORDER BY o.o_orderpriority;
+{% endhighlight %}
+
+<h2>5. Leverage standard BI tools</h2>
+<p>Drill works with standard BI tools. You can keep using the tools you love, such
as Tableau, MicroStrategy, QlikView and Excel. No need to introduce yet another visualization
or dashboard tool. Combine a self-service BI tool with the only self-service SQL engine to
enable true self-service data exploration.</p>
+
+<h2>6. Interactive queries on Hive tables</h2><p>Apache Drill lets you
leverage your investments in Hive. You can run interactive queries with Drill on your Hive
tables and access all Hive input/output formats (including custom SerDes). You can join tables
associated with different Hive metastores, and you can join a Hive table with an HBase table
or a directory of log files. Here's a simple query in Drill on a Hive table:</p>
+{% highlight sql %}
+SELECT `month`, state, sum(order_total) AS sales
+FROM hive.orders 
+GROUP BY `month`, state
+ORDER BY 3 DESC LIMIT 5;
+{% endhighlight %}
+
+<h2>7. Access multiple data sources</h2><p>Drill is designed with extensibility
in mind. It provides out-of-the-box connectivity to file systems (local or distributed file
systems such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin
to make Drill work with any other data source. Drill can combine data from multiple data sources
on the fly in a single query, with no centralized metadata definitions. Here's a query that
combines data from a Hive table, an HBase table (view) and a JSON file:</p>
+{% highlight sql %}
+SELECT custview.membership, sum(orders.order_total) AS sales
+FROM hive.orders, custview, dfs.`clicks/clicks.json` c 
+WHERE orders.cust_id = custview.cust_id AND orders.cust_id = c.user_info.cust_id 
+GROUP BY custview.membership
+ORDER BY 2;
+{% endhighlight %}
+
+<h2>8. User-Defined Functions (UDFs)</h2><p>Drill exposes a simple and
high-performance Java API to build custom functions (UDFs and UDAFs) so that you can add your
own business logic. If you have already built UDFs in Hive, you can reuse them with Drill
with no modifications. Refer to <a href="https://cwiki.apache.org/confluence/display/DRILL/Develop+Custom+Functions">Developing
Custom Functions</a> for more information.
+</p>
+
+<h2>9. High performance</h2><p>Drill is designed fround the ground up for
high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce,
Tez or Spark. As a result, Drill is able to deliver its unparalleled flexibility (schema-free
JSON model) without compromising performance. Drill's optimizer leverages rule- and cost-based
techniques, as well as data locality and operator push-down (the ability to push down query
fragments into the back-end data sources). Drill also provides a columnar and vectorized execution
engine, resulting in higher memory and CPU efficiency.</p>
+
+<h2>10. Scales from a single laptop to a 1000-node cluster</h2><p>Drill
is available as a simple download you can run on your laptop. When you're ready to analyze
larger datasets, simply deploy Drill on your Hadoop cluster (up to 1000 commodity servers).
Drill leverages the aggregate memory in the cluster to execute queries using an optimistic
pipelined model, and automatically spills to disk when the working set doesn't fit in memory.</p>


Mime
View raw message