drill-commits mailing list archives

From tshi...@apache.org
Subject drill git commit: Blog post on 2015 roadmap
Date Tue, 16 Dec 2014 21:36:10 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages 9e8de1756 -> b172960ae

Blog post on 2015 roadmap

Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/b172960a
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/b172960a
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/b172960a

Branch: refs/heads/gh-pages
Commit: b172960ae0cbf066ba6b9bfad4ef88bb7644547c
Parents: 9e8de17
Author: Tomer Shiran <tshiran@gmail.com>
Authored: Tue Dec 16 13:35:59 2014 -0800
Committer: Tomer Shiran <tshiran@gmail.com>
Committed: Tue Dec 16 13:35:59 2014 -0800

 blog/_posts/2014-12-16-whats-coming-in-2015.md | 132 ++++++++++++++++++++
 css/style.css                                  |  15 ++-
 download.html                                  |   8 +-
 3 files changed, 148 insertions(+), 7 deletions(-)

diff --git a/blog/_posts/2014-12-16-whats-coming-in-2015.md b/blog/_posts/2014-12-16-whats-coming-in-2015.md
new file mode 100644
index 0000000..182f821
--- /dev/null
+++ b/blog/_posts/2014-12-16-whats-coming-in-2015.md
@@ -0,0 +1,132 @@
+layout: post
+title: "What's Coming in 2015?"
+code: whats-coming-in-2015
+excerpt: Drill is now a top-level project, and the community is expanding rapidly. Find out
more about some of the new features planned for 2015.
+authors: ["Tomer Shiran, Apache Drill Founder, PMC Member and Committer"]
+2014 was an exciting year for the Drill community. In August we made Drill available for
downloads, and last week the Apache Software Foundation promoted Drill to a top-level project.
Many of you have asked me what's coming next, so I decided to sit down and outline some of
the interesting initiatives that the Drill community is currently working on:
+* Flexible Access Control
+* JSON in Any Shape or Form
+* Full SQL
+* New Data Sources
+* Drill/Spark Integration
+* Operational Enhancements: Speed, Scalability and Workload Management
+This is by no means intended to be an exhaustive list of everything that will be added to
Drill in 2015. With Drill's rapidly expanding community, I anticipate that you'll see a whole
lot more.
+## Flexible Access Control
+Many organizations are now interested in providing Drill as a service to their users, supporting
many users, groups and organizations with a single cluster. To do so, they need to be able
to control who can access what data. Today's volume and variety of data requires a new approach
to access control. For example, it is becoming impractical for organizations to manage a standalone,
centralized repository of permissions for every column/row of every table. Drill's virtual
datasets (views) provide a more scalable solution to access control:
+* The user creates a virtual dataset (`CREATE VIEW vd AS SELECT ...`), selecting the data
to be exposed/shared. The virtual dataset is defined as a SQL statement. For example, a virtual
dataset may represent only the records that were created in the last 30 days and don't have
the `restricted` flag. It could even mask some columns. Drill's virtual datasets (just the
SQL statement) are stored as files in the file system, so users can leverage file system permissions
to control who can access the virtual dataset, without granting access to the source data.
+* A virtual dataset is owned by a specific user and can only "select" data that the owner
has access to. The data sources (HDFS, HBase, MongoDB, etc.) are responsible for access control
decisions. Users and administrators do not need to define separate permissions inside Drill
or use yet another centralized permission repository, such as Sentry or Ranger.
+## JSON in Any Shape or Form
+When data is **Big** (as in Big Data), it is painful to copy and transform it. Users should
be able to explore the raw data without (or at least prior to) transforming it into another
format. Drill is designed to enable in-situ analytics. Just point it at a file or directory
and run the queries.
+JSON has emerged as the most common self-describing format, and Drill is able to query JSON
files out of the box. Drill currently assumes that the JSON documents (or records) are stored
sequentially in a file:
+{ "name": "Lee", "yelping_since": "2012-02" }
+{ "name": "Matthew", "yelping_since": "2011-12" }
+{ "name": "Jasmine", "yelping_since": "2010-09" }
+However, many JSON-based datasets, ranging from [data.gov](http://data.gov) (government)
datasets to Twitter API responses, are not organized as simple sequences of JSON documents.
In some cases the actual records are listed as elements of an internal array inside a single
JSON document. For example, consider the following file, which technically consists of a single
JSON document, but really contains three records (under the `data.records` field):
+{
+  "metadata": ...,
+  "data": {
+    "records": [
+      { "name": "Lee", "yelping_since": "2012-02" },
+      { "name": "Matthew", "yelping_since": "2011-12" },
+      { "name": "Jasmine", "yelping_since": "2010-09" }
+    ]
+  }
+}
+The `FLATTEN` function in Drill 0.7+ takes an array and converts each item into a top-level record:
+SELECT FLATTEN(data.records) FROM dfs.tmp.`foo.json`;
+You can use this as an inner query (or inside a view):
+> SELECT t.record.name AS name
+  FROM (SELECT FLATTEN(data.records) AS record FROM dfs.tmp.`test/foo.json`) t;
+|    name    |
+| Lee        |
+| Matthew    |
+| Jasmine    |
+While this works today, the dataset is technically a single JSON document, so Drill ends
up reading the entire dataset into memory. We're developing a FLATTEN-pushdown mechanism that
will enable the JSON reader to emit the individual records into the downstream operators,
thereby making this work with datasets of arbitrary size. Once that's implemented, users will
be able to explore any JSON-based dataset in-situ (i.e., without having to transform it).
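
To make the semantics concrete, here is a minimal Python sketch of what flattening `data.records` amounts to. It is an illustration only, not Drill's implementation: the `flatten` helper and the empty `metadata` object are assumptions made for this example. Like the current behavior described above, it parses the whole document into memory before emitting any records.

```python
import json

# Sample document shaped like the one in the post. The empty "metadata"
# object is a placeholder assumption; the post elides its contents.
RAW = """
{
  "metadata": {},
  "data": {
    "records": [
      { "name": "Lee", "yelping_since": "2012-02" },
      { "name": "Matthew", "yelping_since": "2011-12" },
      { "name": "Jasmine", "yelping_since": "2010-09" }
    ]
  }
}
"""


def flatten(doc, path):
    """Hypothetical helper: yield one row per element of the array at `path`.

    `path` is a dot-separated list of keys, e.g. "data.records".
    """
    node = doc
    for key in path.split("."):
        node = node[key]
    for item in node:
        # Each array element becomes its own top-level row.
        yield {"record": item}


# json.loads parses the entire document up front, so memory use grows
# with the size of the file.
doc = json.loads(RAW)
names = [row["record"]["name"] for row in flatten(doc, "data.records")]
print(names)
```

The outer list comprehension plays the role of the outer `SELECT t.record.name` in the query above. A pushdown-style reader would instead emit each array element as it is parsed, without building `doc` first.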
+## Full SQL
+Unlike the majority of SQL engines for Hadoop and NoSQL databases, which support SQL-like
languages (HiveQL, CQL, etc.), Drill is designed from the ground up to be compliant with ANSI
SQL. We simply started with a real SQL parser (Apache Calcite, previously known as Optiq).
We're currently implementing the remaining SQL constructs, and plan to support the full TPC-DS
suite (with no query modifications) in 2015. Full SQL support makes BI tools work better,
and enables users who are proficient with SQL to leverage their existing knowledge and skills.
+## New Data Sources
+Drill is a standalone, distributed SQL engine. It has a pluggable architecture that allows
it to support multiple data sources. Drill 0.6 includes storage plugins for:
+* [Hadoop File System](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
implementations (local file system, HDFS, MapR-FS, Amazon S3, etc.)
+* HBase and MapR-DB
+* MongoDB
+* Hive Metastore (query any dataset that is registered in Hive Metastore)
+A single query can join data from different systems. For example, a query can join user profiles
in MongoDB with log files in Hadoop, or datasets in multiple Hadoop clusters.
+I'm eager to see what storage plugins the community develops over the next 12 months. In
the last few weeks alone, developers in the community have expressed their desire (on the
[public list](mailto:dev@drill.apache.org)) to develop additional storage plugins for the
following data sources:
+* Cassandra
+* Solr
+* JDBC (any RDBMS, including Oracle, MySQL, PostgreSQL and SQL Server)
+If you're interested in implementing a new storage plugin, I would encourage you to reach
out to the Drill developer community on <dev@drill.apache.org>. I'm looking forward
to publishing an example of a single-query join across 10 data sources.
+## Drill/Spark Integration
+We're seeing growing interest in Spark as an execution engine for data pipelines, providing
an alternative to MapReduce. The Drill community is working on integrating Drill and Spark
to address a few new use cases:
+* Use a Drill query (or view) as the input to Spark. Drill is a powerful engine for extracting
and pre-processing data from various data sources, thereby reducing development time and effort.
Here's an example:
+    ```scala
+    val sc = new SparkContext(conf)
+    val result = sc.drillRDD("SELECT * FROM dfs.root.`path/to/logs` l, mongo.mydb.users u
WHERE l.user_id = u.id GROUP BY ...")
+    val formatted = result.map { r =>
+      val (first, last, visits) = (r.name.first, r.name.last, r.visits)
+      s"$first $last $visits"
+    }
+    ```
+* Use Drill to query Spark RDDs. Analysts will be able to use BI tools like MicroStrategy,
Spotfire and Tableau to query in-memory data in Spark. In addition, Spark developers will
be able to embed Drill execution in a Spark data pipeline, thereby enjoying the power of Drill's
schema-free, columnar execution engine.
+## Operational Enhancements
+As we continue with our monthly releases and march towards the 1.0 release early next year,
we're focused on improving Drill's speed and scalability. We'll also enhance Drill's multi-tenancy
with more advanced workload management.
+* **Speed**: Drill is already extremely fast, and we're going to make it even faster over
the next few months. With that said, we think that improving user productivity and time-to-insight
is as important as shaving a few milliseconds off a query's runtime.
+* **Scalability**: To date we've focused mainly on clusters of up to a couple hundred nodes.
We're currently working to support clusters with thousands of nodes. We're also improving
concurrency to better support deployments in which hundreds of analysts or developers are
running queries at the same time.
+* **Workload management**: A single cluster is often shared among many users and groups,
and everyone expects answers in real time. Workload management prioritizes the allocation
of resources to ensure that the most important workloads get done first so that business demands
can be met. Administrators need to be able to assign priorities and quotas at a fine granularity.
We're working on enhancing Drill's workload management to provide these capabilities while
providing tight integration with YARN and Mesos.
+## We Would Love to Hear From You!
+Are there other features you would like to see in Drill? We would love to hear from you:
+* Drill users: <user@drill.apache.org>
+* Drill developers: <dev@drill.apache.org>
+* Me: <tshiran@apache.org>
+Happy Drilling!  
+Tomer Shiran
\ No newline at end of file

diff --git a/css/style.css b/css/style.css
index 02a7e78..198ea49 100755
--- a/css/style.css
+++ b/css/style.css
@@ -788,7 +788,7 @@ div.download table a {
 	background-size:16px auto;
 	background-position:17px center;
-	padding:0 35px 0 45px;
+	padding:10px 35px 10px 45px;
@@ -806,12 +806,21 @@ div.download table a.dl:hover {
 div.download table a.find {
-	background-color:#1a6bc7;
+	background-color:#4aaf4c;
 div.download table a.find:hover {
-	background-color:#145aa8;
+	background-color:#348436;
+div.download table a.tutorial {
+    background-color:#1a6bc7;
+    background-image:url(../images/btn-lens.png);
+div.download table a.tutorial:hover {
+    background-color:#145aa8;
 p.info {

diff --git a/download.html b/download.html
index 032eba9..8172743 100755
--- a/download.html
+++ b/download.html
@@ -8,15 +8,15 @@ title: Download
-      <td><a href="http://www.apache.org/dyn/closer.cgi/drill/drill-0.6.0-incubating/apache-drill-0.6.0-incubating.tar.gz"
class="find" id="apachemirror" style="background-color: #4aaf4c;">FIND AN APACHE MIRROR</a></td>
+      <td><a href="http://www.apache.org/dyn/closer.cgi/drill/drill-0.6.0-incubating/apache-drill-0.6.0-incubating.tar.gz"
class="find" id="apachemirror">FIND AN APACHE MIRROR</a></td>
       <td><a href="http://getdrill.org/drill/download/apache-drill-0.6.0-incubating.tar.gz"
rel="nofollow" class="dl" id="directdownload">DIRECT FILE DOWNLOAD</a></td>
       <td><a href="http://doc.mapr.com/display/MapR/Step+1.+Install+the+MapR+Drill+ODBC+Driver"
rel="nofollow" class="dl">ODBC DRIVERS FOR DRILL*</a></td>
   <p style="margin-top:1px; padding-top:1px;">
-    <strong>Release Notes: </strong><a href="https://cwiki.apache.org/confluence/display/DRILL/Release+Notes">&nbsp;Click
here</a> &nbsp;&nbsp;|&nbsp;&nbsp;
-    <strong>Fork Drill 0.6 on GitHub: </strong><a href="https://github.com/apache/drill/tree/0.6.0-incubating"
rel="nofollow">&nbsp;Click here</a>
+    <strong>Release Notes: </strong><a href="https://cwiki.apache.org/confluence/display/DRILL/Release+Notes">
Click here</a> &nbsp;&nbsp;|&nbsp;&nbsp;
+    <strong>Fork Drill 0.6 on GitHub: </strong><a href="https://github.com/apache/drill/tree/0.6.0-incubating"
rel="nofollow">Click here</a>
@@ -26,7 +26,7 @@ title: Download
-      <td style="padding-left: 38px"><a href="https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Tutorial"
rel="nofollow" target="_blank" class="find">DRILL TUTORIAL</a></td>
+      <td style="padding-left: 38px"><a href="https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Tutorial"
rel="nofollow" target="_blank" class="tutorial">DRILL TUTORIAL</a></td>
