drill-commits mailing list archives

From dz...@apache.org
Subject [drill] 01/02: Drill provider for Airflow blog post.
Date Sun, 22 Aug 2021 17:58:50 GMT
This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git

commit aa99123c5690cfacb74925df740b02f5c3b6350b
Author: James Turton <james@somecomputer.xyz>
AuthorDate: Thu Aug 5 16:01:44 2021 +0200

    Drill provider for Airflow blog post.
---
 .../install/047-installing-drill-on-the-cluster.md |  2 +-
 ...leased.md => 2018-03-18-drill-1.13-released.md} |  0
 .../en/2021-08-05-drill-provider-for-airflow.md    | 28 ++++++++++++++++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/_docs/en/install/047-installing-drill-on-the-cluster.md b/_docs/en/install/047-installing-drill-on-the-cluster.md
index de359b0..2761af9 100644
--- a/_docs/en/install/047-installing-drill-on-the-cluster.md
+++ b/_docs/en/install/047-installing-drill-on-the-cluster.md
@@ -16,7 +16,7 @@ You install Drill on nodes in the cluster, configure a cluster ID, and add Zooke
 
 ### (Optional) Create the site directory
 
-The site directory contains your site-specific files for Drill.  Putting these in a separate directory to the Drill installation means that upgrading Drill will not clobber your configuration and custom code.  It is possible to skip this step, meaning that your configuration and custom code will live in the `$DRILL_HOME/conf` and `$DRILL_HOME/jars/3rdparty` subdirectories respectively.
+The site directory contains your site-specific files for Drill.  Putting these in a separate directory to the Drill installation means that upgrading Drill will not overwrite your configuration and custom code.  It is possible to skip this step, meaning that your configuration and custom code will live in the `$DRILL_HOME/conf` and `$DRILL_HOME/jars/3rdparty` subdirectories respectively.
 
 Create the site directory in a suitable location, e.g.
 
diff --git a/blog/_posts/en/2018-3-18-drill-1.13-released.md b/blog/_posts/en/2018-03-18-drill-1.13-released.md
similarity index 100%
rename from blog/_posts/en/2018-3-18-drill-1.13-released.md
rename to blog/_posts/en/2018-03-18-drill-1.13-released.md
diff --git a/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md b/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md
new file mode 100644
index 0000000..b643924
--- /dev/null
+++ b/blog/_posts/en/2021-08-05-drill-provider-for-airflow.md
@@ -0,0 +1,28 @@
+---
+layout: post
+title: "Drill provider for Airflow"
+code: drill-provider-for-airflow
+excerpt: In its provider package release this month, the Apache Airflow project added a provider for interacting with Apache Drill.  This allows data engineers and data scientists to incorporate Drill queries in their Airflow DAGs, enabling the automation of big data and data science workflows.
+
+authors: ["jturton"]
+---
+
+You're building a new report, visualisation or ML model.  Most of the data involved comes from sources well known to you, but a new source has become available, allowing your team to measure and model new variables.  Eager to get to a prototype and an early sense of what the new analytics look like, you head straight for the first order of business and start to construct a first version of the dataset upon which your final output will be based.
+
+The data sources you need to combine are immediately accessible but heterogeneous: transactional data in PostgreSQL must be combined with data from another team that uses Splunk, lookup data maintained by an operational team in an Excel spreadsheet, thousands of XML exports received from a partner, and some Parquet files already in your big data environment just for good measure.
+
+Using Drill iteratively, you query and join in each data source one at a time, applying grouping, filtering and other intensive transformations as you go, finally producing a dataset with the fields and grain you need.  You store it by adding `CREATE TABLE AS` in front of your final `SELECT`, then write a few counting and summing queries against the original data sources and your transformed dataset to check that your code produces the expected outputs.
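+
+Concretely, that final step might look something like the following sketch.  The storage plugin names, workspace and fields here are hypothetical; the point is that wrapping the prototype `SELECT` in `CREATE TABLE AS` is the only change needed:
+
+```sql
+-- Persist the prototype query's result set as a table.
+-- dfs.tmp is Drill's conventional writable workspace; "postgres" stands in
+-- for whatever you named your RDBMS storage plugin.
+CREATE TABLE dfs.tmp.`combined_dataset` AS
+SELECT t.customer_id, t.order_date, SUM(t.amount) AS total_amount
+FROM postgres.public.transactions t
+GROUP BY t.customer_id, t.order_date;
+```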
+
+Apart from possibly configuring some new storage plugins in the Drill web UI, you have so far not left DBeaver (or your editor of choice).  The onerous data exploration and plumbing parts of your project have flashed by in a blaze of SQL, and you move your dataset into the next tool for visualisation or modelling.  The results are good, and you know that your users will immediately ask for the outputs to incorporate new data on a regular schedule.
+
+While Drill can assemble your dataset on the fly, as it did while you prototyped, doing that for the full set takes over 20 minutes, places more load on your data sources during office hours than you'd like, and limits you to the history that the sources keep, in some cases only a few weeks.
+
+It's time for ETL, you concede.  In the past that meant choosing between keeping your working Drill SQL and scheduling it with 70s Unix tools like cron and Bash, or recreating your Drill SQL in other tools and languages, perhaps Apache Beam or PySpark, possibly needing multiple tools if none of them is as omnivorous as Drill.  But this time it's different...
+
+[Apache Airflow](https://airflow.apache.org) is a workflow engine built in the Python ecosystem that has grown into a leading choice for orchestrating big data pipelines, amongst its other applications.  Perhaps the first point to understand about Airflow in the context of ETL is that it is designed only for workflow _control_, and not for data flow.  This makes it different from some of the ETL tools you might have encountered, like Microsoft's SSIS or Pentaho's PDI, which handle the movement of data themselves.
+
+In contrast, Airflow is, unless you're doing it wrong, used only to instruct other software (Spark, Beam, PostgreSQL, Bash, Celery, Scikit-learn scripts, Slack... the list of connectors is long and varied) to kick off actions at scheduled times.  While Airflow does read its schedules in the crontab format, the comparison to cron stops there: Airflow can resolve and execute complex job DAGs with options for clustering, parallelism, retries, backfilling and performance monitoring.
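+
+For a taste of the syntax, here is a minimal, hypothetical DAG showing a crontab-style schedule, automatic retries and an ordering dependency between two placeholder tasks:
+
+```python
+from datetime import datetime, timedelta
+
+from airflow import DAG
+from airflow.operators.bash import BashOperator
+
+with DAG(
+    dag_id="example_pipeline",
+    start_date=datetime(2021, 8, 1),
+    schedule_interval="0 6 * * *",  # crontab syntax: every day at 06:00
+    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
+    catchup=False,
+) as dag:
+    extract = BashOperator(task_id="extract", bash_command="echo extract")
+    transform = BashOperator(task_id="transform", bash_command="echo transform")
+
+    # transform only runs once extract has succeeded
+    extract >> transform
+```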
+
+The exciting news for Drill users is that [a new provider package adding support for Drill](https://pypi.org/project/apache-airflow-providers-apache-drill/) was added to Airflow this month.  This provider is based on the [sqlalchemy-drill package](https://pypi.org/project/sqlalchemy-drill/), which provides Drill connectivity for Python programs.  This means that you can add tasks which execute queries on Drill to your Airflow DAGs without any hacky intermediate shell scripts, or build new [...]
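+
+On its own, sqlalchemy-drill lets any Python program talk to Drill through a SQLAlchemy engine.  A hypothetical connection, assuming a Drillbit on localhost with its REST API on the default port:
+
+```python
+from sqlalchemy import create_engine
+
+# "dfs" is the default storage plugin to query against.
+engine = create_engine("drill+sadrill://localhost:8047/dfs?use_ssl=False")
+```
+
+And here is a sketch of how the dataset build from earlier might be scheduled with the new provider's DrillOperator; the connection ID, table and query are again hypothetical:
+
+```python
+from datetime import datetime
+
+from airflow import DAG
+from airflow.providers.apache.drill.operators.drill import DrillOperator
+
+with DAG(
+    dag_id="drill_dataset_refresh",
+    start_date=datetime(2021, 8, 1),
+    schedule_interval="@daily",
+    catchup=False,
+) as dag:
+    # "drill_default" is an Airflow connection pointing at your Drillbit.
+    # In practice you would drop the old table first, or write to a dated path.
+    rebuild = DrillOperator(
+        task_id="rebuild_combined_dataset",
+        drill_conn_id="drill_default",
+        sql="""
+            CREATE TABLE dfs.tmp.`combined_dataset` AS
+            SELECT t.customer_id, t.order_date, SUM(t.amount) AS total_amount
+            FROM postgres.public.transactions t
+            GROUP BY t.customer_id, t.order_date
+        """,
+    )
+```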
+
+In the coming days a basic tutorial for using Drill with Airflow will be added to this site, and this sentence replaced with a link.
