airflow-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: Return results optionally from spark_sql_hook
Date Sat, 14 Oct 2017 08:53:19 GMT
Hi Boris,

Thank you for your question, and excuse the late response; I'm currently
on holiday.

The solution you suggest would not be my preferred choice. Extracting
results from a log with a regex is computationally expensive and
error-prone. My question is: what are you trying to accomplish?
As I see it, there are two ways of using the Spark-sql operator:

   1. ETL using Spark: instead of returning the results, write them
   back to a new table, or a new partition within the table. This data can be
   used downstream in the DAG. This also writes the data to hdfs, which is
   nice for persistence.
   2. Write the data in a simple and widely supported format (such as csv)
   onto hdfs. You can then fetch the data from hdfs to your local file-system
   using `hdfs dfs -get`, or use `hdfs dfs -cat ... | application.py` to pipe
   it to your application directly.
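
To illustrate the second option, here is a minimal sketch of what such an
`application.py` might look like. It is only an assumption about the consumer
side: the function name `read_rows`, the header row in the csv, and the `-`
argument that switches on reading from stdin are all hypothetical and not part
of Airflow or Spark.

```python
#!/usr/bin/env python
"""application.py: hypothetical consumer of csv rows piped in on stdin."""
import csv
import sys


def read_rows(stream):
    """Parse csv text from a file-like object into a list of dicts.

    Assumption: the first line of the stream is a header row.
    """
    return list(csv.DictReader(stream))


# Assumption: a single "-" argument means "read the csv from stdin".
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    for row in read_rows(sys.stdin):
        print(row)
```

With that convention it could be invoked as
`hdfs dfs -cat ... | python application.py -`.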

What you are trying to accomplish looks to me like something that would fit
a spark-submit job, where you submit a pyspark application that can fetch
the results directly from Spark:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.14 (default, Oct 11 2017 10:13:33)
SparkSession available as 'spark'.
>>> spark.sql("SELECT 1 as count").first()
Row(count=1)
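
The same call can be wrapped in a small pyspark application for spark-submit.
This is a rough sketch rather than a finished job: the file name
`fetch_result.py`, the helper `fetch_first`, the app name, and passing the
query as the only command-line argument are assumptions made for illustration.

```python
"""fetch_result.py: hypothetical pyspark application for spark-submit."""
import sys


def fetch_first(spark, query):
    """Run `query` and return the first result row as a dict, or None."""
    row = spark.sql(query).first()  # first() returns None for an empty result
    return row.asDict() if row is not None else None


def main(query):
    # Imported here so the module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fetch-result").getOrCreate()
    try:
        print(fetch_first(spark, query))
    finally:
        spark.stop()


# Assumption: the query is passed as the single command-line argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

It could then be run with something like
`spark-submit fetch_result.py "SELECT 1 as count"`.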

Most of the time we use Spark-sql to transform the data, then use sqoop
to export the data from hdfs to an rdbms to expose it to the business.
These examples are for Spark using hdfs, but for s3 it works much the same.

Does this answer your question? If not, could you elaborate on the problem
that you are facing?

Ciao, Fokko




2017-10-13 15:54 GMT+02:00 Boris <boriskey@gmail.com>:

> hi guys,
>
> I opened JIRA on this and will be working on PR
> https://issues.apache.org/jira/browse/AIRFLOW-1713
>
> any objections/suggestions conceptually?
>
> Fokko, I see you have been actively contributing to spark hooks and
> operators so I could use your opinion before I implement this.
>
> Boris
>
