airflow-dev mailing list archives

From Ruslan Dautkhanov <dautkha...@gmail.com>
Subject Fwd: Integrating with Airflow
Date Tue, 30 May 2017 14:58:26 GMT
Forwarding from Zeppelin user group.

Thought this might be interesting for Airflow users too: how others are
integrating Airflow and Zeppelin.



---------- Forwarded message ----------
From: Erik Schuchmann <erik.schuchmann@mainstreethub.com>
Date: Tue, May 30, 2017 at 8:53 AM
Subject: Re: Integrating with Airflow
To: users@zeppelin.apache.org


We have begun experimenting with an Airflow/Zeppelin integration.  We use
the first paragraph of a note to define its dependencies and outputs; its
name and owner; and its schedule.  There are utility functions (in Scala)
available that provide a data catalog for retrieving data sources.  These
functions return a DataFrame and record that note's dependency on a
particular data source, so that a DAG can be constructed between the task
that creates the input data and the Zeppelin note.  There is also a
function that provides a DataFrame writer and captures the outputs a note
produces.  This registers the note as the source for that data, which is
then available in the data catalog for other notebooks to use.  This allows
one notebook to depend on data created by another notebook.
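In Python terms, the bookkeeping behind such a catalog might look roughly like the sketch below. The real utilities are in Scala and return Spark DataFrames; every name here (`read_source`, `write_output`, the `catalog` dict) is hypothetical.

```python
# Minimal sketch of note-level dependency bookkeeping for a data catalog.
# Registry recorded while a note runs:
#   note_id -> {"inputs": set of sources read, "outputs": set of sources produced}
catalog = {}

def _entry(note_id):
    return catalog.setdefault(note_id, {"inputs": set(), "outputs": set()})

def read_source(note_id, source):
    """Record that `note_id` depends on `source`, then load it.
    In the real integration this would return a Spark DataFrame."""
    _entry(note_id)["inputs"].add(source)
    return f"dataframe:{source}"   # placeholder for the loaded data

def write_output(note_id, source, data):
    """Register `note_id` as the producer of `source`, making it
    available in the catalog for other notes to read."""
    _entry(note_id)["outputs"].add(source)
```

The point of routing all reads and writes through these two functions is that dependency edges fall out as a side effect of normal note execution, without any separate DAG declaration.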

An Airflow DAG generator (Python code) queries the Zeppelin notebook
server, looking for the results of the first paragraph of each note.  It
uses these outputs to construct the DAG between notebooks, and generates a
ZeppelinNoteOperator for each note that uses the Zeppelin REST API to
execute the notebook when the scheduler schedules that task.
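The edge-construction step described above could be sketched as follows: given each note's declared inputs and outputs, wire note A upstream of note B whenever B consumes something A produces. The metadata structure and function name are hypothetical; the real generator would then emit Airflow operators for these edges.

```python
def build_edges(notes):
    """notes: {note_id: {"inputs": [...], "outputs": [...]}}
    Returns a set of (upstream_note, downstream_note) pairs."""
    # Map each data source to the note that produces it.
    producers = {}
    for note_id, meta in notes.items():
        for out in meta["outputs"]:
            producers[out] = note_id
    # A note depends on whichever note produces each of its inputs.
    edges = set()
    for note_id, meta in notes.items():
        for src in meta["inputs"]:
            if src in producers and producers[src] != note_id:
                edges.add((producers[src], note_id))
    return edges
```

For example, if note A produces dataset "a", note B reads "a" and produces "b", and note C reads "b", this yields the chain A -> B -> C.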

We've just started to use this, so we don't have a lot of experience with it
yet.  The biggest caveats so far are:

* There is no mechanism for writing test cases against note code.
* We have to call the notebook server on every iteration of the
scheduler, i.e. whenever a DAG is initialized.  We use cached results when
available, but it still requires a round trip to the Zeppelin notebook
server.
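A minimal per-note operator along these lines might reduce to the sketch below. The job endpoint path follows Zeppelin's documented notebook-job REST API (POST /api/notebook/job/{noteId} to run all paragraphs); the class and method names are illustrative, and a real Airflow operator would subclass BaseOperator and poll for completion.

```python
import json
from urllib import request

class ZeppelinNoteRunner:
    """Hypothetical core of a ZeppelinNoteOperator: trigger a note run
    through the Zeppelin notebook-job REST API."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def job_url(self, note_id):
        # Endpoint that runs all paragraphs of the note asynchronously.
        return f"{self.base_url}/api/notebook/job/{note_id}"

    def run(self, note_id):
        req = request.Request(self.job_url(note_id), method="POST")
        with request.urlopen(req) as resp:
            return json.load(resp)
```

Each scheduled task would then just call `run(note_id)` against a shared (or per-DAG, containerized) Zeppelin instance.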

Regards,
Erik


On Fri, May 19, 2017 at 10:15 PM Ben Vogan <ben@shopkick.com> wrote:

> Thanks for sharing this Ruslan - I will take a look.
>
> I agree that paragraphs can form tasks within a DAG.  My point was that
> ideally a DAG could encompass multiple notes.  I.e. the completion of one
> note triggers another and so on to complete an entire chain of dependent
> tasks.
>
> For example team A has a note that generates data set A*.  Teams B & C
> each have notes that depend on A* to generate B* & C* for their specific
> purposes.  It doesn't make sense for all of that to have to live in one
> note, but they are all part of a single workflow.
>
> Best,
> --Ben
>
> On Fri, May 19, 2017 at 9:02 PM, Ruslan Dautkhanov <dautkhanov@gmail.com>
> wrote:
>
>> Thanks for sharing this Ben.
>>
>> I agree Zeppelin is a better fit with tighter integration with Spark and
>> built-in visualizations.
>>
>> We have pretty much standardized on pySpark, so here's one of the scripts
>> we use internally
>> to extract %pyspark, %sql and %md paragraphs into a standalone script
>> (that can be scheduled in Airflow for example)
>> https://github.com/Tagar/stuff/blob/master/znote.py (patches are welcome
>> :-)
>>
>> Hope this helps.
>>
>> ps. In my opinion, adding dependencies between paragraphs wouldn't be that
>> hard for simple cases, and could be a first step toward defining a DAG in
>> Zeppelin directly. It would be really awesome to see this type of
>> integration in the future.
>>
>> Otherwise I don't see much value if a whole note / whole workflow runs
>> as a single task in Airflow. In my opinion, each paragraph has to be a
>> task... then it'll be very useful.
>>
>>
>> Thanks,
>> Ruslan
>>
>>
>> On Fri, May 19, 2017 at 4:55 PM, Ben Vogan <ben@shopkick.com> wrote:
>>
>>> I do not expect the relationship between DAGs to be described in
>>> Zeppelin - that would be done in Airflow.  It just seems that Zeppelin is
>>> such a great tool for a data scientist's workflow that it would be nice
>>> if, once the work is done, the note could be productionized
>>> directly.  I could envision a couple of scenarios:
>>>
>>> 1. Using a zeppelin instance to run the note via the REST API.  The
>>> instance could be containerized and spun up specifically for a DAG or it
>>> could be a permanently available one.
>>> 2. A note could be pulled from git and some part of the Zeppelin engine
>>> could execute the note without the web UI at all.
>>>
>>> I would expect on the airflow side there to be some special operators
>>> for executing these.
>>>
>>> If the scheduler is pluggable, then it should be possible to create a
>>> plugin that talks to the Airflow REST API.
>>>
>>> I happen to prefer Zeppelin to Jupyter - although I get your point about
>>> both being Python.  I don't really view that as a problem - most of the big
>>> data platforms I'm talking to are implemented on the JVM, after all.  The
>>> Python part of Airflow really just describes what gets run, and it isn't
>>> hard to run something that isn't written in Python.
>>>
>>> On Fri, May 19, 2017 at 2:52 PM, Ruslan Dautkhanov <dautkhanov@gmail.com
>>> > wrote:
>>>
>>>> We also use both Zeppelin and Airflow.
>>>>
>>>> I'm interested in hearing what others are doing here too.
>>>>
>>>> Although honestly there might be some challenges:
>>>> - Airflow expects a DAG structure, while a notebook has a pretty linear
>>>> structure;
>>>> - Airflow is Python-based; Zeppelin is all Java (the REST API might be of
>>>> help?).
>>>> Jupyter + Airflow might be a more natural fit to integrate?
>>>>
>>>> On top of that, we use Zeppelin mostly for ad-hoc queries, while
>>>> Airflow is more for finalized workflows, I guess?
>>>>
>>>> Thanks for bringing this up.
>>>>
>>>>
>>>>
>>>> --
>>>> Ruslan Dautkhanov
>>>>
>>>> On Fri, May 19, 2017 at 2:20 PM, Ben Vogan <ben@shopkick.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We are really enjoying the workflow of interacting with our data via
>>>>> Zeppelin, but are not sold on the built-in cron scheduling capability.
>>>>> We would like to be able to create more complex DAGs that are better
>>>>> suited to something like Airflow.  I was curious as to whether anyone
>>>>> has done an integration of Zeppelin with Airflow.
>>>>>
>>>>> Either directly from within Zeppelin, or from the Airflow side.
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> *BENJAMIN VOGAN* | Data Platform Team Lead
>>>>>
>>>>> <http://www.shopkick.com/>
>>>>> <https://www.facebook.com/shopkick>
>>>>> <https://www.instagram.com/shopkick/>
>>>>> <https://www.pinterest.com/shopkick/>
>>>>> <https://twitter.com/shopkickbiz>
>>>>> <https://www.linkedin.com/company-beta/831240/?pathWildcard=831240>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
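The note-extraction script mentioned upthread (znote.py) boils down to something like the sketch below: read a Zeppelin note.json and emit the %pyspark, %sql and %md paragraphs as one standalone script. This is a simplified reconstruction, not the actual script; see the linked repository for the real handling.

```python
import json

# Interpreter prefixes worth keeping in the extracted script.
KEEP = ("%pyspark", "%sql", "%md")

def extract(note_json):
    """Return the text of all %pyspark/%sql/%md paragraphs in a
    Zeppelin note.json document, joined into one script."""
    note = json.loads(note_json)
    chunks = []
    for para in note.get("paragraphs", []):
        text = para.get("text") or ""
        first_line = text.split("\n", 1)[0].strip()
        if any(first_line.startswith(p) for p in KEEP):
            chunks.append(text)
    return "\n\n".join(chunks)
```

A script produced this way can then be scheduled in Airflow (e.g. via a SparkSubmitOperator or a plain BashOperator), which is the use case Ruslan describes.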
