airflow-dev mailing list archives

From Laura Lorenz <llor...@industrydive.com>
Subject Re: Flow-based Airflow?
Date Thu, 02 Feb 2017 17:36:38 GMT
This is great!

We work with a lot of external data in wildly non-standard formats, so
another enhancement we would use and support here is passing customizable
serializers to Dataflow subclasses. That would let the dataflows keyword
arg on a task handle dependency management, the Dataflow class or its
subclasses handle IO, and Serializer subclasses handle parsing.
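
Roughly what I am picturing (just a sketch; the class names and the
serializer keyword argument below are placeholders, not the API from your
PR):

    import csv
    import io


    class Serializer(object):
        """Hypothetical hook: turn stored text into Python objects and back."""

        def loads(self, raw):
            raise NotImplementedError

        def dumps(self, obj):
            raise NotImplementedError


    class CSVSerializer(Serializer):
        def loads(self, raw):
            return list(csv.reader(io.StringIO(raw)))

        def dumps(self, obj):
            buf = io.StringIO()
            csv.writer(buf).writerows(obj)
            return buf.getvalue()

    # e.g., with a hypothetical constructor:
    # S3Dataflow('s3://some-bucket/raw/latest.csv', serializer=CSVSerializer())

That way a Dataflow subclass only has to worry about where the bytes live,
and the serializer only about what they mean.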

Happy to contribute here, perhaps by creating an S3Dataflow subclass for
this PR in the style of your Google Cloud Storage one.

Laura

On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin <jlowin@apache.org> wrote:

> Great point. I think the best solution is to solve this for all XComs by
> checking object size before adding it to the DB. I don't see a built-in way
> of handling it (though apparently MySQL is internally limited to 64 KB).
> I'll look into a PR that would enforce a similar limit for all databases.
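>
> Roughly the kind of check I have in mind (a sketch only; the limit value
> and where it hooks into XCom serialization are placeholders):
>
>     import pickle
>
>     # placeholder limit, e.g. mirroring MySQL's 64 KB BLOB ceiling
>     MAX_XCOM_SIZE = 64 * 1024
>
>     def serialize_xcom_value(value):
>         serialized = pickle.dumps(value)
>         if len(serialized) > MAX_XCOM_SIZE:
>             raise ValueError(
>                 "XCom value is %d bytes; the limit is %d bytes"
>                 % (len(serialized), MAX_XCOM_SIZE))
>         return serialized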
>
> On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
>
> I'm not sure about XCom being the default; it seems pretty dangerous. It
> just takes one person who is not fully aware of the size of the data, or
> one day with an outlier, and that could put the Airflow DB in jeopardy.
>
> I guess it's always been an aspect of XCom, and it could be good to have
> some explicit gatekeeping there regardless of this PR/feature. Perhaps the
> DB itself has protection against large blobs?
>
> Max
>
> On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin <jlowin@apache.org> wrote:
>
> > Yesterday I began converting a complex script to a DAG. It turned out to
> > be a perfect test case for the dataflow model: a big chunk of data moving
> > through a series of modification steps.
> >
> > So I have built an extensible dataflow extension for Airflow on top of
> > XCom and the existing dependency engine:
> > https://issues.apache.org/jira/browse/AIRFLOW-825
> > https://github.com/apache/incubator-airflow/pull/2046 (still waiting for
> > tests... it will be quite embarrassing if they don't pass)
> >
> > The philosophy is simple:
> > Dataflow objects represent the output of upstream tasks. Downstream tasks
> > add Dataflows with a specific key. When the downstream task runs, the
> > (optionally indexed) upstream result is available in the downstream
> > context under context['dataflows'][key]. In addition, PythonOperators
> > receive the data as a keyword argument.
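> >
> > In DAG code that looks roughly like this (a sketch; the wiring that
> > registers the upstream output under a key is omitted, see the PR for
> > the actual API):
> >
> >     from datetime import datetime
> >
> >     from airflow import DAG
> >     from airflow.operators.python_operator import PythonOperator
> >
> >     dag = DAG('dataflow_example', start_date=datetime(2017, 2, 1))
> >
> >     def transform(**context):
> >         # upstream result registered under the key 'raw_data'
> >         raw = context['dataflows']['raw_data']
> >         return [row for row in raw if row]
> >
> >     transform_task = PythonOperator(
> >         task_id='transform',
> >         python_callable=transform,
> >         provide_context=True,
> >         dag=dag)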
> >
> > The basic Dataflow serializes the data through XComs, but is trivially
> > extended to alternative storage via subclasses. I have provided (in
> > contrib) implementations of a local filesystem-based Dataflow as well as
> > a Google Cloud Storage dataflow.
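> >
> > The subclass surface is meant to be small; something along these lines
> > (the method names here are only illustrative):
> >
> >     # `Dataflow` is the base class from the PR (import path omitted here).
> >     class LocalFileDataflow(Dataflow):
> >         """Keeps the intermediate result on the local filesystem."""
> >
> >         def __init__(self, path, *args, **kwargs):
> >             super(LocalFileDataflow, self).__init__(*args, **kwargs)
> >             self.path = path
> >
> >         def write(self, data):
> >             with open(self.path, 'wb') as f:
> >                 f.write(data)
> >
> >         def read(self):
> >             with open(self.path, 'rb') as f:
> >                 return f.read()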
> >
> > Laura, I hope you can have a look and see if this will bring some of your
> > requirements into Airflow as first-class citizens.
> >
> > Jeremiah
> >
>
