airflow-dev mailing list archives

From Arthur Wiedmer <arthur.wied...@gmail.com>
Subject Re: Flow-based Airflow?
Date Wed, 25 Jan 2017 18:04:48 GMT
From our own data warehouse, there are definitely cases where knowing that
the data is there is not enough. While I agree that ideally the dependency
in data should be explicit, the current dependency engine allows you to
compress some of the data dependencies by using the task dependencies.


For instance, we sometimes use additional data quality checks before we
proceed:
Data Sensor ----> DQ check ----> operator

Making the dependency explicit would instead look like this:

Data ----> DQ check ----> operator
        \_________________/

This is not inherently bad, but I feel that the dependency is redundant. Of
course, the additional checks could be somehow encoded in the dependency,
but it does not feel as clean to me, especially if the data quality check
is resource intensive.

Here my dependency is not so much on the data being available as it is on
the data being of the quality I need.
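The task chain described above can be sketched in plain Python (not the Airflow API — the task names and the tiny resolver are illustrative): the operator depends on the DQ check, which depends on the data sensor, so the data dependency is carried transitively by task dependencies.

```python
# Minimal sketch of the compressed dependency chain:
# data_sensor -> dq_check -> operator
# Upstream edges per task; names are illustrative.
deps = {
    "dq_check": ["data_sensor"],
    "operator": ["dq_check"],
}

def run_order(deps):
    """Return tasks in an order that respects the upstream edges."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for upstream in deps.get(task, []):
            visit(upstream)
        order.append(task)

    for task in deps:
        visit(task)
    return order

print(run_order(deps))  # ['data_sensor', 'dq_check', 'operator']
```

The point of the sketch: the operator never names the data directly, yet it still cannot run before the sensor has succeeded.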

Best,
Arthur

On Jan 25, 2017 7:09 AM, "Jeremiah Lowin" <jlowin@apache.org> wrote:

> At the simplest level, a data-dependency should just create an automatic
> task-dependency (since a task shouldn't run before required data is
> available). Therefore it should be possible to reason about dataflow using
> the existing dependency framework.
>
> Is there any reason that wouldn't hold for all dataflow scenarios?
>
> Then the only differentiation becomes whether a task-dependency was defined
> explicitly by a user or implicitly by a data-dependency.
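The idea in the quoted paragraph — a data-dependency automatically creating a task-dependency — can be sketched as follows. All task and data-object names are illustrative, and this is not an Airflow API:

```python
# Each task declares data inputs/outputs; task-level edges are derived
# by linking each consumer to the producer of its inputs.
tasks = {
    "extract": {"inputs": [], "outputs": ["raw_events"]},
    "clean":   {"inputs": ["raw_events"], "outputs": ["clean_events"]},
    "report":  {"inputs": ["clean_events"], "outputs": ["daily_report"]},
}

def derive_task_deps(tasks):
    """Map each data object to its producing task, then wire consumer -> producer."""
    producer = {
        out: name
        for name, spec in tasks.items()
        for out in spec["outputs"]
    }
    return {
        name: [producer[i] for i in spec["inputs"] if i in producer]
        for name, spec in tasks.items()
    }

print(derive_task_deps(tasks))
# {'extract': [], 'clean': ['extract'], 'report': ['clean']}
```

The derived edges are ordinary task dependencies, so the existing dependency framework can reason about them unchanged; the only extra bookkeeping is whether an edge was user-declared or derived.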
>
> On Tue, Jan 24, 2017 at 11:23 AM Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > I'm happy working on a design doc. I don't think Sankeys are the way to
> > go, as they are typically used to show some metric (say, the number of
> > users flowing through pages on a website), and even if we had something
> > like row counts throughout, I don't think we'd want to make the
> > visualization that centric to it.
> >
> > I think good old graphs are where it's at: either overloading the current
> > graph view with extra options (current view untouched; current view +
> > lineage, i.e. a graph where nodes are tasks or data objects, with data
> > objects drawn in a different shape; or a lineage-only view).
> >
> > On Mon, Jan 23, 2017 at 11:16 PM, Gerard Toonstra <gtoonstra@gmail.com>
> > wrote:
> >
> > > Data lineage is one of the things you mentioned in an early
> > > presentation, and I was wondering about it.
> > >
> > > I wouldn't mind setting up an initial contribution towards achieving
> > > that, but I would like to understand the subject a bit better. The
> > > easiest MVP is to use the annotations method to simply show how data
> > > flows, but you mention other things that need to be done in the third
> > > paragraph. If a wiki page could be written on the subject, explaining
> > > why those things are done, we can set up a discussion and create an
> > > epic with JIRA issues to realize that.
> > >
> > > The way I think this can be visualized is perhaps through a Sankey
> > > diagram, which helps to make complex systems more understandable, e.g.:
> > > - how is transaction margin calculated? What is all the source data?
> > > - where does customer data go, and are those systems compliant?
> > > - what is the overall data dependency between systems, and can it be
> > >   reduced?
> > > - which data gets used everywhere?
> > > - which end systems consume from the most diverse sources of data?
> > >
> > > and other questions appropriate for data lineage.
> > >
> > > Rgds,
> > >
> > > Gerard
> > >
> > >
> > > On Tue, Jan 24, 2017 at 2:04 AM, Maxime Beauchemin <
> > > maximebeauchemin@gmail.com> wrote:
> > >
> > > > A few other thoughts related to this. Early on in the project, I had
> > > > designed but never launched a feature called "data lineage
> > > > annotations", allowing people to define a list of sources and a list
> > > > of targets related to each task for documentation purposes. My idea
> > > > was to use a simple annotation string that would uniquely map to a
> > > > data object, perhaps a URI as in
> > > > `{connection_type}://{conn_id}/{something_unique}` or something to
> > > > that effect.
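The URI-shaped annotation string proposed above could be built and parsed with the standard library; the helper names and the example URI are illustrative, not part of Airflow:

```python
# Sketch of the `{connection_type}://{conn_id}/{something_unique}`
# annotation scheme using urllib.parse.
from urllib.parse import urlparse

def make_data_uri(connection_type, conn_id, obj):
    """Compose an annotation string that uniquely maps to a data object."""
    return f"{connection_type}://{conn_id}/{obj}"

def parse_data_uri(uri):
    """Split an annotation string back into its three components."""
    parsed = urlparse(uri)
    return {
        "connection_type": parsed.scheme,
        "conn_id": parsed.netloc,
        "object": parsed.path.lstrip("/"),
    }

uri = make_data_uri("hive", "hive_default", "fct_sales/ds=2017-01-25")
print(parse_data_uri(uri))
# {'connection_type': 'hive', 'conn_id': 'hive_default',
#  'object': 'fct_sales/ds=2017-01-25'}
```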
> > > >
> > > > Note that operators could also "infer" lineage based on their input
> > > > (HiveOperator could introspect the HQL statement to figure out inputs
> > > > and outputs, for instance), and users could override the inferred
> > > > lineage if so desired, either to abstract away complexity like temp
> > > > tables, to correct bad inference (SQL parsing is messy), or in cases
> > > > where operators don't implement the introspection functions.
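A very rough sketch of the inference-with-override idea above: a naive regex scan of a SQL statement for inputs and outputs. As the quoted text notes, SQL parsing is messy, so this is an illustration only; the function name and table names are made up.

```python
# Naive lineage inference from a SQL string, with a user override hook.
import re

def infer_lineage(sql, override=None):
    """Guess input/output tables from SQL; a user override wins outright."""
    if override is not None:
        return override
    inputs = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    outputs = re.findall(
        r"INSERT\s+(?:OVERWRITE\s+TABLE|INTO)\s+([\w.]+)", sql, re.IGNORECASE
    )
    return {"inputs": sorted(set(inputs)), "outputs": sorted(set(outputs))}

sql = """
INSERT OVERWRITE TABLE mart.daily_sales
SELECT * FROM staging.sales s JOIN dim.products p ON s.sku = p.sku
"""
print(infer_lineage(sql))
# {'inputs': ['dim.products', 'staging.sales'], 'outputs': ['mart.daily_sales']}
```

The override path is what lets a user hide temp tables or correct a bad parse, exactly as described above.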
> > > >
> > > > Throw a `data_object_exist(data_object_uri)` and a
> > > > `clear_data_object(data_object_uri)` method into existing hooks, add
> > > > a `BaseOperator.use_target_presence_as_state=False` boolean plus some
> > > > handling in the dependency engine and while "clearing", and we're not
> > > > too far from a solution.
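The hook methods named in the paragraph above could look like the sketch below. The method names come from the quoted text; the stand-in hook class and its in-memory store are illustrative, not real Airflow code:

```python
# Sketch of the proposed hook methods and operator flag.
class DummyHook:
    """Stand-in hook backed by an in-memory set of data-object URIs."""
    def __init__(self):
        self._objects = set()

    def data_object_exist(self, data_object_uri):
        return data_object_uri in self._objects

    def clear_data_object(self, data_object_uri):
        self._objects.discard(data_object_uri)

    def write(self, data_object_uri):
        # Helper for the sketch: pretend a task produced this object.
        self._objects.add(data_object_uri)

class BaseOperator:
    # Proposed flag: when True, the dependency engine would consult
    # target presence instead of the internal state database.
    use_target_presence_as_state = False

hook = DummyHook()
hook.write("hive://hive_default/tmp_table")
print(hook.data_object_exist("hive://hive_default/tmp_table"))  # True
hook.clear_data_object("hive://hive_default/tmp_table")
print(hook.data_object_exist("hive://hive_default/tmp_table"))  # False
```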
> > > >
> > > > As a more generic alternative, task states could potentially be
> > > > handled by a callback when so desired. For this, all we'd need to do
> > > > is add a `status_callback(dag, task, task_instance)` callback to
> > > > BaseOperator and evaluate it for state in place of the database state
> > > > where users specify it.
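The callback alternative described above can be sketched as follows; the `status_callback(dag, task, task_instance)` signature is from the quoted text, while the resolver function and the presence-based example callback are made up for illustration:

```python
# Sketch: evaluate a user-supplied status callback in place of the
# state stored in the database.
def resolve_state(dag, task, task_instance, db_state, status_callback=None):
    """Prefer the callback's answer when one is supplied."""
    if status_callback is not None:
        return status_callback(dag, task, task_instance)
    return db_state

# Example user callback: decide state from target presence.
def presence_based_state(dag, task, task_instance):
    return "success" if task_instance.get("target_exists") else "none"

ti = {"target_exists": True}
print(resolve_state("my_dag", "my_task", ti, "running", presence_based_state))
# success
```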
> > > >
> > > > Max
> > > >
> > > > On Mon, Jan 23, 2017 at 12:23 PM, Maxime Beauchemin <
> > > > maximebeauchemin@gmail.com> wrote:
> > > >
> > > > > Just commented on the blog post:
> > > > >
> > > > > ----------------------------
> > > > > I agree that workflow engines should expose a way to document the
> > > > > data objects they read from and write to, so that the engine can be
> > > > > aware of the full graph of tasks and data objects and how it all
> > > > > relates. This metadata allows for clarity around data lineage and
> > > > > potentially deeper integration with external systems.
> > > > >
> > > > > Now there's the question of whether the state of a workflow should
> > > > > be inferred based on the presence or absence of related targets. For
> > > > > this specific question I'd argue that the workflow engine needs to
> > > > > manage its own state internally. Here are a few reasons why:
> > > > >
> > > > > * Many maintenance tasks don't have a physical output, forcing the
> > > > >   creation of dummy objects representing state.
> > > > > * External systems make no guarantees as to how quickly you can
> > > > >   check for the existence of an object, so computing which tasks can
> > > > >   run may put a burden on external systems, poking at thousands of
> > > > >   data targets (related: the snakebite lib was developed in part to
> > > > >   help with the Luigi burden on HDFS).
> > > > > * How do you handle the "currently running" state? A dummy/temporary
> > > > >   output? Manage this specific state internally?
> > > > > * How do you handle a state like "skipped" in Airflow (related to
> > > > >   branching)? By creating a dummy target?
> > > > > * If you need to re-run parts of the pipeline (say a specific task
> > > > >   and everything downstream for a specific date range), you'll need
> > > > >   to go and alter/delete the presence of a potentially intricate
> > > > >   list of targets. This means the workflow engine needs to be able
> > > > >   to delete files in external systems as a way to re-run tasks. Note
> > > > >   that you may not always want to take these targets offline for the
> > > > >   duration of the backfill.
> > > > > * If some tasks use staging or temporary tables, cleaning those up
> > > > >   to regain space would re-trigger the task, so you'll have to trick
> > > > >   the system into achieving what you want (overwriting with an empty
> > > > >   target?), perhaps changing your unit of work by creating larger
> > > > >   tasks that include the temporary-table step, but that may not be
> > > > >   the unit of work that you want.
> > > > >
> > > > > From my perspective, to run a workflow engine at scale you need to
> > > > > manage its state internally, because you need strong guarantees as
> > > > > to reading and altering that state. I agree that ideally the
> > > > > workflow engine should know about input and output data objects
> > > > > (this is not the case currently in Airflow), and it would be really
> > > > > nice to be able to diff & sync state across its internal state and
> > > > > the external one (presence of targets), but that may be challenging.
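The diff-and-sync idea at the end of the quoted comment can be sketched as a comparison between the engine's internal success records and target presence in an external system. The data structures and names here are illustrative only:

```python
# Sketch: diff internal state (tasks marked successful) against external
# state (tasks whose output targets actually exist).
def diff_state(internal_success, external_targets):
    """Return tasks whose internal and external views disagree."""
    return {
        # marked successful internally, but the target is gone externally
        "missing_target": sorted(internal_success - external_targets),
        # a target exists externally with no matching internal success
        "untracked_target": sorted(external_targets - internal_success),
    }

internal = {"load_sales", "load_users"}
external = {"load_sales", "old_backfill"}
print(diff_state(internal, external))
# {'missing_target': ['load_users'], 'untracked_target': ['old_backfill']}
```

A sync step would then decide, per discrepancy, which side is authoritative — which is exactly the part the comment flags as challenging.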
> > > > >
> > > > > Max
> > > > >
> > > > > On Mon, Jan 23, 2017 at 8:05 AM, Bolke de Bruin <bdbruin@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> Hi All,
> > > > >>
> > > > >> I came across a write-up of some of the downsides in current
> > > > >> workflow management systems like Airflow and Luigi (
> > > > >> http://bionics.it/posts/workflows-dataflow-not-task-deps ), where
> > > > >> they argue dependencies should be between the inputs and outputs
> > > > >> of tasks rather than between the tasks themselves
> > > > >> (inlets/outlets).
> > > > >>
> > > > >> They extended Luigi (https://github.com/pharmbio/sciluigi) to do
> > > > >> this and even published a scientific paper on it:
> > > > >> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6
> > > > >>
> > > > >> I kind of like the idea. Has anyone played with it? Any thoughts?
> > > > >> I might want to try it in Airflow.
> > > > >>
> > > > >> Bolke
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
