airflow-dev mailing list archives

From Kevin Yang <yrql...@gmail.com>
Subject Re: [DISCUSS] AIP-12 Persist DAG into DB
Date Fri, 08 Mar 2019 11:52:39 GMT
Thanks, Xiaodong, my bad there. I attached the file to this email and also
uploaded it here <https://photos.app.goo.gl/Rr5BsHvxXEXnbY5K7> and here
<https://imgur.com/ncqqQgc>.

Cheers,
Kevin Y

On Fri, Mar 8, 2019 at 3:42 AM Deng Xiaodong <xd.deng.r@gmail.com> wrote:

> Hi Kevin,
>
> The image you attached is not displayed properly. Could you upload it
> somewhere and provide a link instead?
>
> Thanks!
>
> XD
>
> On Fri, Mar 8, 2019 at 19:38 Kevin Yang <yrqls21@gmail.com> wrote:
>
> > Hi all,
> > When I was preparing some work related to this AIP I found something very
> > concerning. I noticed this JIRA ticket
> > <https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove
> > the dagbag dependency from the webserver, which is awesome; we wanted this
> > badly but never got to start work on it. However, when I looked at some of
> > its subtasks, which try to remove the dagbag dependency from each endpoint,
> > I found the way we remove the dependency is not ideal. For example, this
> > PR <https://github.com/apache/airflow/pull/4867/files> will require us to
> > parse the DAG file each time we hit the endpoint.
> >
> > If we go down this path, we can indeed get rid of the dagbag dependency
> > easily, but we will have to 1. increase the DB load (not too concerning at
> > the moment), and 2. wait for the DAG file to be parsed before getting the
> > page back, potentially multiple times. A DAG file can sometimes take quite
> > a while to parse, e.g. we have some framework DAG files that generate a
> > large number of DAGs from static config files or even Jupyter notebooks,
> > and they can take 30+ seconds to parse. Yes, we don't like large DAG files,
> > but people do see the beauty of code as config and sometimes leverage it
> > heavily. Even assuming all users have the same nice small Python file that
> > can be parsed fast, I'm still a bit worried about this approach. Continuing
> > on this path means we've chosen DagModel to be the serialized
> > representation of a DAG, with DB columns holding the different properties.
> > It can be one candidate, but I don't know if we should settle on that now.
> > I would personally prefer a more compact and easy-to-scale representation,
> > e.g. JSON5 (such that serializing new fields does not require a DB
> > upgrade).
> >
> > As I imagine it, we would have to collect the list of dynamic features
> > that depend on unserializable fields of a DAG and start a discussion/vote
> > on dropping support for them (I'm working on this, but if anyone has
> > already done so please take over), decide on the serialized representation
> > of a DAG, and then replace dagbag with it in the webserver. Per the previous
> > discussion and some offline discussions with Dan, one future of DAG
> > serialization that I like would look similar to this:
> > [image: airflow_new_arch.jpg]
> > We can still discuss/vote on which approach we want to take, but I don't
> > want the door to the above design to be shut right now, or we will have to
> > spend a lot of effort switching paths later.
> >
> > Bas and Peter, I'm very sorry to extend the discussion, but I do think
> > this is tightly related to the AIP and the PRs behind it. And my sincere
> > apologies for bringing this up so late (I only pull the open PR list
> > occasionally; if there's a way to subscribe to new PR events I'd love to
> > know how).
> >
> > Cheers,
> > Kevin Y
> >
> > On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <pjrvanthof@gmail.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> Just some comments on the points Bolke gave in relation to my PR.
> >>
> >> First, the main focus is making the webserver stateless.
> >>
> >> > 1) Make the webserver stateless: needs the graph of the *current* dag
> >>
> >> This is the main goal, but a lot more PRs will be coming for this once
> >> my current one is merged. For the edges and graph view this is already
> >> covered in my PR.
> >>
> >> > 2) Version dags: for consistency mainly and not requiring parsing of
> >> > the dag on every loop
> >>
> >> In my PR the historical graphs will be stored for each DagRun. This
> >> means that you can see whether an older DagRun had the same graph
> >> structure, even if some tasks no longer exist in the current graph.
> >> This is especially useful for dynamic DAGs.
> >>
> >> > 3) Make the scheduler not require DAG files. This could be done if the
> >> > edges contain all information when to trigger the next task. We can
> >> > then have event driven dag parsing outside of the scheduler loop, ie. by
> >> > the cli. Storage can also be somewhere else (git, artifactory,
> >> > filesystem, whatever).
> >>
> >> The scheduler is almost untouched in this PR. The only thing added is
> >> that these edges are saved to the database, but the scheduling itself
> >> didn't change. The scheduler still depends on the DAG object.
> >>
> >> > 4) Fully serialise the dag so it becomes transferable to workers
> >>
> >> It's nice to see that people have a lot of ideas about this. But as Fokko
> >> already mentioned, this is out of scope for the issue we are trying to
> >> solve. I also have some ideas about this, but I'd like to limit this
> >> PR/AIP to the webserver.
> >>
> >> For now my PR does solve 1 and 2 and the rest of the behaviour (like
> >> scheduling) is untouched.
> >>
> >> Gr,
> >> Peter
> >>
> >>
>
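
A minimal sketch of the trade-off Kevin raises in the quoted message: keeping the whole serialized DAG in one text column, so that serializing a new field does not require a DB upgrade. Everything here is an assumption for illustration and not Airflow's actual schema or API: the table name `serialized_dag`, the helpers `save_dag` and `load_dag`, and the use of plain JSON from Python's standard `json` module in place of JSON5.

```python
import json
import sqlite3

# Hypothetical table: one row per DAG, the full serialized form in one column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
)


def save_dag(dag_id, fields):
    """Store the serializable fields of a DAG as a single JSON document."""
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag_id, json.dumps(fields, sort_keys=True)),
    )


def load_dag(dag_id):
    """Read the stored fields back without parsing any DAG file."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# First version of the serialized form.
save_dag("example", {"schedule_interval": "@daily", "tasks": ["extract", "load"]})

# Later a new field is serialized; the table definition is unchanged.
save_dag(
    "example",
    {
        "schedule_interval": "@daily",
        "tasks": ["extract", "load"],
        "doc_md": "Nightly load",
    },
)

loaded = load_dag("example")
```

With one DB column per property, adding `doc_md` would instead need a schema migration. The cost of the single-document form is that individual fields cannot be queried or indexed directly in SQL.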

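
Peter's point 2 in the quoted message, storing the graph for each DagRun so that older runs can be compared with the current structure, could look roughly like the sketch below. It is not taken from his PR: the in-memory dict stands in for a database table, and the names `snapshot_edges`, `graph_for_run`, `tasks_in` and `same_structure` are hypothetical.

```python
# Hypothetical stand-in for an edge table keyed by (dag_id, execution_date).
edge_snapshots = {}


def snapshot_edges(dag_id, execution_date, edges):
    """Record the (upstream, downstream) edges as they were for one DagRun."""
    edge_snapshots[(dag_id, execution_date)] = set(edges)


def graph_for_run(dag_id, execution_date):
    """Return the stored edges for a DagRun, or an empty set if none exist."""
    return edge_snapshots.get((dag_id, execution_date), set())


def tasks_in(edges):
    """Collect every task id that appears on either end of an edge."""
    return {task for edge in edges for task in edge}


def same_structure(dag_id, date_a, date_b):
    """True if two DagRuns of the same DAG were stored with identical edges."""
    return graph_for_run(dag_id, date_a) == graph_for_run(dag_id, date_b)


# An older run had a "transform" task that was later removed from the DAG.
snapshot_edges("etl", "2019-02-27", [("extract", "transform"), ("transform", "load")])
snapshot_edges("etl", "2019-02-28", [("extract", "load")])

old_edges = graph_for_run("etl", "2019-02-27")
new_edges = graph_for_run("etl", "2019-02-28")
removed_tasks = tasks_in(old_edges) - tasks_in(new_edges)
```

Because each run keeps its own copy of the edges, the older run can still be drawn with its `transform` task even though the current DAG no longer defines it.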