airflow-dev mailing list archives

From Kaxil Naik <kaxiln...@gmail.com>
Subject Re: Airflow DAG Serialisation
Date Mon, 29 Jul 2019 10:22:35 GMT
Thanks all for the input and thanks Zhou too for the detailed AIP.

The WIP PR can be a good first step to overall optimization.

Let's sync-up on the progress you have already made & what we want to
target.

@Jarek Potiuk <Jarek.Potiuk@polidea.com> & @Fokko - If we manage to make
it entirely backward-compatible with an enable/disable flag as we
mentioned, we can think of including it in 1.10.5, but I am in favor of
removing / cleaning up stuff like pickles, dropping Python 2 support,
cutting Airflow 2.0, and including this change there.




On Mon, Jul 29, 2019 at 1:03 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
wrote:

> Actually, I have also been doing a lot of v1-10-test merges over the last
> few months (probably several tens of them already). In fact, the conflicts
> are rarely difficult to solve. We usually have small, localised changes, and
> until we go for full Black file re-formatting we should be OK (and the
> change from Zhou seems rather small and localised).
>
> J.
>
> On Mon, Jul 29, 2019 at 9:25 AM Driesprong, Fokko <fokko@driesprong.frl>
> wrote:
>
> > I would be hesitant to merge it into 1.10.5. When I try to backport
> > anything into the 1.x branch, I get a whole bunch of merge conflicts, even
> > on the trivial tickets. For me, the only one who can really comment on this
> > would be Ash, since he's doing the bulk of the conflict resolving. Apart
> > from that, I'm really excited to make this happen!
> >
> > Cheers, Fokko
> >
> >
> >
> > On Sun, Jul 28, 2019 at 8:23 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > wrote:
> >
> > > Some thoughts I have after looking at the proposal from Zhou.
> > >
> > > I think this is one of the most important things feature-wise for Airflow.
> > > It looks like we have several in-progress attempts to solve the problem,
> > > and I guess we should agree on a common approach.
> > >
> > > I like the approach of Zhou (AIP-24) very much. It does seem to minimise
> > > the changes needed in Airflow, and it means that, with some optimisations
> > > (the caching mentioned by Fokko), it can solve the major pain points, I
> > > think relatively quickly, and is potentially portable to 1.10.5 if we have
> > > it.
> > >
> > > I wonder how much it overlaps with or differs from Kaxil's and Ash's ideas.
> > > If I read it correctly, it sounds like this idea will contain some more
> > > "fundamental" changes: ones that are likely less backwards-compatible and
> > > will potentially take longer to implement and test, while likely solving
> > > some of the problems better or even solving other problems. Am I right with
> > > my assumptions?
> > >
> > > I think more information on this might be helpful so that we all know if
> > > those are two different AIPs or whether they can be joined in one effort,
> > > and how they relate to AIP-18/AIP-19 (should those be deprecated or
> > > independently implemented?). Also, since the 2.0.0 release is half a year
> > > ahead, we should consider how it impacts the roadmap.
> > >
> > > I can see three approaches here that we as a community can follow (maybe I
> > > am missing some :) ):
> > >
> > > 1) focus our work on a single "complete" solution that will take longer
> > > and targets 2.0.0.
> > > 2) work on two of them: one quick/fast, potentially portable to 1.10.5,
> > > and one longer-term for 2.0.0.
> > > 3) decide that the simple solution we have from Zhou (maybe with some
> > > modifications) is our target solution (for both 1.10.5, if we have it, and
> > > 2.0.0).
> > >
> > > J.
> > >
> > > On Sat, Jul 27, 2019 at 11:43 AM Kevin Yang <yrqls21@gmail.com> wrote:
> > >
> > > > Nice job Zhou!
> > > >
> > > > Really excited, exactly what we wanted for the webserver scaling issue.
> > > > I want to add another big driver that got Airbnb thinking about this
> > > > previously, in support of the effort: it can not only bring consistency
> > > > between webservers but also bring consistency between the webserver and
> > > > the scheduler/workers. It may be less of a problem if the total DAG
> > > > parsing time is small, but for us the total DAG parsing time is 15+ mins
> > > > and we had to set the webserver (gunicorn subprocesses) restart interval
> > > > to 20 mins, which leads to a worst-case 15+20+15=50 min delay between the
> > > > scheduler starting to schedule things and users being able to see their
> > > > deployed DAGs/changes...
> > > >
> > > > I'm not so sure about the scheduler performance improvement: currently we
> > > > already feed the main scheduler process with SimpleDag through
> > > > DagFileProcessorManager running in a subprocess. In the future we would
> > > > feed it with data from the DB, which is likely slower (though the
> > > > difference should have negligible impact on scheduler performance). In
> > > > fact, if we keep the existing behavior of trying to schedule only freshly
> > > > parsed DAGs, then we may need to deal with a consistency issue: the DAG
> > > > processor and the scheduler race to update the flag indicating whether
> > > > the DAG is newly parsed. No big deal there, just some thoughts off the
> > > > top of my head that will hopefully be helpful.
> > > >
> > > > And good idea on pre-rendering the template; I believe template rendering
> > > > was the biggest concern in the previous discussion. We've also chosen the
> > > > pre-rendering+JSON approach in our smart sensor API
> > > > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization>
> > > > and it seems to be working fine, a supporting case for your proposal ;)
> > > > There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it,
> > > > just in case you are interested. Maybe we can even share some logic.
> > > >
> > > > Thumbs-up again for this, and please don't hesitate to reach out if you
> > > > want to discuss further with us or need any help from us.
> > > >
> > > >
> > > > Cheers,
> > > > Kevin Y
> > > >
> > > > On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <fokko@driesprong.frl>
> > > > wrote:
> > > >
> > > > > Looks great Zhou,
> > > > >
> > > > > I have one thing that pops into my mind while reading the AIP: should we
> > > > > keep the caching on the webserver level? As the famous quote goes:
> > > > > *"There are only two hard things in Computer Science: cache invalidation
> > > > > and naming things." -- Phil Karlton*
> > > > >
> > > > > Right now, the fundamental change that is being proposed in the AIP is
> > > > > fetching the DAGs from the database in a serialized format, and not
> > > > > parsing the Python files all the time. This will already give a great
> > > > > performance improvement on the webserver side because it removes a lot
> > > > > of the processing. However, since we're still fetching the DAGs from the
> > > > > database at a regular interval and caching them in the local process, we
> > > > > still have the two issues that Airflow is suffering from right now:
> > > > >
> > > > >    1. No snappy UI because it is still polling the database at a
> > > > >    regular interval.
> > > > >    2. Inconsistency between webservers because they might poll at a
> > > > >    different interval. I think we've all seen this:
> > > > >    https://www.youtube.com/watch?v=sNrBruPS3r4
> > > > >
> > > > > As I also mentioned in the Slack channel, I strongly feel that we should
> > > > > be able to render most views from the tables in the database, so without
> > > > > touching the blob. For specific views, we could just pull the blob from
> > > > > the database. In this case we always have the latest version, and we
> > > > > tackle the second point above.
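
One way to picture "pull the blob from the database" on each request, as a
sketch only. It uses an in-memory SQLite database from the Python standard
library; the table name `serialized_dag` and the helpers `write_dag` and
`read_dag` are hypothetical and are not taken from the AIP or from Airflow.

```python
import json
import sqlite3

# In-memory database standing in for the metadata DB (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
)


def write_dag(conn, dag_id, dag_dict):
    """Store (or overwrite) the JSON blob for one DAG."""
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag_id, json.dumps(dag_dict)),
    )
    conn.commit()


def read_dag(conn, dag_id):
    """Fetch the blob at request time, so no per-process cache can go stale."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


write_dag(conn, "example_dag", {"dag_id": "example_dag", "tasks": ["extract"]})
first = read_dag(conn, "example_dag")

# A later write is visible to the very next read, with no refresh interval.
write_dag(conn, "example_dag", {"dag_id": "example_dag", "tasks": ["extract", "load"]})
second = read_dag(conn, "example_dag")
```

Because every read goes to the database, two webserver processes calling
`read_dag` at the same moment see the same version, at the cost of one query
per view that needs the blob.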
> > > > >
> > > > > To tackle the first one, I also have an idea. We should change the DAG
> > > > > parser from a loop to something that uses inotify
> > > > > (https://pypi.org/project/inotify_simple/). This will change it from
> > > > > polling to an event-driven design, which is much more performant and
> > > > > less resource-hungry. But this would be an AIP on its own.
> > > > >
> > > > > Again, great design and a comprehensive AIP, but I would include the
> > > > > caching on the webserver to greatly improve the user experience in the
> > > > > UI. Looking forward to the opinion of others on this.
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Jul 27, 2019 at 1:44 AM Zhou Fang <zhoufang@google.com.invalid>
> > > > > wrote:
> > > > >
> > > > > > Hi Kaxil,
> > > > > >
> > > > > > Just sent out the AIP:
> > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > > > > >
> > > > > > Thanks!
> > > > > > Zhou
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhoufang@google.com> wrote:
> > > > > >
> > > > > > > Hi Kaxil,
> > > > > > >
> > > > > > > We are also working on persisting DAGs into the DB using JSON for the
> > > > > > > Airflow webserver in Google Composer. We aim to minimize the changes
> > > > > > > to the current Airflow code. Happy to get synced on this!
> > > > > > >
> > > > > > > Here is our progress:
> > > > > > > (1) Serializing DAGs using Pickle to be used in the webserver
> > > > > > > It has been launched in Composer. I am working on the PR to upstream
> > > > > > > it: https://github.com/apache/airflow/pull/5594
> > > > > > > Currently it does not support non-Airflow operators and we are working
> > > > > > > on a fix.
> > > > > > >
> > > > > > > (2) Caching Pickled DAGs in the DB to be used by the webserver
> > > > > > > We have a proof-of-concept implementation and are working on an AIP
> > > > > > > now.
> > > > > > >
> > > > > > > (3) Using JSON instead of Pickle in (1) and (2)
> > > > > > > We decided to use JSON because Pickle is neither secure nor
> > > > > > > human-readable. The serialization approach is very similar to (1).
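
A minimal sketch of the trade-off in point (3), assuming a simplified
dictionary stands in for a DAG. The field names and the helpers
`serialize_dag` and `deserialize_dag` are hypothetical and do not reflect the
schema used in the PR. JSON yields a text blob that can be read by eye, and
decoding it only builds plain data, whereas unpickling untrusted bytes can
execute arbitrary code.

```python
import json

# Hypothetical, simplified stand-in for a DAG; not Airflow's real data model.
dag = {
    "dag_id": "example_dag",
    "schedule_interval": "@daily",
    "tasks": [
        {"task_id": "extract", "operator": "BashOperator", "downstream": ["load"]},
        {"task_id": "load", "operator": "PythonOperator", "downstream": []},
    ],
}


def serialize_dag(dag_dict):
    """Encode the DAG description as a JSON string suitable for storing as a blob."""
    return json.dumps(dag_dict, sort_keys=True)


def deserialize_dag(blob):
    """Decode the blob; json.loads only constructs dicts, lists, and scalars."""
    return json.loads(blob)


blob = serialize_dag(dag)
restored = deserialize_dag(blob)
```

The round trip only works for values JSON can represent, so anything richer
(callables, datetimes, operator instances) would need an explicit encoding
step first.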
> > > > > > >
> > > > > > > I will update the PR (https://github.com/apache/airflow/pull/5594) to
> > > > > > > replace Pickle with JSON, and send our design of (2) as an AIP next
> > > > > > > week. Glad to check together whether our implementation makes sense
> > > > > > > and make improvements on that.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Zhou
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <kaxilnaik@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> We, at Astronomer, are going to spend time working on DAG
> > > > > > >> Serialisation. There are 2 AIPs that are somewhat related to what we
> > > > > > >> plan to work on:
> > > > > > >>
> > > > > > >>    - AIP-18 Persist all information from DAG file in DB
> > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> > > > > > >>    - AIP-19 Making the webserver stateless
> > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> > > > > > >>
> > > > > > >> We plan to use JSON as the Serialisation format and store it as a
> > > > > > >> blob in the metadata DB.
> > > > > > >>
> > > > > > >> *Goals:*
> > > > > > >>
> > > > > > >>    - Make Webserver Stateless
> > > > > > >>    - Use the same version of the DAG across Webserver & Scheduler
> > > > > > >>    - Keep backward compatibility and have a flag (globally & at DAG
> > > > > > >>    level) to turn this feature on/off
> > > > > > >>    - Enable DAG Versioning (extended Goal)
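
How a global flag and a DAG-level flag might combine, as a sketch only. The
thread does not say which level wins, so letting the DAG-level setting
override the global one is an assumption, and the function name
`serialization_enabled` is hypothetical.

```python
from typing import Optional


def serialization_enabled(global_flag: bool, dag_flag: Optional[bool] = None) -> bool:
    """Return the DAG-level setting when one is given, else the global setting."""
    if dag_flag is None:
        # The DAG did not express a preference; fall back to the global switch.
        return global_flag
    return dag_flag
```

With this rule a deployment could turn the feature on globally and still opt
a single DAG out, or leave it off globally and opt one DAG in for testing.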
> > > > > > >>
> > > > > > >>
> > > > > > >> We will prepare a proposal (AIP) after some research and some initial
> > > > > > >> work, and open it up for suggestions from the community.
> > > > > > >>
> > > > > > >> We already had some good brain-storming sessions with Twitter folks
> > > > > > >> (DanD & Sumit), folks from GoDataDriven (Fokko & Bas) & Alex (from
> > > > > > >> Uber) which will be a good starting point for us.
> > > > > > >>
> > > > > > >> If anyone in the community is interested in it or has some experience
> > > > > > >> with it and wants to collaborate, please let me know and join the
> > > > > > >> #dag-serialisation channel on Airflow Slack.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Kaxil
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> > >
> >
> > >
> >
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>


-- 
*Kaxil Naik*
*Big Data Consultant | DevOps Data Engineer*
*Certified *Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j
Developer
*LinkedIn*: https://www.linkedin.com/in/kaxil
