airflow-dev mailing list archives

From "Bacal, Eugene" <eba...@paypal.com.INVALID>
Subject Re: Airflow Dynamic tasks
Date Tue, 20 Aug 2019 14:49:25 GMT
Absolutely possible, Daniel.

We are looking in all directions. Has anyone noticed performance improvements with PostgreSQL
vs. MySQL?

-Eugene
 

On 8/15/19, 2:03 PM, "Daniel Standish" <dpstandish@gmail.com> wrote:

    It's not just the webserver and scheduler that parse your DAG file.
    During the execution of a DAG run, the DAG file is re-parsed at the start
    of every task instance.  If the file queries a database and you have 1000
    tasks running in a short period of time, that's 1000 queries.  It's possible
    these queries are piling up in a queue on your database.  DAG read time has
    to be very fast for this reason.
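
One way to keep those per-task re-parses off the database is to cache the
query result in a local file for a few minutes. This is only a minimal sketch
under stated assumptions: `fetch_tables_from_db`, `CACHE_PATH` and
`CACHE_TTL_SECONDS` are hypothetical names that do not appear in this thread,
and the placeholder query stands in for whatever the dynamic DAG file
actually runs.

```python
import json
import os
import tempfile
import time

# Hypothetical cache location and lifetime; both are assumptions.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "dag_table_list.json")
CACHE_TTL_SECONDS = 300


def fetch_tables_from_db():
    """Placeholder for the real (slow) metadata query."""
    return ["table_a", "table_b", "table_c"]


def load_tables(fetch=fetch_tables_from_db, path=CACHE_PATH, ttl=CACHE_TTL_SECONDS):
    """Return the table list, calling fetch() only when the cache is stale."""
    try:
        if time.time() - os.path.getmtime(path) < ttl:
            with open(path) as f:
                return json.load(f)
    except (OSError, ValueError):
        pass  # missing or unreadable cache: fall through and rebuild it

    tables = fetch()
    # Write to a temp file and rename it, so that a concurrent parser
    # never reads a half-written cache file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(tables, f)
    os.replace(tmp_path, path)
    return tables
```

With something like this, 1000 task starts inside the cache lifetime would
read one small local file instead of each issuing a query. The trade-off is
that a new table shows up in the DAG only after the cache expires.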
    
    
    
    On Thu, Aug 15, 2019 at 1:45 PM Bacal, Eugene <ebacal@paypal.com.invalid>
    wrote:
    
    >
    > Thank you for your reply, Max
    >
    > Dynamic DAGs query the database for tables and generate DAGs and tasks
    > based on the output.
    > Running them as plain Python scripts does not take long:
    >
    > Dynamic - 500 tasks:
    > time python PPAD_OIS_MASTER_IDI.py
    > [2019-08-15 12:57:48,522] {settings.py:174} INFO -
    > setting.configure_orm(): Using pool settings. pool_size=30, pool_recycle=300
    > real    0m1.830s
    > user    0m1.622s
    > sys     0m0.188s
    >
    >
    > Static - 100 tasks:
    > time python PPAD_OPS_CANARY_CONNECTIONS_TEST_8.py
    > [2019-08-15 12:59:24,959] {settings.py:174} INFO -
    > setting.configure_orm(): Using pool settings. pool_size=30, pool_recycle=300
    > real    0m1.009s
    > user    0m0.898s
    > sys     0m0.108s
    >
    >
    > We have 44 DAGs with 1003 dynamic tasks. Parsing during quiet time:
    > DagBag parsing time: 3.9385959999999995
    >
    > Parsing at execution time, when the scheduler submits the DAGs:
    > DagBag parsing time: 99.820316
    >
    > The delay between task runs inside a single DAG grows from 30 sec to 10 min,
    > then drops back even though tasks are still running.
    >
    > Eugene
    >
    >
    >
    >
    >
    > On 8/15/19, 11:52 AM, "Maxime Beauchemin" <maximebeauchemin@gmail.com>
    > wrote:
    >
    >     What is your dynamic DAG doing? How long does it take to execute it
    >     just as a python script (`time python mydag.py`)?
    >
    >     As an Airflow admin, you may want to lower the DAG parsing timeout
    >     configuration key to force people not to do crazy things in DAG module
    >     scope. At some point at Airbnb we had someone running a Hive query in
    >     DAG scope; clearly that needs to be prevented.
    >
    >     Loading DAGs by calling a database can bring all sorts of surprises
    >     that can drive everyone crazy. As mentioned in a recent post,
    >     repo-contained, deterministic "less dynamic" DAGs are great, because
    >     they are self-contained and allow you to use source-control properly
    >     (revert a bad change for instance). That may mean having a process or
    >     script that compiles external things that are dynamic into things like
    >     yaml files checked into the code repo. Things as simple as parsing
    >     duration become more predictable (network latency and database load
    >     are not part of that equation), but more importantly, all changes
    >     become tracked in the code repo.
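
The "compile it into a checked-in file" idea could look roughly like the
sketch below. It rests on assumptions: `compile_config`, `read_task_specs`
and the `load_<table>` task naming are hypothetical, and JSON stands in for
yaml only to keep the example free of third-party dependencies. The first
function would run out of band (cron, CI, by hand) and its output would be
committed; the second is all the DAG file would need at parse time.

```python
import json
from pathlib import Path


def compile_config(tables, out_path):
    """Offline step: snapshot the dynamic inputs into a file for the repo."""
    config = {
        "tasks": [
            {"task_id": "load_{}".format(t), "table": t} for t in sorted(tables)
        ]
    }
    # Sorted tables and sorted keys keep diffs between runs small.
    Path(out_path).write_text(json.dumps(config, indent=2, sort_keys=True) + "\n")
    return config


def read_task_specs(config_path):
    """DAG-file step: read the checked-in file; no network or database call."""
    return json.loads(Path(config_path).read_text())["tasks"]
```

In the DAG file, each returned spec would then be turned into one task.
Parsing cost becomes a local file read, and every change to the task list
shows up as a commit that can be reviewed or reverted.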
    >
    >     yaml parsing in python can be pretty slow too, and there are solutions /
    >     alternatives there. Hocon is great. Also C-accelerated yaml is possible:
    >
    >     https://stackoverflow.com/questions/27743711/can-i-speedup-yaml
    >
    >     Max
    >
    >     On Wed, Aug 14, 2019 at 9:56 PM Bacal, Eugene <ebacal@paypal.com.invalid>
    >     wrote:
    >
    >     > Hello Airflow team,
    >     >
    >     > Please advise if you can. In our environment, we have noticed that
    >     > dynamic tasks place quite a lot of stress on the scheduler and
    >     > webserver and increase MySQL DB connections.
    >     > We run about 1000 dynamic tasks every 30 min, and parsing time
    >     > increases from 5 to 65 sec, with runtime going from 2 sec to 350+.
    >     > This happens at execution time, then it drops back to normal while
    >     > tasks are still executing.
    >     > The webserver hangs for a few minutes.
    >     >
    >     > Airflow 1.10.1.
    >     > MySQL DB
    >     >
    >     > Example:
    >     >
    >     > Dynamic Tasks:
    >     > Number of DAGs: 44
    >     > Total task number: 950
    >     > DagBag parsing time: 65.879642000000001
    >     >
    >     > Static Tasks:
    >     > Number of DAGs: 73
    >     > Total task number: 1351
    >     > DagBag parsing time: 1.731088
    >     >
    >     > Is this something you are aware of? Any advice on dynamic task
    >     > optimization/best practices?
    >     >
    >     > Thank you in advance,
    >     > Eugene
    >     >
    >     >
    >     >
    >
    >
    >
    
