airflow-dev mailing list archives

From "Bacal, Eugene" <eba...@paypal.com.INVALID>
Subject Re: Airflow Dynamic tasks
Date Thu, 15 Aug 2019 20:44:30 GMT

Thank you for your reply, Max

Dynamic DAGs query the database for tables and generate DAGs and tasks based on the output.

As plain Python scripts, the DAG files do not take long to execute:

Dynamic - 500 tasks:
time python PPAD_OIS_MASTER_IDI.py
[2019-08-15 12:57:48,522] {settings.py:174} INFO - setting.configure_orm(): Using pool settings.
pool_size=30, pool_recycle=300
real	0m1.830s
user	0m1.622s
sys	0m0.188s


Static - 100 tasks:
time python PPAD_OPS_CANARY_CONNECTIONS_TEST_8.py
[2019-08-15 12:59:24,959] {settings.py:174} INFO - setting.configure_orm(): Using pool settings.
pool_size=30, pool_recycle=300
real	0m1.009s
user	0m0.898s
sys	0m0.108s
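
A minimal sketch of taking the same kind of measurement from inside Python, so that several DAG files can be compared in one run. It only times the module-scope code of each file (interpreter start-up is excluded), which is roughly the cost paid each time the file is parsed. The function names `time_module_import` and `rank_slowest` are hypothetical helpers, not part of Airflow.

```python
import importlib.util
import time
from pathlib import Path


def time_module_import(path):
    """Return the seconds spent executing the Python file at `path` as a module."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # runs all module-scope code in the file
    return time.perf_counter() - start


def rank_slowest(dag_folder, top=5):
    """Time every *.py file in `dag_folder` and return (seconds, filename), slowest first."""
    timings = [
        (time_module_import(p), p.name)
        for p in sorted(Path(dag_folder).glob("*.py"))
    ]
    return sorted(timings, reverse=True)[:top]
```

Running `rank_slowest` once during a quiet period and once while tasks are executing should show whether the slowdown comes from the DAG files themselves (for example, the database query at module scope) or from something else.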


We have 44 DAGs with 1003 dynamic tasks. Parsing during a quiet period:
DagBag parsing time: 3.9385959999999995

Parsing at execution time, when the scheduler submits the DAGs:
DagBag parsing time: 99.820316

The delay between task runs inside a single DAG grows from 30 sec to 10 min, then drops back
even though tasks are still running.
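
A rough sketch of the approach Max describes in the quoted message below: a separate step snapshots the table list into a file that is checked into the repo, and the DAG file reads only that file at parse time. JSON from the standard library stands in for yaml here, and `fetch_tables`, `compile_table_list`, `load_task_ids` and the `load_` task-id prefix are all hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def compile_table_list(fetch_tables, out_path):
    """Snapshot the table list into a sorted JSON file that can be committed.

    `fetch_tables` is a hypothetical callable wrapping the real database query.
    Sorting and de-duplicating keeps the file stable, so diffs show real changes only.
    """
    tables = sorted(set(fetch_tables()))
    Path(out_path).write_text(json.dumps({"tables": tables}, indent=2) + "\n")
    return tables


def load_task_ids(config_path):
    """Called from the DAG file: reads the snapshot, with no database or network access."""
    config = json.loads(Path(config_path).read_text())
    return ["load_{}".format(name) for name in config["tables"]]
```

The compile step would run outside the scheduler (by hand or from a scheduled job), and the DAG file would loop over `load_task_ids(...)` to create its tasks, so parse time no longer depends on database load.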

Eugene
 




On 8/15/19, 11:52 AM, "Maxime Beauchemin" <maximebeauchemin@gmail.com> wrote:

    What is your dynamic DAG doing? How long does it take to execute it just as
    a python script (`time python mydag.py`)?
    
    As an Airflow admin, you may want to lower the DAG parsing timeout
    configuration key to keep people from doing crazy things in DAG module
    scope. At some point at Airbnb we had someone running a Hive query in DAG
    scope; clearly that needs to be prevented.
    
    Loading DAGs by calling a database can bring all sorts of surprises that
    can drive everyone crazy. As mentioned in a recent post, repo-contained,
    deterministic "less dynamic" DAGs are great, because they are
    self-contained and allow you to use source-control properly (revert a bad
    change for instance). That may mean having a process or script that
    compiles external things that are dynamic into things like yaml files
    checked into the code repo. Things as simple as parsing duration become
    more predictable (network latency and database load are not part of that
    equation), but more importantly, all changes become tracked in the code
    repo.
    
    yaml parsing in python can be pretty slow too, and there are solutions /
    alternatives there. Hocon is great. Also C-accelerated yaml is possible:
    https://stackoverflow.com/questions/27743711/can-i-speedup-yaml
    
    Max
    
    On Wed, Aug 14, 2019 at 9:56 PM Bacal, Eugene <ebacal@paypal.com.invalid>
    wrote:
    
    > Hello Airflow team,
    >
    > Please advise if you can. In our environment, we have noticed that dynamic
    > tasks place quite a lot of stress on the scheduler and webserver and
    > increase MySQL DB connections.
    > We run about 1000 dynamic tasks every 30 min, and parsing time
    > increases from 5 to 65 sec, with runtime going from 2 sec to 350+. This
    > happens at execution time, then it drops to normal while still executing
    > tasks. The webserver hangs for a few minutes.
    >
    > Airflow 1.10.1.
    > MySQL DB
    >
    > Example:
    >
    > Dynamic Tasks:
    > Number of DAGs: 44
    > Total task number: 950
    > DagBag parsing time: 65.879642000000001
    >
    > Static Tasks:
    > Number of DAGs: 73
    > Total task number: 1351
    > DagBag parsing time: 1.731088
    >
    > Is this something you are aware of? Any advice on dynamic task
    > optimization/best practices?
    >
    > Thank you in advance,
    > Eugene
    >
    >
    >
    
