airflow-dev mailing list archives

From "Bacal, Eugene" <eba...@paypal.com.INVALID>
Subject Re: Airflow Dynamic tasks
Date Wed, 21 Aug 2019 21:36:00 GMT
Hi Max, 

We ran a few tests today on the DB side and noticed that:

	- DB connections are not being re-used
	- Connections stay open even though only 5 are active:
		mysql>  show global status like 'Thread%';
		+-------------------------+---------+
		| Variable_name           | Value   |
		+-------------------------+---------+
		| Threadpool_idle_threads | 0       |
		| Threadpool_threads      | 0       |
		| Threads_cached          | 775     |
		| Threads_connected       | 5323    |
		| Threads_created         | 4846609 |
		| Threads_running         | 5       |
		+-------------------------+---------+ 
 	- Workers create hundreds of connections that remain open until the DB clears them (900 sec)
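A rough back-of-envelope check makes the connection counts plausible. This is a sketch: the host and concurrency figures come from the worker config quoted later in this thread, and the one-connection-per-task figure is an assumption.

```python
# Hypothetical estimate; worker_hosts and worker_concurrency come from
# the worker .cfg quoted below, conns_per_task is an assumption.
worker_hosts = 12          # bare-metal worker boxes
worker_concurrency = 100   # [celery] worker_concurrency per host
conns_per_task = 1         # each forked task process opens its own connection

# Pools are per-process, so sql_alchemy_pool_size is never shared across
# task processes; every concurrent task can hold its own connection.
peak_worker_conns = worker_hosts * worker_concurrency * conns_per_task
print(peak_worker_conns)  # 1200
```

Even before accounting for retries or lingering sessions, the fleet can hold over a thousand connections open at once.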

It seems similar cases were reported some time back (2016):
https://issues.apache.org/jira/browse/AIRFLOW-680
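One thing AIRFLOW-680 points at: a QueuePool inside a short-lived forked task process never gets to reuse its connections, so they sit idle until pool_recycle expires. A minimal sketch (placeholder SQLite URL; in Airflow this would be the sql_alchemy_conn database) of disabling pooling with SQLAlchemy's NullPool, so each connection closes as soon as it is checked in:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Placeholder in-memory URL for illustration; Airflow would use
# the configured sql_alchemy_conn instead.
engine = create_engine("sqlite://", poolclass=NullPool)

with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()
print(value)  # 1

# With NullPool, the underlying DB connection is closed on checkin
# instead of lingering until pool_recycle (900 sec here) expires.
```

Whether that trade-off (reconnect cost vs. idle connections) is a win depends on how chatty each task is with the metadata DB.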


On 8/20/19, 5:19 PM, "Bacal, Eugene" <ebacal@paypal.com.INVALID> wrote:

    Celery executor.
    12 bare-metal boxes, each with 40 CPUs @ 2494 MHz and 378G RAM
    
    Worker .cfg:
    [core]
    sql_alchemy_pool_size = 5
    sql_alchemy_pool_recycle = 900
    sql_alchemy_reconnect_timeout = 300
    parallelism = 1200
    dag_concurrency = 800
    non_pooled_task_slot_count = 1200
    max_active_runs_per_dag = 10
    dagbag_import_timeout = 30
    [celery]
    worker_concurrency = 100
    
    Scheduler .cfg:
    [core]
    sql_alchemy_pool_size = 30
    sql_alchemy_pool_recycle = 300
    sql_alchemy_reconnect_timeout = 300
    parallelism = 1200
    dag_concurrency = 800
    non_pooled_task_slot_count = 1200
    max_active_runs_per_dag = 10
    [scheduler]
    job_heartbeat_sec = 5
    scheduler_heartbeat_sec = 5
    run_duration = 1800
    min_file_process_interval = 10
    min_file_parsing_loop_time = 1
    dag_dir_list_interval = 300
    print_stats_interval = 30
    scheduler_zombie_task_threshold = 300
    max_tis_per_query = 1024
    max_threads = 29
    
    
    From the workers I see 350+ connections at start time; it then drops to 200,
    and to 1-10 once tasks complete.
    From the scheduler it's very low, 1-10:
    MySQL connections:
        331 worker1
        215 worker2
        349 worker53
        335 worker54
        347 worker55
        336 worker56
        336 worker57
        354 worker58
        339 worker59
        328 worker60
        333 worker61
        337 worker62
         2 scheduler
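For reference, per-host counts like the list above can be derived by grouping the Host column of SHOW PROCESSLIST (which has the form client:port) by client. A small sketch with made-up rows; real code would fetch them via a MySQL client:

```python
from collections import Counter

# Made-up processlist rows for illustration; the Host column of
# SHOW PROCESSLIST has the form "client:port".
rows = ["worker1:40112", "worker1:40113", "worker2:53311", "scheduler:6021"]

# Strip the port and count connections per client host.
per_host = Counter(host.rsplit(":", 1)[0] for host in rows)
for client, count in per_host.most_common():
    print(count, client)
```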
    
    
    
    - Eugene
     
    
    On 8/20/19, 8:51 AM, "Maxime Beauchemin" <maximebeauchemin@gmail.com> wrote:
    
        Delay between tasks could be due to not having enough worker slots. What
        type of executor are you using, how is it configured?
        
        Max
        
        On Tue, Aug 20, 2019 at 7:50 AM Bacal, Eugene <ebacal@paypal.com.invalid>
        wrote:
        
        > Absolutely possible, Daniel,
        >
        > We are looking in all directions. Has anyone noticed performance
        > improvements with PostgreSQL vs MySQL ?
        >
        > -Eugene
        >
        >
        > On 8/15/19, 2:03 PM, "Daniel Standish" <dpstandish@gmail.com> wrote:
        >
        >     It's not just the webserver and scheduler that will parse your DAG
        >     file. During the execution of a DAG run, the DAG file will be
        >     re-parsed at the start of every task instance. If you have 1000
        >     tasks running in a short period of time, that's 1000 queries. It's
        >     possible these queries are piling up in a queue on your database.
        >     DAG read time has to be very fast for this reason.
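Daniel's point can be made concrete with quick arithmetic, using the quiet-time parse figure quoted elsewhere in this thread:

```python
# Rough illustration (numbers from this thread): each task instance
# re-parses the DAG file, so the parse cost is paid once per task.
parse_seconds = 3.9   # quiet-time DagBag parsing time
tasks = 1000          # dynamic tasks launched in a short window

total_minutes = parse_seconds * tasks / 60
print(round(total_minutes, 1))  # 65.0
```

About an hour of cumulative parse work is squeezed into a 30-minute window, which is consistent with the parse-time spikes observed at execution time.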
        >
        >
        >
        >     On Thu, Aug 15, 2019 at 1:45 PM Bacal, Eugene
        > <ebacal@paypal.com.invalid>
        >     wrote:
        >
        >     >
        >     > Thank you for your reply, Max
        >     >
        >     > Dynamic DAGs query the database for tables and generate DAGs and
        >     > tasks based on the output.
        >     > The Python itself does not take much time to execute:
        >     >
        >     > Dynamic - 500 tasks:
        >     > time python PPAD_OIS_MASTER_IDI.py
        >     > [2019-08-15 12:57:48,522] {settings.py:174} INFO -
        >     > setting.configure_orm(): Using pool settings. pool_size=30,
        > pool_recycle=300
        >     > real    0m1.830s
        >     > user    0m1.622s
        >     > sys     0m0.188s
        >     >
        >     >
        >     > Static - 100 tasks:
        >     > time python PPAD_OPS_CANARY_CONNECTIONS_TEST_8.py
        >     > [2019-08-15 12:59:24,959] {settings.py:174} INFO -
        >     > setting.configure_orm(): Using pool settings. pool_size=30,
        > pool_recycle=300
        >     > real    0m1.009s
        >     > user    0m0.898s
        >     > sys     0m0.108s
        >     >
        >     >
        >     > We have 44 DAGs with 1003 dynamic tasks. Parsing in quiet time:
        >     > DagBag parsing time: 3.9385959999999995
        >     >
        >     > Parsing at execution time, when the scheduler submits the DAGs:
        >     > DagBag parsing time: 99.820316
        >     >
        >     > The delay between task runs inside a single DAG grows from 30 sec
        >     > to 10 min, then drops back even though tasks are still running.
        >     >
        >     > Eugene
        >     >
        >     >
        >     >
        >     >
        >     >
        >     > On 8/15/19, 11:52 AM, "Maxime Beauchemin" <
        > maximebeauchemin@gmail.com>
        >     > wrote:
        >     >
        >     >     What is your dynamic DAG doing? How long does it take to
        >     >     execute it just as a python script (`time python mydag.py`)?
        >     >
        >     >     As an Airflow admin, you may want to lower the DAG parsing
        >     >     timeout configuration key to force people not to do crazy
        >     >     things in DAG module scope. At some point at Airbnb we had
        >     >     someone running a Hive query in DAG scope; clearly that needs
        >     >     to be prevented.
        >     >
        >     >     Loading DAGs by calling a database can bring all sorts of
        >     >     surprises that can drive everyone crazy. As mentioned in a
        >     >     recent post, repo-contained, deterministic "less dynamic" DAGs
        >     >     are great, because they are self-contained and allow you to
        >     >     use source control properly (revert a bad change, for
        >     >     instance). That may mean having a process or script that
        >     >     compiles external things that are dynamic into things like
        >     >     yaml files checked into the code repo. Things as simple as
        >     >     parsing duration become more predictable (network latency and
        >     >     database load are not part of that equation), but more
        >     >     importantly, all changes become tracked in the code repo.
        >     >
        >     >     yaml parsing in python can be pretty slow too, and there are
        >     >     solutions / alternatives there. Hocon is great. Also
        >     >     C-accelerated yaml is possible:
        >     >
        >     >     https://stackoverflow.com/questions/27743711/can-i-speedup-yaml
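A sketch of the C-accelerated loader idea (requires PyYAML; falls back to the pure-Python loader when the libyaml bindings aren't compiled in):

```python
import yaml

# Use the libyaml-backed CSafeLoader when available, else fall back to
# the pure-Python SafeLoader.
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

doc = yaml.load("tasks:\n  - extract\n  - load\n", Loader=Loader)
print(doc["tasks"])  # ['extract', 'load']
```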
        >     >
        >     >     Max
        >     >
        >     >     On Wed, Aug 14, 2019 at 9:56 PM Bacal, Eugene
        >     > <ebacal@paypal.com.invalid>
        >     >     wrote:
        >     >
        >     >     > Hello Airflow team,
        >     >     >
        >     >     > Please advise if you can. In our environment, we have
        >     >     > noticed that dynamic tasks place quite a bit of stress on
        >     >     > the scheduler and webserver, and increase MySQL DB
        >     >     > connections.
        >     >     > We run about 1000 dynamic tasks every 30 min, and parsing
        >     >     > time increases from 5 to 65 sec, with runtime from 2 sec to
        >     >     > 350+. This happens at execution time, then it drops back to
        >     >     > normal while tasks are still executing. The webserver hangs
        >     >     > for a few minutes.
        >     >     >
        >     >     > Airflow 1.10.1.
        >     >     > MySQL DB
        >     >     >
        >     >     > Example:
        >     >     >
        >     >     > Dynamic Tasks:
        >     >     > Number of DAGs: 44
        >     >     > Total task number: 950
        >     >     > DagBag parsing time: 65.879642000000001
        >     >     >
        >     >     > Static Tasks:
        >     >     > Number of DAGs: 73
        >     >     > Total task number: 1351
        >     >     > DagBag parsing time: 1.731088
        >     >     >
        >     >     > Is this something you are aware of? Any advice on dynamic
        >     >     > task optimization / best practices?
        >     >     >
        >     >     > Thank you in advance,
        >     >     > Eugene
        >     >     >
        >     >     >
        >     >     >
        >     >
        >     >
        >     >
        >
        >
        >
        
    
    
