airflow-dev mailing list archives

From harish singh <harish.sing...@gmail.com>
Subject Re: airflow (1.7.0): cpu utilization reaches 70% and above
Date Mon, 13 Jun 2016 17:58:08 GMT
One more thing (just digging in more, sorry :) ):
When I set scheduler_heartbeat_sec to 60 seconds, I see this in my logs:


[2016-06-13 17:53:59,551] {models.py:240} INFO - Failing jobs without heartbeat after 2016-06-13 17:50:29.551071
Is this because my job_heartbeat_sec is set to 5 seconds?

Also, there is not much documentation for the [scheduler] settings. I think
documenting them would help people get started on tweaking these.

What is the general guideline on how to set job_heartbeat_sec and
scheduler_heartbeat_sec?
Should both always be the same?
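
For what it's worth, this is the [scheduler] block I am experimenting with.
The values are just what I currently have set, not a recommendation, and the
comments reflect only my current understanding of each setting:

[scheduler]
job_heartbeat_sec = 5          # how often running task jobs heartbeat / listen for external kill signals
scheduler_heartbeat_sec = 60   # minimum duration of one scheduling cycle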

On Mon, Jun 13, 2016 at 10:39 AM, Ben Tallman <ben@apigee.com> wrote:

> In the past, I have written/seen systems where the pattern is that a task
> runner/worker is in charge of scheduling the next tasks that need to run
> when a task completes, and the scheduler only handles "issues" and initial
> kickoffs...
>
> Opens another can of worms, but I think I've seen discussion on here of
> that idea and I thought I'd put my vote in for that pattern.
>
> On Mon, Jun 13, 2016 at 10:35 AM harish singh <harish.singh22@gmail.com>
> wrote:
>
> > So I changed the scheduler heartbeat to 60 seconds:
> >
> > [scheduler]
> > job_heartbeat_sec = 5
> > scheduler_heartbeat_sec = 60
> >
> > As expected, we now see spiky CPU utilization.
> >
> >
> > But I see a different issue (not so much a problem, but I am just putting
> > it here in case it helps someone who needs to do something similar).
> >
> > Suppose I have a DAG with tasks as below: A --> B --> C
> >
> > Say A starts and completes its execution in 10 seconds.
> >
> > The next task, B, cannot start until the next scheduler heartbeat. That
> > means I may have to wait roughly 60 seconds in the worst case.
> >
> > So one option is to set the heartbeat to around 30 seconds (or the
> > duration of whatever you believe is your least time-consuming task). This
> > is just a rough optimization to make sure we make progress soon enough
> > after a task ends.
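> >
> > Just to illustrate, here is a minimal sketch of that A --> B --> C chain
> > (the dag_id, commands and dates are made up; this is only meant to show
> > the dependency shape, not a real DAG of ours):
> >
> > from datetime import datetime
> >
> > from airflow import DAG
> > from airflow.operators.bash_operator import BashOperator
> >
> > dag = DAG('heartbeat_demo', start_date=datetime(2016, 6, 1),
> >           schedule_interval='@hourly')
> >
> > a = BashOperator(task_id='A', bash_command='sleep 10', dag=dag)
> > b = BashOperator(task_id='B', bash_command='echo B', dag=dag)
> > c = BashOperator(task_id='C', bash_command='echo C', dag=dag)
> >
> > # B only becomes runnable on a scheduler pass after A has finished, so
> > # with scheduler_heartbeat_sec = 60 there can be close to a minute of
> > # idle time between A completing and B starting.
> > b.set_upstream(a)
> > c.set_upstream(b)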
> >
> >
> > Thanks,
> > Harish
> >
> >
> > On Mon, Jun 13, 2016 at 9:50 AM, harish singh <harish.singh22@gmail.com>
> > wrote:
> >
> > > Yup, it is the scheduler process using the CPU:
> > >
> > >
> > > Below is the usage with the default out-of-the-box heartbeat settings.
> > > I will paste more numbers here as I play with airflow.cfg.
> > > This is the output of `top`:
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> > >    65 airflow   20   0  445488 112648  13308 R  61.3  0.7 834:21.23 airflow
> > >     1 airflow   20   0   17968   1912   1668 S   0.0  0.0   0:00.04 entrypoint.sh
> > >    64 airflow   20   0  407972  74876  13328 S   0.0  0.5   0:01.12 airflow
> > >   154 airflow   20   0  404404  62528   4344 S   0.0  0.4   0:00.11 airflow
> > >   155 airflow   20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
> > >   156 airflow   20   0  404404  62532   4344 S   0.0  0.4   0:00.12 airflow
> > >   157 airflow   20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
> > >   158 airflow   20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
> > >   159 airflow   20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
> > >   160 airflow   20   0  404404  62488   4300 S   0.0  0.4   0:00.12 airflow
> > >   161 airflow   20   0  404404  62468   4280 S   0.0  0.4   0:00.11 airflow
> > >   163 airflow   20   0   50356  16680   5956 S   0.0  0.1   0:08.86 gunicorn: maste
> > >   168 airflow   20   0  457204 119940  13080 S   0.0  0.8   0:18.86 gunicorn: worke
> > >   170 airflow   20   0  463168 126028  13080 S   0.0  0.8   0:13.81 gunicorn: worke
> > >   171 airflow   20   0  464936 127672  13080 S   0.0  0.8   0:08.53 gunicorn: worke
> > >   174 airflow   20   0  467460 130240  13080 S   0.0  0.8   0:08.52 gunicorn: worke
> > >  7627 airflow   20   0   18208   3192   2672 S   0.0  0.0   0:00.04 bash
> > >  8091 airflow   20   0   30312   7556   4784 S   0.0  0.0   0:00.00 python
> > > 28808 airflow   20   0   18208   3352   2836 S   0.0  0.0   0:00.08 bash
> > > 28822 airflow   20   0   19844   2340   2020 R   0.0  0.0   0:00.01 top
> > >
> > >
> > > PID 65 is the scheduler.
> > >
> > >
> > > airflow@80be4e775e55:~$ ps -ef
> > > UID        PID  PPID  C STIME TTY          TIME CMD
> > > airflow      1     0  0 Jun12 ?        00:00:00 /bin/bash ./entrypoint.sh
> > > airflow     64     1  0 Jun12 ?        00:00:01 /usr/bin/python /usr/local/bin/airflow webserver -p 8080
> > > airflow     65     1 62 Jun12 ?        13:54:22 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    154    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    155    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    156    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    157    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    158    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    159    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    160    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    161    65  0 Jun12 ?        00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> > > airflow    163    64  0 Jun12 ?        00:00:08 gunicorn: master [airflow.www.app:cached_app()]
> > > airflow    168   163  0 Jun12 ?        00:00:18 gunicorn: worker [airflow.www.app:cached_app()]
> > > airflow    170   163  0 Jun12 ?        00:00:13 gunicorn: worker [airflow.www.app:cached_app()]
> > > airflow    171   163  0 Jun12 ?        00:00:08 gunicorn: worker [airflow.www.app:cached_app()]
> > > airflow    174   163  0 Jun12 ?        00:00:08 gunicorn: worker [airflow.www.app:cached_app()]
> > > airflow   7627     0  0 Jun12 ?        00:00:00 bash
> > > airflow   8091  7627  0 Jun12 ?        00:00:00 python
> > > airflow  28808     0  0 16:44 ?        00:00:00 bash
> > > airflow  28823 28808  0 16:44 ?        00:00:00 ps -ef
> > >
> > > On Mon, Jun 13, 2016 at 8:42 AM, Maxime Beauchemin
> > > <maximebeauchemin@gmail.com> wrote:
> > >
> > >> Can you confirm that it's the scheduler process using that CPU?
> > >>
> > >> The SCHEDULER_HEARTBEAT_SEC configuration defines a minimum duration for
> > >> scheduling cycles, where the scheduler evaluates all active DagRuns and
> > >> attempts to kick off task instances whose dependencies are met. Once the
> > >> cycle is done, the scheduler should sleep until the next heartbeat, so
> > >> CPU usage should look spiky.
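> > >>
> > >> Roughly, and only as an illustration (this is not the actual scheduler
> > >> code, and run_one_scheduling_pass is a made-up placeholder), the loop
> > >> looks like this:
> > >>
> > >> import time
> > >>
> > >> SCHEDULER_HEARTBEAT_SEC = 60  # [scheduler] scheduler_heartbeat_sec in airflow.cfg
> > >>
> > >> def run_one_scheduling_pass():
> > >>     # placeholder: evaluate active DagRuns and queue task instances
> > >>     # whose dependencies are met
> > >>     pass
> > >>
> > >> while True:
> > >>     start = time.time()
> > >>     run_one_scheduling_pass()
> > >>     elapsed = time.time() - start
> > >>     # sleep out the remainder of the heartbeat interval; the CPU idles
> > >>     # here, which is why utilization should look spiky rather than flat
> > >>     time.sleep(max(0, SCHEDULER_HEARTBEAT_SEC - elapsed))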
> > >>
> > >> Max
> > >>
> > >> On Mon, Jun 13, 2016 at 8:26 AM, harish singh
> > >> <harish.singh22@gmail.com> wrote:
> > >>
> > >> > Yup, I tried changing the scheduler heartbeat to 60 seconds.
> > >> > Apart from not getting any update for 60 seconds, what are the side
> > >> > effects of changing the two heartbeats? It shouldn't impact
> > >> > performance, right?
> > >> >
> > >> > Also, I understand this CPU usage if there are hundreds of DAGs. But
> > >> > with just one active DAG, doesn't 70% seem high? Especially in my case,
> > >> > where there are only 10 tasks in the DAG and they only make curl calls
> > >> > (BashOperators).
> > >> >
> > >> > Also, as a side note, in a different environment where we have 10 DAGs
> > >> > active, the CPU usage stays in the same 70-80% range.
> > >> >
> > >> > On Mon, Jun 13, 2016, 8:14 AM Maxime Beauchemin
> > >> > <maximebeauchemin@gmail.com> wrote:
> > >> >
> > >> > > The scheduler constantly attempts to schedule tasks, interacting with
> > >> > > the database and reloading DAG definitions. In most large-ish
> > >> > > environments, burning up to a full CPU core to run the scheduler
> > >> > > doesn't seem outrageous to me.
> > >> > >
> > >> > > If you want to reduce the CPU load related to the scheduler, check out
> > >> > > SCHEDULER_HEARTBEAT_SEC and MAX_THREADS in the [scheduler] section of
> > >> > > `airflow.cfg`.
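> > >> > >
> > >> > > For example (the values here are only illustrative, and max_threads may
> > >> > > not be available in every 1.7.x release, so check the default config
> > >> > > that ships with your version):
> > >> > >
> > >> > > [scheduler]
> > >> > > scheduler_heartbeat_sec = 30   # longer interval = fewer scheduling passes, less CPU
> > >> > > max_threads = 2                # number of scheduler threads, where supported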
> > >> > >
> > >> > > Max
> > >> > >
> > >> > > On Sun, Jun 12, 2016 at 1:24 PM, harish singh
> > >> > > <harish.singh22@gmail.com> wrote:
> > >> > >
> > >> > > > Hi guys,
> > >> > > >
> > >> > > > We have been running Airflow (for about 3 months now) inside a Docker
> > >> > > > container on AWS.
> > >> > > >
> > >> > > > I just ran `docker stats` to check what's going on. The CPU
> > >> > > > consumption is huge.
> > >> > > > We have around 15 DAGs. Only one DAG is turned ON; the remaining are OFF.
> > >> > > > The DAG runs on an HOURLY schedule.
> > >> > > >
> > >> > > > Right now, Airflow is consuming almost one complete core.
> > >> > > > It seems there is some unnecessary spinning?
> > >> > > > This doesn't look like the right behavior.
> > >> > > > Is there a bug already filed for this? Or maybe there is something
> > >> > > > incorrect in the way I am using the Airflow configuration.
> > >> > > >
> > >> > > > CONTAINER   CPU %    MEM USAGE / LIMIT     MEM %    NET I/O               BLOCK I/O
> > >> > > > CCC         68.17%   619.7 MB / 2.147 GB   28.85%   1.408 GB / 939.4 MB   7.856 MB / 0 B
> > >> > > > XXX         64.36%   619.4 MB / 2.147 GB   28.84%   1.211 GB / 807.6 MB   7.856 MB / 0 B
> > >> > > >
> > >> > > >
> > >> > > > Airflow version 1.7.0
> > >> > > >
> > >> > > > Airflow config:
> > >> > > >
> > >> > > > sql_alchemy_pool_size = 5
> > >> > > > sql_alchemy_pool_recycle = 3600
> > >> > > > parallelism = 8
> > >> > > > dag_concurrency = 8
> > >> > > > max_active_runs_per_dag = 8
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Harish
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
