airflow-dev mailing list archives

From Jason Chen <chingchien.c...@gmail.com>
Subject Re: High load in CPU of MySQL when running airflow
Date Wed, 08 Mar 2017 01:39:11 GMT
Hi Bolke,

 Thanks, but it looks like you are actually talking about Harish's use case.

 My use case is about 50 dags (each one with about 2-3 tasks). I suspect our
run interval settings for the dags are too short (~15 min), which may result
in high CPU load on MySQL.

 Meanwhile, I dug into MySQL and noticed a frequently running SQL statement,
shown below. It runs without a proper index on the column task_instance.state.

Shouldn't we index "state", given that there could be millions of rows in
task_instance?

SQL Statement:
"SELECT task_instance.task_id AS task_instance_task_id,
task_instance.dag_id AS task_instance_dag_id,.... FROM task_instance WHERE
task_instance.state = 'queued'"
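For illustration, here is a minimal sketch of what such an index changes, using an in-memory SQLite database as a stand-in for the MySQL metadata DB (the index name `ti_state` is my own choice, not necessarily what an Airflow migration would use):

```python
import sqlite3

# In-memory stand-in for the Airflow metadata DB (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [("task_%d" % i, "some_dag", "queued" if i % 100 == 0 else "success")
     for i in range(10_000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether the query scans the whole
    # table or resolves the predicate via an index.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT task_id, dag_id FROM task_instance WHERE state = 'queued'"
plan_before = plan(query)   # e.g. a full table scan
print(plan_before)

# Hypothetical index name; pick whatever fits your naming convention.
conn.execute("CREATE INDEX ti_state ON task_instance (state)")
plan_after = plan(query)    # the predicate is now answered via the index
print(plan_after)
```

The same idea applies to MySQL (`ALTER TABLE task_instance ADD INDEX (state)`), where the scheduler's polling query would stop doing a full table scan on every heartbeat.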

Also, is it possible to clean out some "unneeded" entries in the tables
(say, task_instance)? I mean, for example, removing task states older
than 6 months?
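A minimal sketch of that kind of retention cleanup, again using SQLite as a stand-in (the `execution_date` column does exist in Airflow's task_instance table, but run deletes like this against a real metadata DB only after backing it up and checking for dependent rows):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, "
    "state TEXT, execution_date TEXT)"
)
now = datetime(2017, 3, 8)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        # One run well past the retention window, one recent run.
        ("old_task", "some_dag", "success",
         (now - timedelta(days=300)).isoformat()),
        ("new_task", "some_dag", "success",
         (now - timedelta(days=30)).isoformat()),
    ],
)

# Remove task instances whose execution_date is older than ~6 months.
cutoff = (now - timedelta(days=180)).isoformat()
deleted = conn.execute(
    "DELETE FROM task_instance WHERE execution_date < ?", (cutoff,)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]
print(deleted, remaining)  # one old row deleted, one recent row kept
```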

Feedback is welcome.

Thanks.

-Jason



On Tue, Mar 7, 2017 at 11:45 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:

> Hi Jason
>
> I think you need to back it up with more numbers. You assume that a load
> of 100% is bad and also that 16GB of mem is a lot.
>
> 30x25 = 750 tasks per hour = 12.5 tasks per minute. For every task we
> launch a couple of processes (at least 2) that do not share memory; this is
> to ensure tasks cannot hurt each other. Curl tasks are probably launched
> using a BashOperator, which means another process. Curl is itself another
> process. So that is 4 processes per task, none of which share memory. Curl
> can consume memory itself as well. You probably have peak times and
> longer-running tasks, so the load is not evenly spread, and then it starts
> adding up quickly.
>
> Bolke.
>
>
> > On 7 Mar 2017, at 19:41, Jason Chen <chingchien.chen@gmail.com> wrote:
> >
> > Hi Harish,
> > Thanks for the fast response and feedback.
> > Yeah, I want to see the fix or more discussion !
> >
> > BTW, I assume that, given your 30 dags, airflow runs fine after you
> > increased the heartbeat?
> > The default is 5 secs.
> >
> >
> > Thanks.
> > Jason
> >
> >
> > On Tue, Mar 7, 2017 at 10:24 AM, harish singh <harish.singh22@gmail.com>
> > wrote:
> >
> >> I had seen similar behavior a year ago, when we had fewer than 5 dags.
> >> Even then the CPU utilization was reaching 100%.
> >> One way to deal with this is to play with the "heartbeat" numbers (i.e.
> >> increase the heartbeat).
> >> But then you are introducing more delay before starting jobs that are
> >> ready to run (ready to be queued -> queued -> run).
> >>
> >> Right now, we have more than 30 dags (each with ~20-25 tasks) that run
> >> every hour.
> >> We are giving airflow about 5-6 cores (which still seems too little for
> >> airflow).
> >> Also, with so many tasks every hour, our memory consumption is over 16G.
> >> All our tasks are basically doing "curl", so 16G seems too high.
> >>
> >> Having said that, I remember reading somewhere that there was a fix
> coming
> >> for this.
> >> If not, I would definitely want to see more discussion on this.
> >>
> >> Thanks for opening this. I would love to hear on how people are working
> >> around this.
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Mar 7, 2017 at 9:42 AM, Jason Chen <chingchien.chen@gmail.com>
> >> wrote:
> >>
> >>> Hi  team,
> >>>
> >>> We are using airflow v1.7.1.3 and schedule about 50 dags (each dag
> >>> runs at intervals of roughly 10 minutes to one hour). We use the
> >>> LocalExecutor.
> >>>
> >>> Recently, we noticed the RDS instance (MySQL 5.6.x on AWS) running
> >>> with ~100% CPU.
> >>> I am wondering if the airflow scheduler and webserver can cause a
> >>> high CPU load on MySQL, given ~50 dags?
> >>> I feel MySQL should be under light load.
> >>>
> >>> Thanks.
> >>> -Jason
> >>>
> >>
>
>
