airflow-dev mailing list archives

From Trent Robbins <robbi...@gmail.com>
Subject Re: Getting Task Killed Externally
Date Tue, 28 Aug 2018 08:04:18 GMT
We saw the same thing: only a few truly active tasks, yet the task queue was
filling up with pending tasks.

Best,
Trent

On Tue, Aug 28, 2018 at 12:47 AM Vardan Gupta <vardanguptacse@gmail.com>
wrote:

> Hi Trent,
>
> Thanks for replying. You're suggesting we may be hitting caps, but on our
> side there are hardly any concurrent tasks, rarely 1-2 at a time, with
> parallelism set to 50. Still, we'll increase the parallelism and see if
> that solves the problem.
>
> Thanks,
> Vardan Gupta
>
> On Tue, Aug 28, 2018 at 11:17 AM Trent Robbins <robbintt@gmail.com> wrote:
>
> > Hi Vardan,
> >
> > We had this issue - I recommend increasing the parallelism config variable
> > to something like 128 or 512. I have no idea what side effects this could
> > have; so far, none. This happened to us with LocalExecutor, and our
> > monitoring showed a clear issue with hitting a cap on the number of
> > concurrent tasks. I probably should have reported it, but we still aren't
> > sure what happened and have not investigated why those tasks do not get
> > kicked back into the queue.
> >
> > You may need to increase other config variables, too, if they also cause
> > you to hit caps. Some people are conservative about these variables; if
> > you are feeling conservative, you can get better telemetry into this with
> > Prometheus and Grafana. We followed that route, but resolved to just set
> > the cap very high and deal with any side effects afterwards.
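> >
> > For reference, a minimal sketch of the airflow.cfg knobs being discussed
> > here (section names are from a stock 1.9 config; the values are
> > illustrative, not recommendations):
> >
> > ```
> > [core]
> > # Global cap on task instances running concurrently across the whole
> > # installation; this is the cap our LocalExecutor setup appeared to hit.
> > parallelism = 128
> > # Per-DAG cap on concurrently running task instances.
> > dag_concurrency = 16
> >
> > [scheduler]
> > # Optional: emit scheduler/task metrics to statsd, which a statsd
> > # exporter can surface in Prometheus/Grafana for the telemetry
> > # mentioned above.
> > statsd_on = True
> > statsd_host = localhost
> > statsd_port = 8125
> > statsd_prefix = airflow
> > ```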
> >
> > Best,
> > Trent
> >
> >
> > On Mon, Aug 27, 2018 at 21:09 vardanguptacse@gmail.com <
> > vardanguptacse@gmail.com> wrote:
> >
> > > Hi Everyone,
> > >
> > > For the last 2 weeks we have been facing an issue with a LocalExecutor
> > > setup of Airflow v1.9 (MySQL as metastore). In a DAG where retries are
> > > configured and the initial try_number fails, roughly 8 out of 10 times
> > > the task gets stuck in the up_for_retry state; in fact, no running state
> > > ever follows Scheduled>Queued in the task instance. The entry in the
> > > Job table is marked successful within a fraction of a second, and a
> > > failed entry is logged in the task_fail table without the task ever
> > > reaching the operator code. As a result, we get an email alert saying:
> > >
> > > ```
> > > Try 2 out of 4
> > > Exception:
> > > Executor reports task instance %s finished (%s) although the task says
> > > its %s. Was the task killed externally?
> > > ```
> > >
> > > But when the default value of job_heartbeat_sec is changed from 5 to 30
> > > seconds (
> > > https://groups.google.com/forum/#!topic/airbnb_airflow/hTXKFw2XFx0 ,
> > > mentioned by Max back in 2016 for healthy supervision), the issue stops
> > > arising. We are still clueless as to how this new configuration actually
> > > solved/suppressed the issue; any key information around it would really
> > > help here.
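> > >
> > > For context, a sketch of where the setting lives in airflow.cfg (the
> > > value shown is simply what we raised it to):
> > >
> > > ```
> > > [scheduler]
> > > # Frequency (in seconds) at which task instances heartbeat and listen
> > > # for an external kill signal; the default is 5. Raising it to 30
> > > # made the "killed externally" alerts stop for us.
> > > job_heartbeat_sec = 30
> > > ```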
> > >
> > > Regards,
> > > Vardan Gupta
> > >
> > --
> > (Sent from cellphone)
>
-- 
(Sent from cellphone)
