airflow-dev mailing list archives

From Emmanuel Brard <emmanuel.br...@getyourguide.com>
Subject Re: airflow 1.9 tasks randomly failing | k8 - hive
Date Mon, 10 Sep 2018 07:57:09 GMT
Hello,

You can see zombie tasks being killed in the scheduler logs (text file),
though I think you need the 'INFO' log level. Detection is based on the
time elapsed since the last heartbeat recorded in the job table.
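
To illustrate the heartbeat comparison, here is a minimal sketch in plain
Python. It is not Airflow's actual code: the function name, the dict of
job ids to heartbeat timestamps, and the 300 second threshold are all
assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_zombies(last_heartbeats, now, threshold_seconds=300):
    """Return the ids of jobs whose last heartbeat is older than the threshold.

    last_heartbeats: dict mapping a job id to its last heartbeat (datetime).
    This stands in for rows of the job table; the shape is hypothetical.
    threshold_seconds: assumed value, chosen only for illustration.
    """
    limit = now - timedelta(seconds=threshold_seconds)
    # A job counts as a zombie when its heartbeat is strictly older than the limit.
    return sorted(job_id for job_id, beat in last_heartbeats.items() if beat < limit)


# Hypothetical data: job 101 reported 30 seconds ago, job 102 ten minutes ago.
now = datetime(2018, 9, 10, 8, 0, 0)
heartbeats = {
    101: now - timedelta(seconds=30),
    102: now - timedelta(minutes=10),
}
zombies = find_zombies(heartbeats, now)
```

With this sample data only job 102 is flagged, since its heartbeat is
older than the assumed threshold. A task whose process was killed without
Airflow noticing would stop heartbeating and show up the same way.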

Best,
E

On Fri, Aug 24, 2018 at 4:36 PM Dubey, Somesh <somesh.dubey@nordstrom.com>
wrote:

> Thanks so much Emmanuel for the reply.
> How did Airflow classify those processes as zombies?
> Did any log say that?
> We are not able to get any good logs of the failing tasks.
>
> Thanks so much. You are potentially saving hours of work/sleepless nights
> as things are breaking production for us.
> Somesh
>
> -----Original Message-----
> From: Emmanuel Brard [mailto:emmanuel.brard@getyourguide.com]
> Sent: Friday, August 24, 2018 6:48 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: airflow 1.9 tasks randomly failing | k8 - hive
>
> Hey,
>
> We have a similar setup with Airflow 1.9 on Kubernetes with the Celery
> executor.
>
> We saw some Airflow tasks being killed inside the container because of
> cgroup limits (set by Kubernetes and pushed to the Docker daemon), while
> Airflow itself (the airflow celery command) was not killed. The scheduler
> then identified those tasks as zombies and set them to failed. We decreased
> the number of Celery slots on each worker to reduce memory pressure, and
> this stopped. It happened mostly with one DAG which had simple operators
> but was quite big, with subdags involved.
>
> Maybe you are facing the same issue?
>
> Bye,
> E
>
> On Fri, Aug 24, 2018 at 3:40 PM Dubey, Somesh <somesh.dubey@nordstrom.com>
> wrote:
>
> > We are on Airflow 1.9 on Kubernetes. We have many Hive DAGs which we
> > call via JDBC (via a REST endpoint), and tasks fail randomly. One day
> > one would fail and the next day some other.
> > The pods are all healthy, there is no node eviction or other issue, and
> > retries typically work. Also, even when a task fails, it completes
> > successfully on the Hive side.
> > We do not get good logs of the failure and typically get partial SQL
> > in sysout in the task log with no error message. Log level increased to
> > 5 in the JDBC connection.
> > This only happens in production.
> >
> > We have not been able to reproduce the issue in dev.
> > The closest we have gotten in dev to replicating similar behavior is to
> > kill the pid (manually killing airflow run --raw) on the worker node.
> > In that case too, things run fine on the Hive side but the task fails in
> > Airflow. A similar task log with no error is produced.
> >
> > Have you seen this kind of behavior? Any help is much appreciated.
> >
> > Thanks so much,
> > Somesh
> >
> >
> >
> >
> >
>
> --
>
>
>
>
>
>
>
>
> GetYourGuide AG
>
> Stampfenbachstrasse 48
>
> 8006 Zürich
>
> Switzerland
>
>
>
>  <https://www.facebook.com/GetYourGuide>
> <https://twitter.com/GetYourGuide>
> <https://www.instagram.com/getyourguide/>
> <https://www.linkedin.com/company/getyourguide-ag>
> <http://www.getyourguide.com>
>
>
>
>
>
>
>
>
