From: harish singh
Date: Fri, 24 Mar 2017 19:07:53 -0700
Subject: Re: Scheduler silently dies
To: dev@airflow.incubator.apache.org

We used Airflow 1.7 for over a year and never faced this issue. The
moment we switched to 1.8, I think we hit it. I say "I think" because I
am not sure it is the same issue. But whenever I restart the scheduler,
my pipeline proceeds.

Airflow 1.7:
Having said that, in 1.7 I did face a similar issue (fewer than 5 times
over a year):
- I saw that there were a lot of processes marked "" with the parent
  process being "scheduler".
- Somebody mentioned it in this JIRA:
  https://issues.apache.org/jira/browse/AIRFLOW-401
- Workaround: restart the scheduler.

Airflow 1.8:
The issue in 1.8 may be different than the issue in 1.7. But again, the
issue gets resolved and the pipeline progresses on a SCHEDULER RESTART.

In case it helps, this is the trace in 1.8:

[2017-03-22 19:35:16,332] {models.py:167} INFO - Filling up the DagBag from /usr/local/airflow/pipeline/pipeline.py
[2017-03-22 19:35:22,451] {airflow_configuration.py:40} INFO - loading setup.cfg file
[2017-03-22 19:35:51,041] {timeout.py:37} ERROR - Process timed out
[2017-03-22 19:35:51,041] {models.py:266} ERROR - Failed to import: /usr/local/airflow/pipeline/pipeline.py
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
    m = imp.load_source(mod_name, filepath)
  File "/usr/local/airflow/pipeline/pipeline.py", line 167, in
    create_tasks(dbguid, version, dag, override_start_date)
  File "/usr/local/airflow/pipeline/pipeline.py", line 104, in create_tasks
    t = create_task(dbguid, dag, taskInfo, version, override_date)
  File "/usr/local/airflow/pipeline/pipeline.py", line 85, in create_task
    retries, 1, depends_on_past, version, override_dag_date)
  File "/usr/local/airflow/pipeline/dags/base_pipeline.py", line 90, in create_python_operator
    depends_on_past=depends_on_past)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 86, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/operators/python_operator.py", line 65, in __init__
    super(PythonOperator, self).__init__(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 70, in wrapper
    sig = signature(func)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 105, in signature
    return Signature.from_function(obj)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 594, in from_function
    __validate_parameters__=False)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 518, in __init__
    for param in parameters))
  File "/usr/lib/python2.7/collections.py", line 52, in __init__
    self.__update(*args, **kwds)
  File "/usr/lib/python2.7/_abcoll.py", line 548, in update
    self[key] = value
  File "/usr/lib/python2.7/collections.py", line 61, in __setitem__
    last[1] = root[0] = self.__map[key] = [last, root, key]
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/timeout.py", line 38, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
AirflowTaskTimeout: Timeout

On Fri, Mar 24, 2017 at 5:45 PM, Bolke de Bruin wrote:

> We are running *without* num runs for over a year (and never have).
> It is a very elusive issue which has not been reproducible.
>
> I like more info on this but it needs to be very elaborate even to
> the point of access to the system exposing the behavior.
>
> Bolke
>
> Sent from my iPhone
>
> > On 24 Mar 2017, at 16:04, Vijay Ramesh wrote:
> >
> > We literally have a cron job that restarts the scheduler every 30
> > min. Num runs didn't work consistently in rc4, sometimes it would
> > restart itself and sometimes we'd end up with a few zombie
> > scheduler processes and things would get stuck. Also running
> > locally, without celery.
> >
> >> On Mar 24, 2017 16:02, wrote:
> >>
> >> We have max runs set and still hit this. Our solution is dumber:
> >> monitoring log output, and kill the scheduler if it stops
> >> emitting. Works like a charm.
> >>
> >>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu wrote:
> >>>
> >>> Some solutions to this problem is restarting the scheduler
> >>> frequently or some sort of monitoring on the scheduler. We have
> >>> set up a dag that pings cronitor (a dead man's snitch type of
> >>> service) every 10 minutes and the snitch pages you when the
> >>> scheduler dies and does not send a ping to it.
> >>>
> >>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips
> >>> <aphillips@qrmedia.com> wrote:
> >>>
> >>>> We use celery and run into it from time to time.
> >>>>>
> >>>>
> >>>> Bang goes my theory ;-) At least, assuming it's the same
> >>>> underlying cause...
> >>>>
> >>>> Regards
> >>>>
> >>>> ap
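
For anyone who wants to try the "monitor log output and kill the
scheduler if it stops emitting" workaround quoted above, here is a
minimal sketch of the idea. It is not what any poster in this thread
actually runs. Assumptions: Python 3, the scheduler is started with
`airflow scheduler`, and its output lands in a single log file whose
modification time advances while the scheduler is healthy. The function
names, the log path, and the thresholds are all hypothetical.

```python
import os
import subprocess
import time


def is_silent(log_path, started_at, max_silence_seconds, now=None):
    """Return True if neither a log write nor a (re)start happened recently.

    A missing log file counts as "never written", so only the start time
    protects a freshly launched scheduler from being restarted again.
    """
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        last_write = 0.0
    last_activity = max(last_write, started_at)
    return (now - last_activity) > max_silence_seconds


def restart_scheduler(proc, command, grace_seconds=30):
    """Stop the old process (SIGTERM, then SIGKILL) and start a new one."""
    if proc is not None and proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return subprocess.Popen(command)


def watch(log_path, command, max_silence_seconds=300, poll_seconds=30):
    """Run `command` forever, restarting it when it exits or goes quiet."""
    proc = subprocess.Popen(command)
    started_at = time.time()
    while True:
        time.sleep(poll_seconds)
        exited = proc.poll() is not None
        if exited or is_silent(log_path, started_at, max_silence_seconds):
            proc = restart_scheduler(proc, command)
            started_at = time.time()


# Hypothetical usage (the log path is an assumption, adjust to your setup):
# watch("/var/log/airflow/scheduler.log", ["airflow", "scheduler"])
```

Tracking the start time alongside the log's modification time keeps the
watchdog from killing a scheduler it has only just restarted, before
that scheduler has had a chance to write anything.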