airflow-dev mailing list archives

From Andrew Maguire <andrewm4...@gmail.com>
Subject Re: Scheduler Keeps Crashing
Date Wed, 25 Oct 2017 19:24:32 GMT
Yeah, that makes sense. I think in my rush to upgrade to 1.9 I at some
stage ran "airflow upgradedb", it appeared to do nothing, and then I killed
it. I think I also ran it while the scheduler was still active, reading and
writing to the database. It was also a day after I upgraded with pip that I
realised I needed to run upgradedb, so all in all I think I messed up the
upgrade.

I have done an uninstall and reinstall to 1.8.2 and just pointed it at a
new database :) seems to be ok now.

I think I'll need to be a bit more careful when upgrading, especially when
the upgrade requires a database change.

Will probably stay on 1.8.2 for now as we are hoping to get a data engineer
in to help with all this.

Cheers,
Andy


On Wed, Oct 25, 2017 at 8:11 PM Joy Gao <joyg@wepay.com> wrote:

> I've had something similar happening to me in the past (`airflow upgradedb`
> stuck forever waiting for a lock on a model). Because some migration
> scripts require querying the db, it got stuck waiting for the lock on a
> table.
>
> For me it had to do with calling a model's class method in a dag file,
> which threw an error after an ORM query, and did not properly release the
> lock on the table. As Bolke mentioned, I was able to track down the exact
> locking table/query by using `show engine innodb status;`.
>
> Hope it helps!
>
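
A minimal sketch of the failure mode described above: a query runs, an error is raised afterwards, and the transaction is never rolled back, so its locks stay held. This is an illustration under stated assumptions, not Airflow's own code. It uses the standard-library `sqlite3` module as a stand-in for MySQL, and the helper name `transaction_scope` and the two-column demo table are hypothetical. A SQLAlchemy session exposes the same `commit()`, `rollback()` and `close()` methods, so the same shape should carry over.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def transaction_scope(connect):
    """Open a connection, commit on success, roll back on error, always close.

    `connect` is any zero-argument callable returning a DB-API connection.
    (Hypothetical helper for illustration only.)
    """
    conn = connect()
    try:
        yield conn
        conn.commit()
    except Exception:
        # Rolling back ends the failed transaction and releases its locks.
        conn.rollback()
        raise
    finally:
        conn.close()


# Demo database: a throwaway SQLite file standing in for the Airflow metadata DB.
path = os.path.join(tempfile.mkdtemp(), "demo.db")


def connect():
    # Short timeout so a held lock would surface quickly as an error.
    return sqlite3.connect(path, timeout=0.1)


with transaction_scope(connect) as conn:
    conn.execute("CREATE TABLE task_instance (task_id TEXT, state TEXT)")

# Simulate code that fails after it has already started a write transaction.
try:
    with transaction_scope(connect) as conn:
        conn.execute("INSERT INTO task_instance VALUES ('t1', 'running')")
        raise RuntimeError("simulated failure after a query")
except RuntimeError:
    pass  # the helper has already rolled back and closed the connection

# A second writer is not blocked, because the failed transaction was rolled back.
with transaction_scope(connect) as conn:
    conn.execute("INSERT INTO task_instance VALUES ('t2', 'queued')")

with transaction_scope(connect) as conn:
    rows = conn.execute("SELECT task_id FROM task_instance").fetchall()

print(rows)
```

Only the second insert is persisted; the first one is discarded by the rollback rather than left open holding a lock.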
> On Wed, Oct 25, 2017 at 4:41 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>
> > Hi Andrew,
> >
> > Something is indeed keeping a lock on the database and not freeing it. If
> > it crashes, can you find out exactly what is being locked by following the
> > guidelines in this post:
> > https://serverfault.com/questions/36260/how-can-i-show-mysql-locks
> >
> > As locks are typically very short (ms), I am wondering if you are sure
> > that you are running just *one* scheduler and not trying to start multiple
> > at the same time? Another option is that one of the tasks is misbehaving.
> >
> > Having the process list and associated PIDs (and programs) is key in
> > tracking this down.
> >
> > Bolke
> >
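
A rough sketch of the "just one scheduler" check suggested above, working from a process list with PIDs. It is an assumption-laden illustration: the function names are hypothetical, the input is assumed to look like the output of `ps -eo pid,ppid,args` on Linux, and a process is treated as a scheduler when `airflow` is immediately followed by `scheduler` on its command line. Because forked worker processes can show the same command line as their parent, matches whose parent is also a match are treated as children rather than as separate schedulers.

```python
import subprocess


def find_root_schedulers(ps_lines):
    """Return sorted (pid, command) pairs for top-level scheduler processes.

    ps_lines: iterable of strings shaped like `ps -eo pid,ppid,args` output.
    """
    candidates = {}
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip the header row and malformed lines
        pid, ppid, command = int(parts[0]), int(parts[1]), parts[2].strip()
        tokens = command.split()
        # Look for "... airflow scheduler ..." as consecutive arguments.
        for first, second in zip(tokens, tokens[1:]):
            if first.rsplit("/", 1)[-1] == "airflow" and second == "scheduler":
                candidates[pid] = (ppid, command)
                break
    # Drop matches whose parent is itself a match (forked children).
    return sorted(
        (pid, command)
        for pid, (ppid, command) in candidates.items()
        if ppid not in candidates
    )


def read_process_list():
    """Return the lines of `ps -eo pid,ppid,args` (Linux procps syntax)."""
    return subprocess.check_output(["ps", "-eo", "pid,ppid,args"]).decode().splitlines()


# Made-up sample data: two independent schedulers, one forked child, one webserver.
sample = [
    "  PID  PPID COMMAND",
    "  101     1 /usr/bin/python /usr/local/bin/airflow scheduler",
    "  102   101 /usr/bin/python /usr/local/bin/airflow scheduler",
    "  250     1 airflow scheduler",
    "  300     1 /usr/bin/python /usr/local/bin/airflow webserver",
]
roots = find_root_schedulers(sample)
for pid, command in roots:
    print(pid, command)
```

On a live host, `find_root_schedulers(read_process_list())` returning more than one entry would suggest that more than one scheduler has been started.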
> >
> > > On 25 Oct 2017, at 12:42, Andrew Maguire <andrewm4894@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I'm not really sure exactly what is causing this, but I'm noticing lots
> > > of dead periods on our airflow server. Like this:
> > >
> > >
> > >
> > > And it corresponds with no work actually being done if I look at the
> > > database:
> > >
> > >
> > > When i look in supervisor i typically see:
> > >
> > >
> > > So the airflow scheduler seems to keep being restarted by supervisor.
> > >
> > > Then if i look in a recent stderr log from the scheduler i see:
> > > Traceback (most recent call last):
> > >   File "/usr/local/bin/airflow", line 27, in <module>
> > >     args.func(args)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 828, in scheduler
> > >     job.run()
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 197, in run
> > >     self._execute()
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1496, in _execute
> > >     self._execute_helper(processor_manager)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 1541, in _execute_helper
> > >     self.reset_state_for_orphaned_tasks(session=session)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
> > >     result = func(*args, **kwargs)
> > >   File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 265, in reset_state_for_orphaned_tasks
> > >     .filter(or_(*filter_for_tis), TI.state.in_(resettable_states))
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2679, in all
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2831, in __iter__
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/orm/query.py", line 2854, in _execute_and_instances
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 945, in execute
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1189, in _execute_context
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1393, in _handle_dbapi_exception
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/util/compat.py", line 203, in raise_from_cause
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/base.py", line 1182, in _execute_context
> > >   File "build/bdist.linux-x86_64/egg/sqlalchemy/engine/default.py", line 470, in do_execute
> > >   File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 205, in execute
> > >     self.errorhandler(self, exc, value)
> > >   File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
> > >     raise errorclass, errorvalue
> > > sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'SELECT task_instance.task_id AS task_instance_task_id, ...
> > > I'm not really sure what might be causing this or how to go about
> > > debugging it.
> > > I have been struggling through it by having a crontab do:
> > > # reboot system a few times a day
> > > 45 1,13 * * * /sbin/shutdown -r +5
> > > So rebooting the system a few times a day, and then it typically comes
> > > back to life and sorts itself out (if that's a technical term :) )
> > > So I'm now going to do the reboot every 4 hours, as I've been caught a
> > > few times when it goes into an error state a good while before the next
> > > reboot from cron.
> > > However, I know this is a really horrible solution and I really should
> > > try to figure out what the root cause is and find a better solution.
> > > So I'm wondering if you might have any ideas or suggestions as to what
> > > might be going on?
> > > I'm running airflow 1.9.0alpha0 on a single Google Compute instance
> > > (Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-97-generic x86_64) - 24 vCPUs, 100 GB
> > > memory) with LocalExecutor, as pretty much all our dags are just sequences
> > > of bigquery operators.
> > > Also using a Google Cloud MySQL database.
> > > It looks like some sort of lock when trying to update the database, but
> > > I'm not sure if that's just a symptom or more closely related to the cause.
> > > Any help is much appreciated as I'm a data scientist and not a data
> > > engineer, so admittedly a bit (quite) out of my depth here.
> > > Cheers,
> > > Andy
> >
> >
>
