airflow-dev mailing list archives

From Alex Van Boxel <a...@vanboxel.be>
Subject Re: Airflow 1.8.0 Alpha 1
Date Wed, 04 Jan 2017 09:31:03 GMT
Report of last night's test run of master (note that I patched master so
that the duplicate process killer doesn't actually kill the process).

I notice that on a *Celery* worker, after *exactly 1 hour*, a new process
gets started and everything gets confused. Also note that it *doesn't
happen with the local runner*.

For now, my plan is to:
- enhance logging to log to Stackdriver and add extra logging information
to troubleshoot (private branch for now)
- dig deeper into the scheduler/worker
- my hunch is that the worker starts some process after an hour and starts
up a new task (*does anyone have an idea?!*) or the scheduler thinks the
sensor is dead after one hour...

Here are log extracts:

[2017-01-04 00:00:10,172] {models.py:168} INFO - Filling up the DagBag from
/home/airflow/dags/user_product_interaction.py
[2017-01-04 00:00:11,500] {jobs.py:2012} INFO - Subprocess PID is 87
[2017-01-04 00:00:15,474] {models.py:168} INFO - Filling up the DagBag from
/home/airflow/dags/user_product_interaction.py
[2017-01-04 00:00:17,088] {models.py:1062} INFO - Dependencies all met for
<TaskInstance: user-product-interactions.wait-for-orders 2017-01-03
00:00:00 [queued]>
[2017-01-04 00:00:17,126] {models.py:1062} INFO - Dependencies all met for
<TaskInstance: user-product-interactions.wait-for-orders 2017-01-03
00:00:00 [queued]>
[2017-01-04 00:00:17,127] {models.py:1250} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------

[2017-01-04 00:00:17,205] {models.py:1273} INFO - Executing
<Task(GoogleCloudStorageObjectSensor): wait-for-orders> on 2017-01-03
00:00:00

Exactly 1 hour later, with lots of messages in between:

[2017-01-04 01:00:42,077] {transport.py:151} INFO - Attempting refresh to
obtain initial access_token
[2017-01-04 01:00:42,126] {client.py:795} INFO - Refreshing access_token
[2017-01-04 01:01:26,425] {models.py:168} INFO - Filling up the DagBag from
/home/airflow/dags/user_product_interaction.py
[2017-01-04 01:01:28,620] {jobs.py:2012} INFO - Subprocess PID is 244
[2017-01-04 01:01:32,663] {models.py:168} INFO - Filling up the DagBag from
/home/airflow/dags/user_product_interaction.py
[2017-01-04 01:01:33,527] {jobs.py:2081} WARNING - Recorded hostname and
pid of airflow-worker-1705741-9ncug and 244 do not match this instance's
which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So
long.
[2017-01-04 01:01:35,134] {models.py:1059} WARNING - Dependencies not met
for <TaskInstance: user-product-interactions.wait-for-orders 2017-01-03
00:00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED:
Task is already running, it started on 2017-01-04 00:00:17.088903.
[2017-01-04 01:01:38,592] {jobs.py:2081} WARNING - Recorded hostname and
pid of airflow-worker-1705741-9ncug and 244 do not match this instance's
which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So
long.
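
To put a number on the gap, here is a minimal sketch that pulls the
`Subprocess PID is ...` records out of a task log and prints the time between
the two process starts. It assumes each record has been joined back onto a
single line (the extracts above are wrapped by the mail client). The regular
expression and the helper name `subprocess_starts` are my own and are not
part of Airflow.

```python
import re
from datetime import datetime

# Matches records such as:
# [2017-01-04 00:00:11,500] {jobs.py:2012} INFO - Subprocess PID is 87
PID_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] "
    r"\{[^}]+\} \w+ - Subprocess PID is (?P<pid>\d+)$"
)


def subprocess_starts(lines):
    """Return (timestamp, pid) for every 'Subprocess PID is N' record."""
    starts = []
    for line in lines:
        match = PID_LINE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            starts.append((ts, int(match.group("pid"))))
    return starts


# Three records taken from the extracts above, unwrapped to one line each.
log = [
    "[2017-01-04 00:00:11,500] {jobs.py:2012} INFO - Subprocess PID is 87",
    "[2017-01-04 01:00:42,126] {client.py:795} INFO - Refreshing access_token",
    "[2017-01-04 01:01:28,620] {jobs.py:2012} INFO - Subprocess PID is 244",
]

starts = subprocess_starts(log)
(first_ts, first_pid), (second_ts, second_pid) = starts
gap = second_ts - first_ts
print(f"PID {first_pid} -> PID {second_pid}: {gap}")
```

Running the same helper over the full log should show whether the second
process always appears at the same offset from the first.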

And probably from the other process that starts up:

[2017-01-04 01:01:42,393] {gcp_api_base_hook.py:81} INFO - Getting
connection using a JSON key file.
[2017-01-04 01:01:42,417] {discovery.py:852} INFO - URL being requested:
GET
https://www.googleapis.com/storage/v1/b/vex-eu-data/o/datasets%2Fmarker%2Fexport%2F2017%2F01%2F04%2F_orders20170101?alt=json
[2017-01-04 01:01:42,417] {transport.py:151} INFO - Attempting refresh to
obtain initial access_token
[2017-01-04 01:01:42,465] {client.py:795} INFO - Refreshing access_token
[2017-01-04 01:01:43,602] {jobs.py:2081} WARNING - Recorded hostname and
pid of airflow-worker-1705741-9ncug and 244 do not match this instance's
which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So
long.
[2017-01-04 01:01:48,628] {jobs.py:2081} WARNING - Recorded hostname and
pid of airflow-worker-1705741-9ncug and 244 do not match this instance's
which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So
long.
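
For anyone not familiar with the warning above: the message implies a
comparison between the hostname and PID recorded for the task instance and
those of the process doing the check. A minimal sketch of that comparison,
based only on my reading of the log message (the `Identity` type and the
`should_take_poison_pill` function are hypothetical names, not the code in
jobs.py):

```python
from typing import NamedTuple


class Identity(NamedTuple):
    """Hostname and PID of a process claiming a task instance (hypothetical)."""
    hostname: str
    pid: int


def should_take_poison_pill(recorded: Identity, current: Identity) -> bool:
    """Return True when the recorded owner is not the process doing the check."""
    return recorded != current


# Values taken from the warnings above: the record points at PID 244,
# while the process that logs the warning is PID 87 on the same host.
recorded = Identity("airflow-worker-1705741-9ncug", 244)
first_process = Identity("airflow-worker-1705741-9ncug", 87)

if should_take_poison_pill(recorded, first_process):
    print("Recorded hostname and pid do not match this instance's.")
```

With these values the check fires for the original process (PID 87), which
matches the repeated warnings: the hostname is identical and only the PID
differs.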


As a reference, here is the full log (note that it is confusing, probably
because logs for different processes are appended and uploaded to Cloud
Storage):
https://storage.googleapis.com/vex-eu-data/airflow/default/logs/user-product-interactions/wait-for-orders/2017-01-03T00%3A00%3A00






On Tue, Jan 3, 2017 at 8:34 PM Chris Riccomini <criccomini@apache.org>
wrote:

> Hey Bolke,
>
> Thanks for taking this on. I'm definitely up for running stuff in our
> environments to verify everything is working.
>
> Can I ask that you create a 1.8 alpha 1 branch in the git repo? This will
> make it easier for us to track what changes are getting cherry picked into
> the branch, and will also make it easier for users to pip install, if they
> want to do so via github.
>
> Also, yea, when we switch to beta, we need to stop merging anything other
> than bug fixes into the release branch.
>
> Cheers,
> Chris
>
> On Tue, Jan 3, 2017 at 10:31 AM, Dan Davydov <dan.davydov@airbnb.com.invalid> wrote:
>
> > All very reasonable to me, one reason we may not have hit the bugs in our
> > production is because we are running off a different merge base and our
> > cherries aren't 1-1 with what we are running in production (we still test
> > them but we can't run them in production), that being said I don't think I
> > authored the commits you are referring to so I don't have full context.
> >
> > On Tue, Jan 3, 2017 at 1:27 PM, Bolke de Bruin <bdbruin@gmail.com> wrote:
> >
> > > Hi Dan et al,
> > >
> > > That sounds good to me, however I will be pretty critical of the changes
> > > in the scheduler and the cleanliness of the patches. This is due to the
> > > fact I have been chasing quite some bugs in master that were pretty hard to
> > > track down even with a debugger at hand. I’m surprised that those didn’t
> > > pop up in your production or maybe I am concerned ;-). Anyways, I hope you
> > > understand I might be a bit picky in understanding and needing (design)
> > > documentation for some of the changes.
> > >
> > > What I would like to suggest is that for the Alpha versions we still
> > > accept “new” features so these PRs can get in, but from Beta we will not
> > > accept new features anymore. For new features in the area of the scheduler
> > > an integration DummyDag should be supplied, so others can test the
> > > behaviour. Does this sound ok?
> > >
> > > My list of open code items for a release looks now like this:
> > >
> > > Blockers
> > > * one_failed not honoured
> > > * Alex’s sensor issue
> > >
> > > New features:
> > > * Schedule all pending DAGs in a single loop
> > > * Add support for backfill true/false
> > > * Impersonation
> > > * CGroups
> > > * Add Cloud Storage updated sensor
> > >
> > > Alpha2 I will package tomorrow. Packages are signed now by my apache.org
> > > key. Please verify and let me know if something is
> > > off. I’m still waiting for access to the incubating dist repository.
> > >
> > > Bolke
> > >
> > >
> > > > On 3 Jan 2017, at 14:38, Dan Davydov <dan.davydov@airbnb.com.INVALID> wrote:
> > > >
> > > > I have also started on this effort, recently Alex Guziel and I have been
> > > > pushing Airbnb's custom cherries onto master to get Airbnb back onto master
> > > > in order for us to do a release.
> > > >
> > > > I think it might make sense to wait for these two commits to get merged in
> > > > since they would be quite nice to have for all Airflow users and seem like
> > > > they will be merged soon:
> > > > Schedule all pending DAG runs in a single scheduler loop -
> > > > https://github.com/apache/incubator-airflow/pull/1906
> > > > Add Support for dag.backfill=(True|False) Option -
> > > > https://github.com/apache/incubator-airflow/pull/1830
> > > > Impersonation Support + Cgroups -
> > > > https://github.com/apache/incubator-airflow/pull/1934 (this is kind of
> > > > important from the Airbnb side so that we can help test the new master
> > > > without having to cherrypick this PR on top of it which would make the
> > > > testing unreliable for others).
> > > >
> > > > If there are PRs that affect the core of Airflow that other committers
> > > > think are important to merge we could include these too. I can commit to
> > > > pushing out the Impersonation/Cgroups PR this week pending PR comments.
> > > > What do you think Bolke?
> > > >
> > > > On Tue, Jan 3, 2017 at 4:26 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
> > > >
> > > >> Hey Alex,
> > > >>
> > > >> I have noticed the same, and it is also the reason why we have Alpha
> > > >> versions. For now I have noticed the following:
> > > >>
> > > >> * Tasks can get in limbo between scheduler and executor:
> > > >> https://github.com/apache/incubator-airflow/pull/1948
> > > >> * Try_number not increased due to reset in LocalTaskJob:
> > > >> https://github.com/apache/incubator-airflow/pull/1969
> > > >> * one_failed trigger not executed
> > > >>
> > > >> My idea is to move to a Samba style of releases eventually, but for now I
> > > >> would like to get master into a state that we understand and therefore not
> > > >> accept any patches that do not address any bugs.
> > > >>
> > > >> If you (or anyone else) can review the above PRs and add your own as well
> > > >> then I can create another Alpha version. I’ll be on gitter as much as I can
> > > >> so we can speed up if needed.
> > > >>
> > > >> - Bolke
> > > >>
> > > >>> On 3 Jan 2017, at 08:51, Alex Van Boxel <alex@vanboxel.be> wrote:
> > > >>>
> > > >>> Hey Bolke,
> > > >>>
> > > >>> thanks for getting this moving. But I already have some blockers, since I
> > > >>> moved up master to this release (moved from end November to now) stability
> > > >>> has gone down (certainly on Celery). I'm trying to identify the core
> > > >>> problems and see if I can fix them.
> > > >>>
> > > >>> On Sat, Dec 31, 2016 at 9:52 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> > > >>>
> > > >>> Dear All,
> > > >>>
> > > >>> On the verge of the New Year, I decided to be a little bit cheeky and to
> > > >>> make available an Airflow 1.8.0 Alpha 1. We have been talking about it for
> > > >>> a long time now and by doing this I wanted to bootstrap the process. It should
> > > >>> by no means be considered an Apache release yet. This is for testing
> > > >>> purposes in the dev community around Airflow, nothing else.
> > > >>>
> > > >>> The build is exactly the same as the state of master (git 410736d) plus the
> > > >>> change to version “1.8.0.alpha1” in version.py.
> > > >>>
> > > >>> I am dedicating quite some time next week and beyond to get a release out.
> > > >>> Hopefully we can get some help with testing, changelog etc. To make this
> > > >>> possible I would like to propose a freeze to adding new features for at
> > > >>> least two weeks - say until Jan 15.
> > > >>>
> > > >>> You can find the tar here: http://people.apache.org/~bolke/ .
> > > >>> It isn’t signed. Following versions will be. SHA is available.
> > > >>>
> > > >>> Lastly, Alpha 1 does not have the fix for retries yet. So we will get an
> > > >>> Alpha 2 :-). @Max / @Dan / @Paul: a potential fix is in
> > > >>> https://github.com/apache/incubator-airflow/pull/1948 , but your feedback
> > > >>> is required as it is entrenched in new processing code that you are running
> > > >>> in production afaik - so I wonder what happens in your fork.
> > > >>>
> > > >>> Happy New Year!
> > > >>>
> > > >>> Bolke
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> _/
> > > >>> _/ Alex Van Boxel
> > >
> > >
> >
>
-- 
  _/
_/ Alex Van Boxel
