airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bolke de Bruin <bdbr...@gmail.com>
Subject Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4
Date Mon, 27 Feb 2017 22:11:30 GMT
Dan

Btw are you running with my patch for this? Or still plain rc?

Cheers
Bolke

Sent from my iPhone

> On 27 Feb 2017, at 22:46, Bolke de Bruin <bdbruin@gmail.com> wrote:
> 
> I'll have a look. I verified and the code is there to take of this. 
> 
> B. 
> 
> Sent from my iPhone
> 
>> On 27 Feb 2017, at 22:34, Dan Davydov <dan.davydov@airbnb.com.INVALID> wrote:
>> 
>> Repro steps:
>> - Create a DAG with a dummy task
>> - Let this DAG run for one dagrun
>> - Add a new subdag operator that contains a dummy operator to this DAG that
>> has depends_on_past set to true
>> - click on the white square for the new subdag operator in the DAGs first
>> dagrun
>> - Click "Zoom into subdag" (takes you to the graph view for the subdag)
>> - Click the dummy task in the graph view and click "Mark Success"
>> - Observe that the list of tasks to mark as success is empty (it should
>> contain the dummy task)
>> 
>>> On Mon, Feb 27, 2017 at 1:03 PM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>> 
>>> Dan
>>> 
>>> Can you elaborate on 2, cause I thought I specifically took care of that.
>>> 
>>> Cheers
>>> Bolke
>>> 
>>> Sent from my iPhone
>>> 
>>>> On 27 Feb 2017, at 20:27, Dan Davydov <dan.davydov@airbnb.com.INVALID>
>>> wrote:
>>>> 
>>>> I created https://issues.apache.org/jira/browse/AIRFLOW-921 to track the
>>>> pending issues.
>>>> 
>>>> There are two more issues we found which I included there:
>>>> 1. Task instances that have their state manually set to running make the
>>> UI
>>>> for their DAG unable to parse
>>>> 2. Mark success doesn't work for non existent task instances/dagruns
>>> which
>>>> breaks the subdag use case (setting tasks as successful via the graph
>>> view)
>>>> 
>>>>> On Mon, Feb 27, 2017 at 11:06 AM, Bolke de Bruin <bdbruin@gmail.com>
>>> wrote:
>>>>> 
>>>>> Hey Max
>>>>> 
>>>>> It is massive for sure. Sorry about that ;-). However it is not as
>>> massive
>>>>> as you might deduct from a first view. 0) run tasks concurrently across
>>> dag
>>>>> runs 1) ordering of the tasks was added to the loop. 2) calculating of
>>>>> deadlocks, running tasks, tasks to run was corrected, 3) relying on the
>>>>> executor for status updates was replaced, 4) (tbd) executor failure
>>> check
>>>>> to protect against endless Ioops.
>>>>> 
>>>>> 0+1 seem bigger than they are due to the amount of lines changed. 2 is
a
>>>>> subtle change, that touches a couple of lines to pop/push properly. 3)
>>> is
>>>>> bigger, as I didn't like the reliance on the executor. 4) is old code
>>> that
>>>>> needs to be added again.
>>>>> 
>>>>> I probably can leave out 3 which makes 4 mood. The change would be
>>>>> smaller. Maybe I could even completely remove 3 and just add 4. What
are
>>>>> your thoughts?
>>>>> 
>>>>> The random failures we were seeing were the "implicit" test of not a
>>>>> executing in the right order and then deadlocking. But no explicit tests
>>>>> exist. Help would definitely be appreciated.
>>>>> 
>>>>> Yes I thought about using the scheduler and/or reusing logic from the
>>>>> scheduler. I even experimented a little with it but it didn't allow me
>>> to
>>>>> pass the tests effectively.
>>>>> 
>>>>> What I am planning to do is split the function and make it unit testable
>>>>> if you agree with the current approach.
>>>>> 
>>>>> Bolke
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 27 Feb 2017, at 18:35, Maxime Beauchemin <
>>> maximebeauchemin@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> This PR is pretty massive and complex! It looks like solid work but
>>> let's
>>>>>> be really careful around testing and rolling this out.
>>>>>> 
>>>>>> This may be out of scope for this PR, but wanted to discuss the idea
of
>>>>>> using the scheduler's logic to perform backfills. It'd be nice to
have
>>>>> that
>>>>>> logic in one place, though I lost grasp on the details around
>>> feasibility
>>>>>> around this approach. I'm sure you looked into this option before
>>> issuing
>>>>>> this PR and I'm curious to hear your thoughts on blockers/challenges
>>>>> around
>>>>>> this alternate approach.
>>>>>> 
>>>>>> Also I'm wondering whether we have any sort of mechanisms in our
>>>>>> integration test to validate that task dependencies are respected
and
>>> run
>>>>>> in the right order. If not I was thinking we could build some
>>> abstraction
>>>>>> to make it easy to write this type of tests in an expressive way.
>>>>>> 
>>>>>> ```
>>>>>> #[some code to run a backfill, or a scheduler session]
>>>>>> it = IntegrationTestResults(dag_id='exmaple1')
>>>>>> assert it.ran_before('task1', 'task_2')
>>>>>> assert ti.overlapped('task1', 'task_3') # confirms 2 tasks ran in
>>>>> parallel
>>>>>> assert ti.none_failed()
>>>>>> assert ti.ran_last('root')
>>>>>> assert ti.max_concurrency_reached() == POOL_LIMIT
>>>>>> ```
>>>>>> 
>>>>>> Max
>>>>>> 
>>>>>>> On Mon, Feb 27, 2017 at 5:41 AM, Bolke de Bruin <bdbruin@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>> I have worked in the Backfill issue also in collaboration with
>>> Jeremiah.
>>>>>>> 
>>>>>>> The refactor to use dag runs in backfills caused a regression
>>>>>>> in task execution performance as dag runs were executed
>>>>>>> sequentially. Next to that, the backfills were non deterministic
>>>>>>> due to the random execution of tasks, causing root tasks
>>>>>>> being added to the non ready list too soon.
>>>>>>> 
>>>>>>> This updates the backfill logic as follows:
>>>>>>> 
>>>>>>>     • Parallelize execution of tasks
>>>>>>>     • Use a leave first execution model; Breadth-first algorithm
by
>>>>>>> Jerermiah
>>>>>>>     • Replace state updates from the executor by task based
only
>>>>>>> updates
>>>>>>> 
>>>>>>> https://github.com/apache/incubator-airflow/pull/2107
>>>>>>> 
>>>>>>> Please review and test properly.
>>>>>>> 
>>>>>>> What has been left out at the moment is the checking the executor
>>> itself
>>>>>>> for multiple failures of a task run, where the task itself was
never
>>>>> able
>>>>>>> to execute. Let me know if this is a real world scenario (maybe
when
>>>>> disk
>>>>>>> space issue?). I will add it back in.
>>>>>>> 
>>>>>>> - Bolke
>>>>>>> 
>>>>>>> 
>>>>>>>> On 25 Feb 2017, at 09:07, Bolke de Bruin <bdbruin@gmail.com>
wrote:
>>>>>>>> 
>>>>>>>> Hi Dan,
>>>>>>>> 
>>>>>>>> - Backfill indeed runs only one dagrun at the time, see line
1755 of
>>>>>>> jobs.py. I’ll think about how to fix this over the weekend
(I think it
>>>>> was
>>>>>>> my change that introduced this). Suggestions always welcome.
Depending
>>>>> the
>>>>>>> impact it is a blocker or not. We don’t often use backfills
and
>>>>> definitely
>>>>>>> not at your size, so that is why it didn’t pop up with us.
I’m
>>> assuming
>>>>>>> blocker for now, btw.
>>>>>>>> - Speculation on the High DB Load. I’m not sure what your
benchmark
>>> is
>>>>>>> here (1.7.1 + multi processor dags?), but as you mentioned in
the code
>>>>>>> dependencies are checked a couple of times for one run and even
task
>>>>>>> instance. Dependency checking requires aggregation on the DB,
which
>>> is a
>>>>>>> performance killer. Annoying but not a blocker.
>>>>>>>> - Skipped tasks potentially cause a dagrun to be marked
>>> failure/success
>>>>>>> prematurely. BranchOperators are widely used if it affects these
>>>>> operators,
>>>>>>> then it is a blocker.
>>>>>>>> 
>>>>>>>> - Bolke
>>>>>>>> 
>>>>>>>>> On 25 Feb 2017, at 02:04, Dan Davydov <dan.davydov@airbnb.com.
>>>>> INVALID>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Update on old pending issues:
>>>>>>>>> - Black Squares in UI: Fix merged
>>>>>>>>> - Double Trigger Issue That Alex G Mentioned: Alex has
a PR in
>>> flight
>>>>>>>>> 
>>>>>>>>> New Issues:
>>>>>>>>> - Backfill seems to be having issues (only running one
dagrun at a
>>>>>>> time),
>>>>>>>>> we are still investigating - might be a blocker
>>>>>>>>> - High DB Load (~8x more than 1.7) - We are still investigating
but
>>>>> it's
>>>>>>>>> probably not a blocker for the release
>>>>>>>>> - Skipped tasks potentially cause a dagrun to be marked
as
>>>>>>> failure/success
>>>>>>>>> prematurely - not sure whether or not to classify this
as a blocker
>>>>>>> (only
>>>>>>>>> really an issue for users who use the BranchingPythonOperator,
which
>>>>>>> AirBnB
>>>>>>>>> does)
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 23, 2017 at 5:59 PM, siddharth anand <sanand@apache.org
>>>> 
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> IMHO, a DAG run without a start date is non-sensical
but is not
>>>>>>> enforced
>>>>>>>>>> That said, our UI allows for the manual creation
of DAG Runs
>>> without
>>>>> a
>>>>>>>>>> start date as shown in the images below:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%
>>>>>>>>>> 202017-02-22%2016.00.40.png?dl=0
>>>>>>>>>> <https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%
>>>>>>>>>> 202017-02-22%2016.00.40.png?dl=0>
>>>>>>>>>> - https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%
>>>>>>>>>> 202017-02-22%2016.02.22.png?dl=0
>>>>>>>>>> <https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%
>>>>>>>>>> 202017-02-22%2016.02.22.png?dl=0>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin
<
>>>>>>>>>> maximebeauchemin@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Our database may have edge cases that could be
associated with
>>>>> running
>>>>>>>>>> any
>>>>>>>>>>> previous version that may or may not have been
part of an official
>>>>>>>>>> release.
>>>>>>>>>>> 
>>>>>>>>>>> Let's see if anyone else reports the issue. If
no one does, one
>>>>>>> option is
>>>>>>>>>>> to release 1.8.0 as is with a comment in the
release notes, and
>>>>> have a
>>>>>>>>>>> future official minor apache release 1.8.1 that
would fix these
>>>>> minor
>>>>>>>>>>> issues that are not deal breaker.
>>>>>>>>>>> 
>>>>>>>>>>> @bolke, I'm curious, how long does it take you
to go through one
>>>>>>> release
>>>>>>>>>>> cycle? Oh, and do you have a documented step
by step process for
>>>>>>>>>> releasing?
>>>>>>>>>>> I'd like to add the Pypi part to this doc and
add committers that
>>>>> are
>>>>>>>>>>> interested to have rights on the project on Pypi.
>>>>>>>>>>> 
>>>>>>>>>>> Max
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin
<
>>> bdbruin@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> So it is a database integrity issue? Afaik
a start_date should
>>>>> always
>>>>>>>>>> be
>>>>>>>>>>>> set for a DagRun (create_dagrun) does so
 I didn't check the code
>>>>>>>>>> though.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 22 Feb 2017, at 22:19, Dan Davydov
<dan.davydov@airbnb.com.
>>>>>>>>>> INVALID>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Should clarify this occurs when a dagrun
does not have a start
>>>>> date,
>>>>>>>>>>> not
>>>>>>>>>>>> a
>>>>>>>>>>>>> dag (which makes it even less likely
to happen). I don't think
>>>>> this
>>>>>>>>>> is
>>>>>>>>>>> a
>>>>>>>>>>>>> blocker for releasing.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Feb 22, 2017 at 1:15 PM,
Dan Davydov <
>>>>>>>>>> dan.davydov@airbnb.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I rolled this out in our prod and
the webservers failed to load
>>>>> due
>>>>>>>>>> to
>>>>>>>>>>>>>> this commit:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [AIRFLOW-510] Filter Paused Dags,
show Last Run & Trigger Dag
>>>>>>>>>>>>>> 7c94d81c390881643f94d5e3d7d6fb351a445b72
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This fixed it:
>>>>>>>>>>>>>> -                            </a>
<span id="statuses_info"
>>>>>>>>>>>>>> class="glyphicon glyphicon-info-sign"
aria-hidden="true"
>>>>>>>>>> title="Start
>>>>>>>>>>>> Date:
>>>>>>>>>>>>>> {{last_run.start_date.strftime('%Y-%m-%d
%H:%M')}}"></span>
>>>>>>>>>>>>>> +                            </a>
<span id="statuses_info"
>>>>>>>>>>>>>> class="glyphicon glyphicon-info-sign"
>>> aria-hidden="true"></span>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is caused by assuming that all
DAGs have start dates set,
>>>>> so a
>>>>>>>>>>>> broken
>>>>>>>>>>>>>> DAG will take down the whole UI.
Not sure if we want to make
>>>>> this a
>>>>>>>>>>>> blocker
>>>>>>>>>>>>>> for the release or not, I'm guessing
for most deployments this
>>>>>>> would
>>>>>>>>>>>> occur
>>>>>>>>>>>>>> pretty rarely. I'll submit a PR to
fix it soon.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 21, 2017 at 9:49 AM,
Chris Riccomini <
>>>>>>>>>>> criccomini@apache.org
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ack that the vote has already
passed, but belated +1 (binding)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 21, 2017 at 7:42
AM, Bolke de Bruin <
>>>>>>> bdbruin@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> IPMC Voting can be found
here:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
>>>>>>>>>>>>>>> 201702.mbox/%
>>>>>>>>>>>>>>>> 3c676BDC9F-1B55-4469-92A7-9FF309AD0EC8@gmail.com%3e
<
>>>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
>>>>>>>>>>>>>>> 201702.mbox/%
>>>>>>>>>>>>>>>> 3C676BDC9F-1B55-4469-92A7-9FF309AD0EC8@gmail.com%3E>
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>> Bolke
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 21 Feb 2017, at 08:20,
Bolke de Bruin <bdbruin@gmail.com
>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Apache Airflow (incubating)
1.8.0 (based on RC4) has been
>>>>>>>>>> accepted.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 9 “+1” votes received:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> - Maxime Beauchemin (binding)
>>>>>>>>>>>>>>>>> - Arthur Wiedmer (binding)
>>>>>>>>>>>>>>>>> - Dan Davydov (binding)
>>>>>>>>>>>>>>>>> - Jeremiah Lowin (binding)
>>>>>>>>>>>>>>>>> - Siddharth Anand (binding)
>>>>>>>>>>>>>>>>> - Alex van Boxel (binding)
>>>>>>>>>>>>>>>>> - Bolke de Bruin (binding)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> - Jayesh Senjaliya (non-binding)
>>>>>>>>>>>>>>>>> - Yi (non-binding)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Vote thread (start):
>>>>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-
>>>>>>>>>>>>>>>> airflow-dev/201702.mbox/%3cD360D9BE-C358-42A1-9188-
>>>>>>>>>>>>>>>> 6C92C31A2F8B@gmail.com%3e
<http://mail-archives.apache.
>>>>>>>>>>>>>>>> org/mod_mbox/incubator-airflow-dev/201702.mbox/%3C7EB7B6D6-
>>>>>>>>>>>>>>> 092E-48D2-AA0F-
>>>>>>>>>>>>>>>> 15F44376A8FF@gmail.com%3E>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Next steps:
>>>>>>>>>>>>>>>>> 1) will start the voting
process at the IPMC mailinglist. I
>>> do
>>>>>>>>>>> expect
>>>>>>>>>>>>>>>> some changes to be required
mostly in documentation maybe a
>>>>>>>>>> license
>>>>>>>>>>>> here
>>>>>>>>>>>>>>>> and there. So, we might end
up with changes to stable. As
>>> long
>>>>> as
>>>>>>>>>>>> these
>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>> not (significant) code changes
I will not re-raise the vote.
>>>>>>>>>>>>>>>>> 2) Only after the positive
voting on the IPMC and
>>>>> finalisation I
>>>>>>>>>>> will
>>>>>>>>>>>>>>>> rebrand the RC to Release.
>>>>>>>>>>>>>>>>> 3) I will upload it to
the incubator release page, then the
>>>>> tar
>>>>>>>>>>> ball
>>>>>>>>>>>>>>>> needs to propagate to the
mirrors.
>>>>>>>>>>>>>>>>> 4) Update the website
(can someone volunteer please?)
>>>>>>>>>>>>>>>>> 5) Finally, I will ask
Maxime to upload it to pypi. It seems
>>>>> we
>>>>>>>>>> can
>>>>>>>>>>>>>>> keep
>>>>>>>>>>>>>>>> the apache branding as lib
cloud is doing this as well (
>>>>>>>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package
<
>>>>>>>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package>).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Jippie!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Bolke
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>> 

Mime
View raw message