airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bolke de Bruin <bdbr...@gmail.com>
Subject Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4
Date Mon, 27 Feb 2017 21:03:51 GMT
Dan

Can you elaborate on 2, cause I thought I specifically took care of that. 

Cheers
Bolke

Sent from my iPhone

> On 27 Feb 2017, at 20:27, Dan Davydov <dan.davydov@airbnb.com.INVALID> wrote:
> 
> I created https://issues.apache.org/jira/browse/AIRFLOW-921 to track the
> pending issues.
> 
> There are two more issues we found which I included there:
> 1. Task instances that have their state manually set to running make the UI
> for their DAG unable to parse
> 2. Mark success doesn't work for non existent task instances/dagruns which
> breaks the subdag use case (setting tasks as successful via the graph view)
> 
>> On Mon, Feb 27, 2017 at 11:06 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>> 
>> Hey Max
>> 
>> It is massive for sure. Sorry about that ;-). However it is not as massive
>> as you might deduct from a first view. 0) run tasks concurrently across dag
>> runs 1) ordering of the tasks was added to the loop. 2) calculating of
>> deadlocks, running tasks, tasks to run was corrected, 3) relying on the
>> executor for status updates was replaced, 4) (tbd) executor failure check
>> to protect against endless Ioops.
>> 
>> 0+1 seem bigger than they are due to the amount of lines changed. 2 is a
>> subtle change, that touches a couple of lines to pop/push properly. 3) is
>> bigger, as I didn't like the reliance on the executor. 4) is old code that
>> needs to be added again.
>> 
>> I probably can leave out 3 which makes 4 mood. The change would be
>> smaller. Maybe I could even completely remove 3 and just add 4. What are
>> your thoughts?
>> 
>> The random failures we were seeing were the "implicit" test of not a
>> executing in the right order and then deadlocking. But no explicit tests
>> exist. Help would definitely be appreciated.
>> 
>> Yes I thought about using the scheduler and/or reusing logic from the
>> scheduler. I even experimented a little with it but it didn't allow me to
>> pass the tests effectively.
>> 
>> What I am planning to do is split the function and make it unit testable
>> if you agree with the current approach.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 27 Feb 2017, at 18:35, Maxime Beauchemin <maximebeauchemin@gmail.com>
>> wrote:
>>> 
>>> This PR is pretty massive and complex! It looks like solid work but let's
>>> be really careful around testing and rolling this out.
>>> 
>>> This may be out of scope for this PR, but wanted to discuss the idea of
>>> using the scheduler's logic to perform backfills. It'd be nice to have
>> that
>>> logic in one place, though I lost grasp on the details around feasibility
>>> around this approach. I'm sure you looked into this option before issuing
>>> this PR and I'm curious to hear your thoughts on blockers/challenges
>> around
>>> this alternate approach.
>>> 
>>> Also I'm wondering whether we have any sort of mechanisms in our
>>> integration test to validate that task dependencies are respected and run
>>> in the right order. If not I was thinking we could build some abstraction
>>> to make it easy to write this type of tests in an expressive way.
>>> 
>>> ```
>>> #[some code to run a backfill, or a scheduler session]
>>> it = IntegrationTestResults(dag_id='exmaple1')
>>> assert it.ran_before('task1', 'task_2')
>>> assert ti.overlapped('task1', 'task_3') # confirms 2 tasks ran in
>> parallel
>>> assert ti.none_failed()
>>> assert ti.ran_last('root')
>>> assert ti.max_concurrency_reached() == POOL_LIMIT
>>> ```
>>> 
>>> Max
>>> 
>>>> On Mon, Feb 27, 2017 at 5:41 AM, Bolke de Bruin <bdbruin@gmail.com>
>> wrote:
>>>> 
>>>> I have worked in the Backfill issue also in collaboration with Jeremiah.
>>>> 
>>>> The refactor to use dag runs in backfills caused a regression
>>>> in task execution performance as dag runs were executed
>>>> sequentially. Next to that, the backfills were non deterministic
>>>> due to the random execution of tasks, causing root tasks
>>>> being added to the non ready list too soon.
>>>> 
>>>> This updates the backfill logic as follows:
>>>> 
>>>>       • Parallelize execution of tasks
>>>>       • Use a leave first execution model; Breadth-first algorithm by
>>>> Jerermiah
>>>>       • Replace state updates from the executor by task based only
>>>> updates
>>>> 
>>>> https://github.com/apache/incubator-airflow/pull/2107
>>>> 
>>>> Please review and test properly.
>>>> 
>>>> What has been left out at the moment is the checking the executor itself
>>>> for multiple failures of a task run, where the task itself was never
>> able
>>>> to execute. Let me know if this is a real world scenario (maybe when
>> disk
>>>> space issue?). I will add it back in.
>>>> 
>>>> - Bolke
>>>> 
>>>> 
>>>>> On 25 Feb 2017, at 09:07, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>> 
>>>>> Hi Dan,
>>>>> 
>>>>> - Backfill indeed runs only one dagrun at the time, see line 1755 of
>>>> jobs.py. I’ll think about how to fix this over the weekend (I think it
>> was
>>>> my change that introduced this). Suggestions always welcome. Depending
>> the
>>>> impact it is a blocker or not. We don’t often use backfills and
>> definitely
>>>> not at your size, so that is why it didn’t pop up with us. I’m assuming
>>>> blocker for now, btw.
>>>>> - Speculation on the High DB Load. I’m not sure what your benchmark
is
>>>> here (1.7.1 + multi processor dags?), but as you mentioned in the code
>>>> dependencies are checked a couple of times for one run and even task
>>>> instance. Dependency checking requires aggregation on the DB, which is a
>>>> performance killer. Annoying but not a blocker.
>>>>> - Skipped tasks potentially cause a dagrun to be marked failure/success
>>>> prematurely. BranchOperators are widely used if it affects these
>> operators,
>>>> then it is a blocker.
>>>>> 
>>>>> - Bolke
>>>>> 
>>>>>> On 25 Feb 2017, at 02:04, Dan Davydov <dan.davydov@airbnb.com.
>> INVALID>
>>>> wrote:
>>>>>> 
>>>>>> Update on old pending issues:
>>>>>> - Black Squares in UI: Fix merged
>>>>>> - Double Trigger Issue That Alex G Mentioned: Alex has a PR in flight
>>>>>> 
>>>>>> New Issues:
>>>>>> - Backfill seems to be having issues (only running one dagrun at
a
>>>> time),
>>>>>> we are still investigating - might be a blocker
>>>>>> - High DB Load (~8x more than 1.7) - We are still investigating but
>> it's
>>>>>> probably not a blocker for the release
>>>>>> - Skipped tasks potentially cause a dagrun to be marked as
>>>> failure/success
>>>>>> prematurely - not sure whether or not to classify this as a blocker
>>>> (only
>>>>>> really an issue for users who use the BranchingPythonOperator, which
>>>> AirBnB
>>>>>> does)
>>>>>> 
>>>>>> On Thu, Feb 23, 2017 at 5:59 PM, siddharth anand <sanand@apache.org>
>>>> wrote:
>>>>>> 
>>>>>>> IMHO, a DAG run without a start date is non-sensical but is not
>>>> enforced
>>>>>>> That said, our UI allows for the manual creation of DAG Runs
without
>> a
>>>>>>> start date as shown in the images below:
>>>>>>> 
>>>>>>> 
>>>>>>> - https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%
>>>>>>> 202017-02-22%2016.00.40.png?dl=0
>>>>>>> <https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%
>>>>>>> 202017-02-22%2016.00.40.png?dl=0>
>>>>>>> - https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%
>>>>>>> 202017-02-22%2016.02.22.png?dl=0
>>>>>>> <https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%
>>>>>>> 202017-02-22%2016.02.22.png?dl=0>
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin <
>>>>>>> maximebeauchemin@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Our database may have edge cases that could be associated
with
>> running
>>>>>>> any
>>>>>>>> previous version that may or may not have been part of an
official
>>>>>>> release.
>>>>>>>> 
>>>>>>>> Let's see if anyone else reports the issue. If no one does,
one
>>>> option is
>>>>>>>> to release 1.8.0 as is with a comment in the release notes,
and
>> have a
>>>>>>>> future official minor apache release 1.8.1 that would fix
these
>> minor
>>>>>>>> issues that are not deal breaker.
>>>>>>>> 
>>>>>>>> @bolke, I'm curious, how long does it take you to go through
one
>>>> release
>>>>>>>> cycle? Oh, and do you have a documented step by step process
for
>>>>>>> releasing?
>>>>>>>> I'd like to add the Pypi part to this doc and add committers
that
>> are
>>>>>>>> interested to have rights on the project on Pypi.
>>>>>>>> 
>>>>>>>> Max
>>>>>>>> 
>>>>>>>> On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin <bdbruin@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> So it is a database integrity issue? Afaik a start_date
should
>> always
>>>>>>> be
>>>>>>>>> set for a DagRun (create_dagrun) does so  I didn't check
the code
>>>>>>> though.
>>>>>>>>> 
>>>>>>>>> Sent from my iPhone
>>>>>>>>> 
>>>>>>>>>> On 22 Feb 2017, at 22:19, Dan Davydov <dan.davydov@airbnb.com.
>>>>>>> INVALID>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Should clarify this occurs when a dagrun does not
have a start
>> date,
>>>>>>>> not
>>>>>>>>> a
>>>>>>>>>> dag (which makes it even less likely to happen).
I don't think
>> this
>>>>>>> is
>>>>>>>> a
>>>>>>>>>> blocker for releasing.
>>>>>>>>>> 
>>>>>>>>>>> On Wed, Feb 22, 2017 at 1:15 PM, Dan Davydov
<
>>>>>>> dan.davydov@airbnb.com>
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I rolled this out in our prod and the webservers
failed to load
>> due
>>>>>>> to
>>>>>>>>>>> this commit:
>>>>>>>>>>> 
>>>>>>>>>>> [AIRFLOW-510] Filter Paused Dags, show Last Run
& Trigger Dag
>>>>>>>>>>> 7c94d81c390881643f94d5e3d7d6fb351a445b72
>>>>>>>>>>> 
>>>>>>>>>>> This fixed it:
>>>>>>>>>>> -                            </a> <span
id="statuses_info"
>>>>>>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true"
>>>>>>> title="Start
>>>>>>>>> Date:
>>>>>>>>>>> {{last_run.start_date.strftime('%Y-%m-%d %H:%M')}}"></span>
>>>>>>>>>>> +                            </a> <span
id="statuses_info"
>>>>>>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true"></span>
>>>>>>>>>>> 
>>>>>>>>>>> This is caused by assuming that all DAGs have
start dates set,
>> so a
>>>>>>>>> broken
>>>>>>>>>>> DAG will take down the whole UI. Not sure if
we want to make
>> this a
>>>>>>>>> blocker
>>>>>>>>>>> for the release or not, I'm guessing for most
deployments this
>>>> would
>>>>>>>>> occur
>>>>>>>>>>> pretty rarely. I'll submit a PR to fix it soon.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 21, 2017 at 9:49 AM, Chris Riccomini
<
>>>>>>>> criccomini@apache.org
>>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Ack that the vote has already passed, but
belated +1 (binding)
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Feb 21, 2017 at 7:42 AM, Bolke de
Bruin <
>>>> bdbruin@gmail.com
>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> IPMC Voting can be found here:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
>>>>>>>>>>>> 201702.mbox/%
>>>>>>>>>>>>> 3c676BDC9F-1B55-4469-92A7-9FF309AD0EC8@gmail.com%3e
<
>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
>>>>>>>>>>>> 201702.mbox/%
>>>>>>>>>>>>> 3C676BDC9F-1B55-4469-92A7-9FF309AD0EC8@gmail.com%3E>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>> Bolke
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 21 Feb 2017, at 08:20, Bolke de
Bruin <bdbruin@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Apache Airflow (incubating) 1.8.0
(based on RC4) has been
>>>>>>> accepted.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 9 “+1” votes received:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - Maxime Beauchemin (binding)
>>>>>>>>>>>>>> - Arthur Wiedmer (binding)
>>>>>>>>>>>>>> - Dan Davydov (binding)
>>>>>>>>>>>>>> - Jeremiah Lowin (binding)
>>>>>>>>>>>>>> - Siddharth Anand (binding)
>>>>>>>>>>>>>> - Alex van Boxel (binding)
>>>>>>>>>>>>>> - Bolke de Bruin (binding)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - Jayesh Senjaliya (non-binding)
>>>>>>>>>>>>>> - Yi (non-binding)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Vote thread (start):
>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-
>>>>>>>>>>>>> airflow-dev/201702.mbox/%3cD360D9BE-C358-42A1-9188-
>>>>>>>>>>>>> 6C92C31A2F8B@gmail.com%3e <http://mail-archives.apache.
>>>>>>>>>>>>> org/mod_mbox/incubator-airflow-dev/201702.mbox/%3C7EB7B6D6-
>>>>>>>>>>>> 092E-48D2-AA0F-
>>>>>>>>>>>>> 15F44376A8FF@gmail.com%3E>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Next steps:
>>>>>>>>>>>>>> 1) will start the voting process
at the IPMC mailinglist. I do
>>>>>>>> expect
>>>>>>>>>>>>> some changes to be required mostly in
documentation maybe a
>>>>>>> license
>>>>>>>>> here
>>>>>>>>>>>>> and there. So, we might end up with changes
to stable. As long
>> as
>>>>>>>>> these
>>>>>>>>>>>> are
>>>>>>>>>>>>> not (significant) code changes I will
not re-raise the vote.
>>>>>>>>>>>>>> 2) Only after the positive voting
on the IPMC and
>> finalisation I
>>>>>>>> will
>>>>>>>>>>>>> rebrand the RC to Release.
>>>>>>>>>>>>>> 3) I will upload it to the incubator
release page, then the
>> tar
>>>>>>>> ball
>>>>>>>>>>>>> needs to propagate to the mirrors.
>>>>>>>>>>>>>> 4) Update the website (can someone
volunteer please?)
>>>>>>>>>>>>>> 5) Finally, I will ask Maxime to
upload it to pypi. It seems
>> we
>>>>>>> can
>>>>>>>>>>>> keep
>>>>>>>>>>>>> the apache branding as lib cloud is doing
this as well (
>>>>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package
<
>>>>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package>).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Jippie!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Bolke
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 

Mime
View raw message