airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ash Berlin-Taylor <...@apache.org>
Subject Re: Need more pairs of eyes for the flaky LocalExecutorTest fix
Date Tue, 30 Apr 2019 08:39:26 GMT
Oh sorry - in the core Scheduler for the DAG parsing (inside jobs.py) not in the Celery Executor.


> On 30 Apr 2019, at 08:57, Driesprong, Fokko <fokko@driesprong.frl> wrote:
> 
> Looking at the PR I only see them being used in the LocalExecutor and the
> KubernetesExecutor, which makes sense. For CeleryExecutor we have Celery in
> place which handles the callbacks for us, therefore we don't need to poll.
> 
> Apart from that, I think this is an improvement over the current situation.
> 
> Cheers, Fokko
> 
> Op ma 29 apr. 2019 om 23:53 schreef Ash Berlin-Taylor <ash@apache.org>:
> 
>> We use multiprocessing queues in both the Celery and Kube executors too :D
>> 
>> -a
>> 
>>> On 29 Apr 2019, at 22:51, Driesprong, Fokko <fokko@driesprong.frl>
>> wrote:
>>> 
>>> Thanks for the explanation. I'd go for the manager as well. In the end,
>> if
>>> you really want to scale out, you should go for Celery/Kubernetes.
>>> 
>>> Cheers, Fokko
>>> 
>>> Op ma 29 apr. 2019 om 23:09 schreef Jarek Potiuk <
>> Jarek.Potiuk@polidea.com>:
>>> 
>>>> Yep. Manager() creates a SyncManager instance and runs start().
>>>> This in turn creates a Manager process that all processes communicate
>> with
>>>> via Proxies:
>>>> 
>>>> 
>> https://docs.python.org/3.4/library/multiprocessing.html?highlight=process#sharing-state-between-processes
>>>> 
>>>> It's reliable, but slower than Shared Memory used by direct Queue
>>>> instantiation.
>>>> 
>>>> J.
>>>> 
>>>> On Mon, Apr 29, 2019 at 10:26 PM Bas Harenslak <
>>>> basharenslak@godatadriven.com> wrote:
>>>> 
>>>>> @Fokko I believe that’s implicitly created by
>> multiprocessing.Manager()<
>>>>> 
>>>> 
>> https://docs.python.org/3.4/library/multiprocessing.html?highlight=process#multiprocessing.sharedctypes.multiprocessing.Manager
>>>>>> :
>>>>> 
>>>>> ….. The returned manager object corresponds to a spawned child process
>>>> and
>>>>> ….
>>>>> 
>>>>> I’m in favour of the Manager route, since this doesn’t introduce
>>>>> additional complex multiprocessing code in Airflow. I’m reviewing right
>>>> now
>>>>> although it’s a bit out of my comfort zone...
>>>>> 
>>>>> Bas
>>>>> 
>>>>> On 29 Apr 2019, at 21:46, Driesprong, Fokko <fokko@driesprong.frl
>>>> <mailto:
>>>>> fokko@driesprong.frl>> wrote:
>>>>> 
>>>>> I'm missing the part of another process? This is within the Scheduler
>>>>> process if I understand correctly.
>>>>> 
>>>>> Cheers, Fokko
>>>>> 
>>>>> Op ma 29 apr. 2019 om 21:33 schreef Jarek Potiuk <
>>>> Jarek.Potiuk@polidea.com
>>>>> <mailto:Jarek.Potiuk@polidea.com>>:
>>>>> 
>>>>> I am also leaning towards the manager. I updated the
>>>>> https://github.com/apache/airflow/pull/5200 PR now after review and
>> once
>>>>> it
>>>>> passes CI I think we can merge it.
>>>>> If anyone wants to have a look as well, happy to hear it.
>>>>> 
>>>>> J.
>>>>> 
>>>>> On Mon, Apr 29, 2019 at 2:14 PM Ash Berlin-Taylor <ash@apache.org
>>>> <mailto:
>>>>> ash@apache.org>> wrote:
>>>>> 
>>>>> I think I lean towards the built-in/manager approach as it is less
>>>>> concurrency code we have to manage/maintain in Airflow, though I'm not
>>>>> hugely happy about another process :(
>>>>> 
>>>>> -ash
>>>>> 
>>>>> On 29 Apr 2019, at 07:33, Jarek Potiuk <Jarek.Potiuk@polidea.com
>> <mailto:
>>>>> Jarek.Potiuk@polidea.com>>
>>>>> wrote:
>>>>> 
>>>>> Hello Everyone,
>>>>> 
>>>>> I think we need some more pairs of eyes to take a look at potential
>>>>> fixes
>>>>> we have for the pesky LocalExecutorTest that we are all experiencing
>>>>> with
>>>>> our Travis builds. Once we solve it I think we should be much closer
to
>>>>> have stable builds - including some other flaky test fixes merged
>>>>> recently.
>>>>> 
>>>>> It turned out that the problem relates to quite deep internals of how
>>>>> data
>>>>> is passed between processes using multiprocessing queues. It's really
>>>>> deep
>>>>> in the core processing of Airflow so I think it would be great if also
>>>>> other experienced Airflowers review and comment it and help to select
>>>>> the
>>>>> best solution as we could have missed something.
>>>>> 
>>>>> I was looking at it together with Ash and Bas and I (a bit too fast)
>>>>> merged
>>>>> a preliminary version of the fix last week. We reverted it later as it
>>>>> turned out to have some side effects, so we know we have to be careful
>>>>> with
>>>>> this one.
>>>>> 
>>>>> After more detailed analysis and discussions with Omar, we have now two
>>>>> potential candidates to fix it. Both are green and from local testing
-
>>>>> both are solving the problem in a different way.
>>>>> 
>>>>> - https://github.com/apache/airflow/pull/5199
>>>>> - https://github.com/apache/airflow/pull/5200
>>>>> 
>>>>> I tried to describe the problem, solution candidates with Pros and Cons
>>>>> in
>>>>> the JIRA ticket :
>>>>> https://issues.apache.org/jira/browse/AIRFLOW-4401
>>>>> 
>>>>> I'd love if we can get reviews in the PRs and input to discussion on
>>>>> which
>>>>> solution to choose.
>>>>> 
>>>>> J.
>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> Jarek Potiuk
>>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>> 
>>>>> M: +48 660 796 129 <+48660796129>
>>>>> E: jarek.potiuk@polidea.com<mailto:jarek.potiuk@polidea.com>
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> Jarek Potiuk
>>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>> 
>>>>> M: +48 660 796 129 <+48660796129>
>>>>> E: jarek.potiuk@polidea.com<mailto:jarek.potiuk@polidea.com>
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> 
>>>> Jarek Potiuk
>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>> 
>>>> M: +48 660 796 129 <+48660796129>
>>>> E: jarek.potiuk@polidea.com
>>>> 
>> 
>> 


Mime
View raw message