airflow-dev mailing list archives

From Jarek Potiuk <Jarek.Pot...@polidea.com>
Subject Re: Multi-layered official image for Airflow
Date Mon, 18 Mar 2019 23:00:31 GMT
After some initial discussion and a suggestion from Daniel, I split the
change into three PRs that can be reviewed and merged separately:


   - AIRFLOW-4115 JIRA <https://issues.apache.org/jira/browse/AIRFLOW-4115>,
   PR <https://github.com/apache/airflow/pull/4936> - Dockerfile for the
   main Airflow image is multi-stage and has multiple layers

followed by

   - AIRFLOW-4116 JIRA <https://issues.apache.org/jira/browse/AIRFLOW-4116>,
   PR <https://github.com/apache/airflow/pull/4937> - Support for Main/CI
   images in a single Dockerfile

followed by

   - AIRFLOW-4117 JIRA <https://issues.apache.org/jira/browse/AIRFLOW-4117>,
   PR <https://github.com/apache/airflow/pull/4938> - Travis CI uses the
   multi-stage Docker image to run tests


J.

On Mon, Mar 18, 2019 at 2:23 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
wrote:

> Hello everyone,
>
> I believe I am now ready to involve more of the community in the
> multi-layered Docker AIP-10 that I have been working on for some time (with
> comments and encouragement from Ash and Fokko, as explained in the AIP
> thread).
>
> Any comments, questions, critique, improvement proposals, or even help :)
> is more than welcome.
>
> The work is still WIP: https://github.com/apache/airflow/pull/4543
>
> The AIP Confluence page (fairly detailed already) is at
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image
> - I think it is the best place for the discussion (as Bas suggested in the
> AIP thread).
>
> I am still working on making the tests on Travis green, but I am on a good
> path. I'd appreciate any help with it, especially with the Kubernetes tests,
> which will likely need some small fixes in the environment or maybe even
> switching to minikube's Docker image in docker-compose.
>
> What works now (I think it addresses quite a lot of the concerns Fokko
> mentioned):
>
>    - Tox is removed and replaced with pure-docker execution of tests
>    (yay!)
>    - The same Dockerfile is used for both the "slim" Airflow image and the
>    Airflow CI image used for tests. Once we merge it, we will be able to
>    deprecate the incubator-airflow-ci image.
>    - Part of the PR is also related to "Simplified development
>    environment - AIP-7" (aka Airflow Breeze). I have a nice working Breeze
>    environment as part of the change now. I will eventually split it off into
>    a separate discussion/PR, but for now it makes it easier for me to run
>    tests, so I am keeping it in.
>    - The multi-stage/multi-layered Dockerfile should already improve CI
>    build "purity". The way "layers" work now is that PIP dependencies are
>    effectively frozen in between setup.py changes. Only when setup.py changes
>    are the corresponding layers rebuilt and the dependencies re-installed.
>    That should provide better "out-of-the-box" stability of CI builds even
>    before we solve the dependency problem in a more "systematic" way (as
>    Fokko mentioned, we should have a separate AIP for that). I am happy to
>    discuss more, either now or in the future AIP. Fixing this eventually is
>    quite close to my interests as well.
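
The caching behaviour described in the last bullet can be illustrated with a
small model. This is only a sketch of how a Docker-style build cache behaves,
not the Dockerfile from the PR: the step names, the `layer_key` and `build`
helpers, and the sample file contents are all hypothetical. Each layer's cache
key is assumed to depend on its parent layer, its instruction, and the files it
copies, so a change to setup.py invalidates the dependency layer and everything
after it, while a change to the sources leaves the dependency layer cached.

```python
import hashlib


def layer_key(parent_key: str, instruction: str, input_bytes: bytes = b"") -> str:
    """Cache key of a layer: parent key + instruction + copied file content."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    digest.update(instruction.encode())
    digest.update(input_bytes)
    return digest.hexdigest()


def build(setup_py: bytes, sources: bytes, cache: set) -> list:
    """Simulate one image build; return the steps that could not use the cache."""
    # Hypothetical step order: dependencies are installed before sources are copied.
    steps = [
        ("FROM base", b""),
        ("COPY setup.py", setup_py),
        ("RUN pip install", b""),
        ("COPY sources", sources),
    ]
    rebuilt = []
    key = ""
    for instruction, data in steps:
        key = layer_key(key, instruction, data)
        if key not in cache:
            cache.add(key)
            rebuilt.append(instruction)
    return rebuilt


cache = set()
# First build: nothing is cached yet, so every step runs.
first = build(b"install_requires=['a']", b"v1", cache)
# Only the sources changed: the dependency layer is reused.
second = build(b"install_requires=['a']", b"v2", cache)
# setup.py changed: the dependency layer and all later layers are rebuilt.
third = build(b"install_requires=['a', 'b']", b"v2", cache)

print(first)
print(second)
print(third)
```

In this model the second build re-runs only the source copy, which is the
"frozen in between setup.py changes" effect; the third build re-runs the
dependency installation because its inputs changed.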
>
> I went through several iterations and what I came up with is already quite
> simple and straightforward compared to some of the initial approaches I took.
>
> I added a fairly detailed description and motivation, a proposed design, and
> even measured the impact of layering on build times (all in the AIP-10
> Confluence page).
>
> I will continue fixing tests and rebasing the changes for some time (even
> a few weeks if needed) to test how it behaves with real changes coming
> in regularly.
>
> For now, I have a separate DockerHub build and a separate
> Travis CI instance where I will keep running the tests (automatically):
>
>    - DockerHub:
>    https://cloud.docker.com/repository/docker/potiuk/airflow/timeline
>    - Travis CI: https://travis-ci.org/potiuk/airflow/builds
>
> J.
>
>
>
> On Thu, Jan 17, 2019 at 12:12 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> wrote:
>
>> I've updated the calculations after removing some artifacts and rebuilding
>> the images from scratch. Here are the updated conclusions:
>>
>>
>>    - The multi-layered image is only slightly bigger than the
>>    mono-layered one (around *2% more* in total). Download time is also
>>    slightly longer, by 1 s (33.7 s vs. 32.7 s), which is *3% longer*.
>>    - Regular downloads are far cheaper with the multi-layered image: a
>>    simulated user downloading the Airflow image twice a week fetches
>>    *4950 MB* (multi-layered) vs. *13546 MB* (mono-layered) over the
>>    course of 8 weeks, which is *64% less data* to download.
>>    - The multi-layered image seems to be much better for users who
>>    regularly download the image.
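
The percentages in the list above can be re-derived from the quoted totals. The
snippet below is just that arithmetic, using only the figures from this message;
the variable names are mine. Note that the rounded megabyte totals give a saving
of roughly 63.5%, in line with the approximately 64% quoted, so the small
difference is presumably due to rounding of the published totals.

```python
# Totals quoted above: 8 weeks of twice-weekly downloads, in MB.
multi_layered_mb = 4950
mono_layered_mb = 13546

# Fraction of data saved by the multi-layered image over the 8 weeks.
saving = 1 - multi_layered_mb / mono_layered_mb

# Single full download times quoted above, in seconds.
multi_layered_s = 33.7
mono_layered_s = 32.7

# Relative slowdown of a single full download of the multi-layered image.
slowdown = (multi_layered_s - mono_layered_s) / mono_layered_s

print(f"data saved over 8 weeks: {saving:.1%}")
print(f"single download slowdown: {slowdown:.1%}")
```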
>>
>>
>> On Wed, Jan 16, 2019 at 10:59 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
>> wrote:
>>
>>> Hello Everyone,
>>>
>>> Following the discussion we had on the mono-layered vs. multi-layered
>>> official image for Airflow here
>>> https://github.com/apache/airflow/pull/4483, I prepared a
>>> proof-of-concept PR of a multi-layered image (based on the mono-layered
>>> one), performed some calculations, and reached some conclusions in this
>>> proposal (I wanted to have some hard numbers to back the statement that
>>> a multi-layered Dockerfile is better):
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+official+Airflow+image
>>>
>>> The conclusions I reached:
>>>
>>>    - The multi-layered image is even slightly smaller than the
>>>    mono-layered one, so the multi-layered image is better even when you
>>>    download it only once.
>>>    - Regular downloads are far cheaper with the multi-layered image: a
>>>    simulated user downloading the Airflow image twice a week fetches
>>>    5.7 GB (multi-layered) vs. 16.15 GB (mono-layered) over the course
>>>    of 8 weeks.
>>>    - The multi-layered image is the better choice.
>>>
>>>
>>> I based those calculations on the PR I prepared:
>>> https://github.com/apache/airflow/pull/4543 where I implemented a rather
>>> nice multi-layered Dockerfile that can be easily maintained.
>>>
>>> It's based on my experience with Airflow Breeze
>>> <https://github.com/PolideaInternal/airflow-breeze> - the GCP
>>> development environment we recently used to develop 30+ GCP-based operators.
>>>
>>> I hope we can reach the conclusion as a community that multi-layered
>>> is better and that we can go in this direction :). I am happy to iterate on
>>> my PR to make it even better.
>>>
>>> J.
>>>
>>>
>>> --
>>>
>>> Jarek Potiuk
>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>
>>> M: +48 660 796 129 <+48660796129>
>>> E: jarek.potiuk@polidea.com
>>>
>>
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>> E: jarek.potiuk@polidea.com
>>
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> E: jarek.potiuk@polidea.com
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
E: jarek.potiuk@polidea.com
