airflow-dev mailing list archives

From Jarek Potiuk <Jarek.Pot...@polidea.com>
Subject Re: Multi-layered official image for Airflow
Date Mon, 18 Mar 2019 01:23:54 GMT
Hello everyone,

I believe I am now ready to involve more of the community in the
multi-layered Docker AIP-10 that I have been working on for some time (with
comments and encouragement from Ash and Fokko, as explained in the AIP
thread).

Any comments, questions, critique, improvement proposals, or even help :)
is more than welcome.

The work is still WIP: https://github.com/apache/airflow/pull/4543

The AIP Confluence page (fairly detailed already) is in
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image
- I think it is the best place for the discussion (as Bas suggested in the
AIP thread)

I am still working on making the tests on Travis green, but I am on a good
path. I'd appreciate any help with it, especially with the Kubernetes tests,
which will likely need some small fixes in the environment or maybe even
switching to minikube's Docker image in docker-compose.

What works now (I think it addresses quite a lot of the concerns Fokko
mentioned):

   - Tox is removed and replaced with pure-docker execution of tests (yay!)
   - The same Dockerfile is used for both the "slim" Airflow image and the
   Airflow CI image used for tests. Once we merge it, we will be able to
   deprecate the incubator-airflow-ci image.
   - Part of the PR is also related to "Simplified development environment
   - AIP-7" (aka Airflow Breeze). I have a nice working Breeze environment as
   part of the change now - I will split it off eventually to separate
   discussion/PR but for now it makes it easier for me to run tests so I keep
   it in.
   - The multi-stage/multi-layered Dockerfile should already improve CI
   build "purity". The way "layers" work now is that pip dependencies are
   effectively frozen between setup.py changes. Only when setup.py changes
   are the corresponding layers rebuilt and the dependencies re-installed.
   That should provide better "out-of-the-box" stability of CI builds even
   before we solve the dependency problem in a more "systematic" way (as
   Fokko mentioned, we should have a separate AIP for that). I am happy to
   discuss more, either now or in the future AIP. Fixing this eventually is
   quite close to my interests as well.
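
To illustrate the caching behaviour described in the last point, here is a
minimal sketch in Python. It is a simplified model of how a layer cache
behaves, not the Dockerfile from the PR: the step names, the build() helper
and the sample file contents are all hypothetical. Each layer's cache key
depends on its parent layer, the instruction, and the content of any copied
file, so the dependency layer is reused until setup.py changes.

```python
import hashlib


def layer_key(parent_key: str, instruction: str, content: bytes = b"") -> str:
    """Cache key of a layer: parent key + instruction + copied file content."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    digest.update(instruction.encode())
    digest.update(content)
    return digest.hexdigest()


def build(setup_py: bytes, sources: bytes, cache: set) -> list:
    """Simulate one image build; return the steps that missed the cache."""
    # Hypothetical step order: dependencies are installed before sources are copied.
    steps = [
        ("FROM base", b""),
        ("COPY setup.py", setup_py),
        ("RUN pip install", b""),
        ("COPY sources", sources),
    ]
    rebuilt = []
    key = ""
    for instruction, content in steps:
        key = layer_key(key, instruction, content)
        if key not in cache:
            # Cache miss: this layer (and, via the key chain, all later ones) is rebuilt.
            cache.add(key)
            rebuilt.append(instruction)
    return rebuilt


cache = set()
first = build(b"install_requires=['a']", b"v1", cache)        # cold cache
second = build(b"install_requires=['a']", b"v2", cache)       # only sources changed
third = build(b"install_requires=['a', 'b']", b"v2", cache)   # setup.py changed

print(first)
print(second)
print(third)
```

In this model a source-only change rebuilds just the last step, while a
setup.py change rebuilds the dependency step and everything after it.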

I went through several iterations, and what I came up with is already quite
simple and straightforward compared to some of the initial approaches I took.

I added a fairly detailed description and motivation and a proposed design,
and I even measured the impact of layering on build times (all in the AIP-10
Confluence page).

I will continue fixing tests and rebasing the changes for some time (even a
few weeks if needed) to test how it behaves with real changes coming in
regularly.

For now I have a separate DockerHub build and Travis CI instance where I
will keep running the tests (automatically):

   - DockerHub:
   https://cloud.docker.com/repository/docker/potiuk/airflow/timeline
   - Travis CI: https://travis-ci.org/potiuk/airflow/builds

J.



On Thu, Jan 17, 2019 at 12:12 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
wrote:

> I've updated the calculations after removing some artifacts and rebuilding
> the images from scratch. Here are the updated conclusions:
>
>
>    - The multi-layered image is only slightly bigger than the
>    mono-layered one (around *2% more* in total). Download time is also
>    slightly longer, by 1 s (33.7 s vs. 32.7 s), which is *3% longer*.
>    - For users who download the image regularly, the multi-layered image
>    is much better: for a simulated user downloading the Airflow image
>    twice a week, it is *4950 MB* (multi-layered) vs. *13546 MB*
>    (mono-layered) of downloads over the course of 8 weeks, yielding *64%
>    less data* to download.
>    - Multi-layered image seems to be much better for users regularly
>    downloading the image.
>
>
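
The percentages above can be re-derived from the quoted figures. The short
Python sketch below does only that arithmetic; the helper names are
hypothetical. Note that with the rounded MB values quoted above the saving
comes out at about 63.5%, close to the quoted 64% (the small difference
presumably comes from rounding of the inputs).

```python
def percent_less(smaller: float, larger: float) -> float:
    """How much less `smaller` is than `larger`, in percent."""
    return (1 - smaller / larger) * 100


def percent_more(larger: float, smaller: float) -> float:
    """How much more `larger` is than `smaller`, in percent."""
    return (larger / smaller - 1) * 100


# Figures quoted in the message above.
download_saving = percent_less(4950, 13546)       # MB over 8 weeks
pull_time_overhead = percent_more(33.7, 32.7)     # seconds for a single download

print(f"{download_saving:.1f}% less data")        # 63.5% less data
print(f"{pull_time_overhead:.1f}% longer")        # 3.1% longer
```
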
> On Wed, Jan 16, 2019 at 10:59 PM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> wrote:
>
>> Hello Everyone,
>>
>> Following the discussion we had on Mono-layered vs. Multi-layered
>> official image for Airflow here
>> https://github.com/apache/airflow/pull/4483, I prepared a
>> proof-of-concept PR of multi-layered image (based on the mono-layered one)
>> and I performed calculations and reached some conclusions in this proposal
>> (I wanted to have some hard numbers to back the statement that a
>> multi-layered Dockerfile is better):
>>
>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+official+Airflow+image
>>
>> The conclusions I reached:
>>
>>    - The multi-layered image is even slightly smaller than the
>>    mono-layered one - so multi-layered image is even better when you download
>>    it once
>>    - For users who download the image regularly, the multi-layered image
>>    is much better: for a simulated user downloading the Airflow image
>>    twice a week, it is 5.7 GB (multi-layered) vs. 16.15 GB (mono-layered)
>>    of downloads over the course of 8 weeks.
>>    - Multi-layered image is better choice.
>>
>>
>> I based those calculations on the PR I prepared:
>> https://github.com/apache/airflow/pull/4543 where I implemented a rather
>> nice multi-layered Dockerfile that can be easily maintained.
>>
>> It's  based on my experience with Airflow Breeze
>> <https://github.com/PolideaInternal/airflow-breeze> - the GCP
>> Development environment we used to develop 30+ GCP based operators recently.
>>
>> I hope we can reach the conclusion as the community that multi-layered is
>> better and that we can go in this direction :). I am happy to iterate on my
>> PR to make it even better.
>>
>> J.
>>
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>> E: jarek.potiuk@polidea.com
>>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
E: jarek.potiuk@polidea.com
