airflow-dev mailing list archives

From Jarek Potiuk <Jarek.Pot...@polidea.com>
Subject Re: Longer term Airflow planning
Date Sat, 13 Apr 2019 11:50:57 GMT
Hello Everyone,

A few thoughts I had after proposing a few changes/PRs/AIPs. I think we
should look at this not only from the system point of view but also from
the point of view of human psychology and emotions :)

* I personally felt pretty demotivated (that's the emotion part) seeing
200 open PRs and 20 AIPs without knowing which are actively being
worked on, which are abandoned, which are ideas only, etc.

* It has become much better (I feel enthusiastic about that) with AIPs
recently starting to get voted on and adopted (big +) and with the
discussions we had about mentoring/piloting AIPs, but I think we still have
too many AIPs per committer/PMC member. Hopefully more committers will be
added, some of the big proposals will get implemented, and we will all learn
how to get from AIP to implementation quickly, which may help speed up the
AIP implementation process.

* I feel pretty sad (emotions again) about PRs though. I think I have never
gone past the second page (and some of my PRs are on page 3 or 4 now even
though I am actively working on them). There is simply no easy way to
classify the PRs now.

* However, I don't think it should be the PMC's or committers' role to
classify and manage PRs. This should be the task of the contributor who
"owns" the PR, who makes sure that his or her PR is properly "promoted". All
the nagging, asking people for review, and making sure the PR is in good
shape should be on the side of the person who "owns" it. Otherwise PMC
members/committers will be swamped by purely administrative tasks, which is
not the point. The system should be designed so that it self-manages with
the help of contributors, who should have an incentive to manage their PRs
properly. I sympathise (emotions again) with committers/PMC members: they
should be doing interesting work and contributing where their expertise is
most needed, not losing time on things that are only distractions. So I
think it should be clear to everyone that the "owner" of the PR does all the
work necessary to draw the attention of PMC members/committers to their PR.
But the Contributors document should clearly state that this is the process
and what is expected of contributors. And it should be easy for contributors
to know who the best person to contact is (that part isn't easy now). This
is already partially solved by the "Recommended" reviewer in GitHub, but
some guidelines on which committers are experts in which areas might be
super-helpful.

Similarly, we should have clear guidance on how to label PRs (by the
owner!!!) and a rule that requires a proper label before a reviewer takes
a look. This can even be automated by checks on Travis.

*Proposal 1: have a short overview in CONTRIBUTORS of the committers/PMC
members and their areas of involvement/expertise*

*Proposal 2: have a rule that a PR has to be properly labeled before a
committer/PMC member even starts taking a look at it.*
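
A minimal sketch of what such an automated label check could look like, in
Python. The label prefixes ("area:", "kind:") and the function names are
hypothetical, made up for illustration; they are not existing Airflow
conventions. Fetching the labels from GitHub and wiring the check into
Travis are left out.

```python
# Hypothetical policy: a PR needs at least one label from each group.
# The prefixes below are assumptions for illustration only.
REQUIRED_LABEL_PREFIXES = ("area:", "kind:")


def missing_label_groups(labels, required_prefixes=REQUIRED_LABEL_PREFIXES):
    """Return the required prefixes for which the PR carries no label."""
    return [
        prefix
        for prefix in required_prefixes
        if not any(label.startswith(prefix) for label in labels)
    ]


def is_ready_for_review(labels):
    """A PR is ready for review once no required label group is missing."""
    return not missing_label_groups(labels)
```

A CI step could fail the build when `is_ready_for_review` returns False and
print the missing groups, so that the PR owner knows what to add.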

* I think there is a psychological effect that I really like about the
current way we "do not handle and let rot" some of the PRs. It is in the
nature of such an open-source project that people have many ideas, and some
of them rush into discussing and proposing them, sometimes even opening a
draft PR, but many of those PRs turn out to be not-that-important and the
original author abandons them. And that's fine. Let those PRs "rot". I think
this is how the human brain works: sometimes you think of a new idea and
start working on it, but when it turns out not to be that important you
abandon it, and then it is fairly difficult to force the author to "clean
up". And that's fine as well - we should not expect more from the authors.
People just work like that and we won't change it at scale. However, it is
all those stale, inactive PRs that are the reason for the "mess" we are
experiencing. Why don't we close all the stale PRs automatically? There are
no drawbacks to that, and there is a bot that we can easily employ to do it
for us: https://github.com/probot/stale . It even explains why closing PRs
automatically is actually good for contributors (otherwise they get false
expectations). And as a contributor you can always re-open such a PR easily
and effortlessly, signalling that you are still interested in pushing it
forward.

*Proposal 3: employ probot to close stale PRs automatically*
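
To illustrate the rule behind Proposal 3, here is a small Python sketch that
picks out stale PRs by their last-update date. It is not how probot/stale is
implemented or configured; the 60-day threshold, the function name and the
PR numbers are assumptions chosen only for the example.

```python
from datetime import datetime, timedelta


def find_stale(prs, now, max_idle_days=60):
    """Return the numbers of PRs not updated within max_idle_days.

    prs is an iterable of (pr_number, last_updated) pairs, where
    last_updated is a datetime. The 60-day default is an assumption.
    """
    cutoff = now - timedelta(days=max_idle_days)
    return [number for number, last_updated in prs if last_updated < cutoff]


# Hypothetical PR numbers and dates, for illustration only.
example_prs = [
    (101, datetime(2019, 1, 1)),   # idle for more than 60 days
    (102, datetime(2019, 4, 1)),   # recently updated
]
stale = find_stale(example_prs, now=datetime(2019, 4, 13))
```

A bot would then comment on and close the PRs in `stale`; any new activity
from the owner moves the last-update date forward and takes the PR off the
list.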

J.



On Thu, Apr 11, 2019 at 8:12 AM Deng Xiaodong <xd.deng.r@gmail.com> wrote:

> Some personal thoughts about the PR processing speed specifically.
>
> I'm trying to benchmark Airflow against other Apache projects (like Spark,
> Kafka) in terms of PR reviewing/merging speed: at this moment, there
> are 400+ open PRs in Spark and 500+ open PRs in Kafka. On the other hand,
> there are 26 committers of Kafka and 68 committers of Spark. For Airflow,
> we have fewer than 20 committers, and recently the # of open PRs has
> remained at about 200.
>
> (Caveat: this benchmarking is not 100% precise, as I didn't consider the
> total # of PRs processed per day. But it seems the # of commits per day of
> Kafka is roughly close to Airflow's.)
>
> Don't get me wrong: I don't think we have done well enough, and I do agree
> that there is big room for improvement. But to be fair, the situation of
> Airflow here is not that bad.
>
> I was nominated as a committer just about 1 month ago. Earlier, as a PR
> submitter, I also had the feeling "why are my PRs processed so slowly"; but
> now that I have started to think more about reviewing/approving/merging, I
> realize the current pace is fairly good (big thanks to the other
> committers).
>
> Another thing I would like to suggest: currently we committers almost never
> give "-1" for PRs. Even when committers disagree on a change proposal, they
> don’t close it. I would like to suggest that the PMC have this discussion:
> whether we can close a PR if we have a few "-1"s from committers (say 3 or
> 4). I believe this would help somewhat.
>
>
> Best regards,
>
> XD
>
> On Thu, Apr 11, 2019 at 13:54 airflowuser
> <airflowuser@protonmail.com.invalid> wrote:
>
> > 1. Getting more contributors is important, but it's also important to give
> > attention to the current contributors.
> > I noticed that if a PR has had no reviews and has reached page 3 or
> > beyond, it is likely to be forgotten.
> > Take this one for example:
> > https://github.com/apache/airflow/pull/4473
> > The author is required to rebase again. It's not very welcoming to new
> > contributors. There are more open PRs like this. One suggestion might be a
> > monthly status check of all open PRs to see if something was missed?
> >
> >
> > 2. The attention of committers doesn't always point to what the
> > community needs. Check this one:
> > https://github.com/apache/airflow/pull/1936 - a problem that bugs many
> > people, but there is no discussion of how to solve it. There have been more
> > than 4 releases since this PR was introduced, and the problem it tries to
> > fix has been neither addressed nor discussed. The author commented that he
> > can update the branch but he needs committers to be involved.
> >
> > Again, since everything is volunteer-based this makes sense and is
> > understandable; however, if the project wishes to get more contributors it
> > might be easier to start with the PRs that we already have rather than
> > putting effort into trying to invite new contributors.
> >
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Thursday, April 11, 2019 3:46 AM, Aizhamal Nurmamat kyzy
> > <aizhamal@google.com.INVALID> wrote:
> >
> > > Hello all,
> > >
> > > The Beam project has had similar problems. One of the things
> > > they did was formalize how contributions are tracked. I understand that
> > > tracking this sort of information is difficult for the PMC, so if there's
> > > interest, I'd be happy to work with the PMC to make tools to track
> > > contributions (e.g. a simple spreadsheet tracking contributions on PRs,
> > > StackOverflow answering, public speaking, documentation, etc.), so that we
> > > can streamline the "promotion" of new committers. This may also help
> > > incentivize "housekeeping" work, such as triaging of JIRA issues, testing,
> > > release management, etc.
> > >
> > > This may also help provide early feedback to people on track to becoming a
> > > committer (e.g. private emails of the kind "Hi X. The Airflow PMC has
> > > noticed and appreciates your contributions. We think you could improve by
> > > doing Y or Z").
> > >
> > > Let me know what you all think.
> > >
> > > Best,
> > > Aizhamal
> > >
> > > On Wed, Apr 10, 2019 at 5:24 PM Gabriel Silk gsilk@dropbox.com.invalid
> > > wrote:
> > >
> > > > > A lot of the problems that Quantopian experiences with Airflow can't be
> > > > > tackled without either "hacks" on top of Airflow; or deep reworkings of
> > > > > Airflow components. But that kind of rework is very challenging to
> > > > > implement with the current Airflow contribution process.
> > > >
> > > > Can you elaborate on what some of the problems are that Quantopian has
> > > > encountered, which would require significant re-work to Airflow to address?
> > > >
> > > > On Wed, Apr 10, 2019 at 8:19 AM Driesprong, Fokko fokko@driesprong.frl
> > > > wrote:
> > > >
> > > > > Hi James,
> > > > > Addressing your concerns one by one:
> > > > >
> > > > > -   There are a lot of users of Airflow, but their use cases and feature
> > > > >     usage are not well described. Something that seems trivial or
> > > > >     unnecessary to one user turns out to be what someone else's entire
> > > > >     workflow depends on.
> > > > >
> > > > >
> > > > > I think in general it is all about scheduling stuff. For me, this is also
> > > > > true for many software packages: 80% of the users only use 20% of the
> > > > > functionality. I think it is up to the committers to make sure that we
> > > > > don't remove any functionality too easily and break the workflow for
> > > > > others. However, sometimes this is what you want, for example dropping
> > > > > Python 2 support. I strongly believe that the flexibility offered by
> > > > > Airflow is both a strength and a weakness: it allows you to do virtually
> > > > > everything; on the other hand, maybe you should not do that :-)
> > > > >
> > > > > -   The Airflow JIRA feels completely unmaintained. Most of the issues I've
> > > > >     reported have never even been acknowledged, and it's hard to know what
> > > > >     versions an issue applies to. This makes it hard to know what to work
> > > > >     on or what would be most impactful to other users.
> > > > >
> > > > >
> > > > > Keeping track of Jira is a full-time job. Periodically I go through all
> > > > > the tickets, but it is also (mis)used for dumping stack traces or any
> > > > > other error. We should be more strict on this, as a community. If you're
> > > > > interested in doing this, let me know so I can grant you editor
> > > > > permissions.
> > > > >
> > > > > -   Hacking on Airflow is challenging, especially if you need to run a real
> > > > >     workload to examine your changes. (I saw the work for an improved local
> > > > >     dev process - great stuff!)
> > > > >
> > > > >
> > > > > This is a known problem. I think the community is doing an awesome job
> > > > > here. For example, Breeze by Polidea (
> > > > > https://www.youtube.com/watch?v=ffKFHV6f3PQ) and Whirl by
> > > > > ING/GoDataDriven (
> > > > > https://blog.godatadriven.com/open-source-airflow-local-development ).
> > > > >
> > > > > -   Keeping track of what's on master vs. what's in a release is
> > > > >     challenging, particularly since so many commits are for operators we'll
> > > > >     never use. (I know there's some discussion about breaking operators into
> > > > >     their own repos, and I hope that goes through.)
> > > > >
> > > > >
> > > > > The main job of the committers is to keep compatibility on the
> > > > > interfaces. The versions are clearly set in Jira when a ticket is being
> > > > > worked on. If the change is compatible with the new minor version, it
> > > > > will be included; otherwise, it will be set to the next major version.
> > > > >
> > > > > -   The PMCs are too busy to guarantee timely reviews, and rebasing is
> > > > >     extremely costly with how much code reorganization is happening. This
> > > > >     strongly discourages putting in time to develop anything other than
> > > > >     relatively isolated features, often new features.
> > > > >
> > > > >
> > > > > The code grew rapidly over time, which required reorganizing a lot of
> > > > > code. This is required to keep development possible and make the code
> > > > > more accessible to newcomers. For example, the splitting up of the
> > > > > infamous models.py (a file with well over 6k lines) was quite a pain
> > > > > with circular imports. This is periodically necessary to keep the code
> > > > > organized. Please note that reviewing isn't a task for the PMC only; it
> > > > > is also for the committers and contributors. If there are any
> > > > > functionalities that you use a lot, please also provide reviews on that
> > > > > topic.
> > > > >
> > > > > For me, being a committer and PMC member on the project is just
> > > > > something that I do out of passion for Airflow. It isn't my job and I
> > > > > don't get paid for it. That being said, I do agree with getting more
> > > > > committers on board to strengthen the workforce.
> > > > >
> > > > > We're now preparing for Airflow 2.0, including a couple of AIPs. The
> > > > > question of whether there will be a true container-native, or
> > > > > cloud-native, version of Airflow is completely up to you and the
> > > > > community. I'm in favor of jumping on the container train, but this
> > > > > requires reworking the codebase of Airflow.
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > > On Wed, 10 Apr 2019 at 16:56, Szymon Przedwojski <
> > > > > szymon.przedwojski@polidea.com> wrote:
> > > > >
> > > > > > I think it is quite clear that Airflow needs more committers.
> > > > > > Looking at AIPs, PRs and this devlist, there are quite a few active
> > > > > > people who might be a good fit to become committers.
> > > > > > With the community and the project growing, I think it should be
> > > > > > natural to increase the number of committers as well. I know a new
> > > > > > committer comes along every now and then, but maybe it’s still not
> > > > > > enough and maybe Airflow should recruit them more “aggressively”?
> > > > > > Szymon Przedwojski
> > > > > > Polidea | Software Engineer
> > > > > > M: +48 500 330 790
> > > > > > E: szymon.przedwojski@polidea.com
> > > > > >
> > > > > > > On 10 Apr 2019, at 16:47, airflowuser <
> > > > > > > airflowuser@protonmail.com.INVALID> wrote:
> > > > > > >
> > > > > > > The Jira is a mess and it requires committers' time to organize it.
> > > > > > > Ideally users should report issues and committers should tag them
> > > > > > > with priority, milestone / fix version, labels (this is how, for
> > > > > > > example, it's done with https://github.com/pandas-dev/pandas )
> > > > > > >
> > > > > > > When I have time I try to compile a list of Jira issues that require
> > > > > > > committers' attention and ashb fixes them, but it's progressing
> > > > > > > slowly. I think that at least it would be great if the version field
> > > > > > > in the Jira were mandatory when a user submits a ticket.
> > > > > > >
> > > > > > > At the end... committers simply don't have time for this. They don't
> > > > > > > have enough time for reviewing PRs either, so I doubt something will
> > > > > > > change in the near future.
> > > > > >
> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > > On Wednesday, April 10, 2019 5:18 PM, James Meickle <
> > > > > > > jmeickle@quantopian.com.INVALID> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > > I've been following Airflow development fairly actively for over a
> > > > > > > > year. In that time, the company I work at (Quantopian) has gone
> > > > > > > > all-in on Airflow. It's a core part of our business and required
> > > > > > > > for daily operations. However, I've had some concerns over the
> > > > > > > > future of the project. Part of these concerns are because it's
> > > > > > > > difficult to contribute to Airflow:
> > > > > > > >
> > > > > > > > -   There are a lot of users of Airflow, but their use cases and
> > > > > > > >     feature usage are not well described. Something that seems
> > > > > > > >     trivial or unnecessary to one user turns out to be what someone
> > > > > > > >     else's entire workflow depends on.
> > > > > > > >
> > > > > > > > -   The Airflow JIRA feels completely unmaintained. Most of the
> > > > > > > >     issues I've reported have never even been acknowledged, and it's
> > > > > > > >     hard to know what versions an issue applies to. This makes it
> > > > > > > >     hard to know what to work on or what would be most impactful to
> > > > > > > >     other users.
> > > > > > > >
> > > > > > > > -   Hacking on Airflow is challenging, especially if you need to run
> > > > > > > >     a real workload to examine your changes. (I saw the work for an
> > > > > > > >     improved local dev process - great stuff!)
> > > > > > > >
> > > > > > > > -   Keeping track of what's on master vs. what's in a release is
> > > > > > > >     challenging, particularly since so many commits are for
> > > > > > > >     operators we'll never use. (I know there's some discussion about
> > > > > > > >     breaking operators into their own repos, and I hope that goes
> > > > > > > >     through.)
> > > > > > > >
> > > > > > > > -   The PMCs are too busy to guarantee timely reviews, and rebasing
> > > > > > > >     is extremely costly with how much code reorganization is
> > > > > > > >     happening. This strongly discourages putting in time to develop
> > > > > > > >     anything other than relatively isolated features, often new
> > > > > > > >     features.
> > > > > > > >
> > > > > > > > A lot of the problems that Quantopian experiences with Airflow can't
> > > > > > > > be tackled without either "hacks" on top of Airflow; or deep
> > > > > > > > reworkings of Airflow components. But that kind of rework is very
> > > > > > > > challenging to implement with the current Airflow contribution
> > > > > > > > process.
> > > > > > > >
> > > > > > > > I'm glad that we've recently adopted AIPs, but the way we're using
> > > > > > > > them seems better suited to planning isolated features. The Airflow
> > > > > > > > project does not have a well-maintained roadmap, nor any mechanism
> > > > > > > > to produce one by weighing AIPs based on synergy vs. developer
> > > > > > > > interest vs. user interest.
> > > > > > > >
> > > > > > > > I think that this lack of long-term planning makes it even more
> > > > > > > > challenging to propose larger reworks that might require multiple
> > > > > > > > AIPs to implement, each of which individually might yield little
> > > > > > > > benefit. I worry that we may approve a series of "promising" AIPs
> > > > > > > > that, taken together, don't amount to anything greater than a "pile
> > > > > > > > of new features"; instead of balancing feature improvements with
> > > > > > > > platform improvements that will unlock more fundamental changes to
> > > > > > > > how Airflow can work.
> > > > > > > >
> > > > > > > > I'd like to see some discussion of what it would look like to set
> > > > > > > > long term goals for Airflow. What is Airflow 2 going to look like?
> > > > > > > > How much backwards compat will it break? When should we expect
> > > > > > > > Airflow 3? Are they going to be "business as usual" releases, or
> > > > > > > > will they embrace any new concepts or idioms? Will there be a true
> > > > > > > > container-native, or cloud-native version of Airflow? Will we work
> > > > > > > > to be better for current users, or to embrace new classes of users?
> > > > > > > >
> > > > > > > > I have some thoughts of my own, of course, but I'd like to hear what
> > > > > > > > other people have to say on this topic first!
> >
> >
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
E: jarek.potiuk@polidea.com
