reef-dev mailing list archives

From Byung-Gon Chun <bgc...@gmail.com>
Subject Re: [Discuss] Timely releases with known issues vs. rare issue-free releases
Date Wed, 12 Apr 2017 05:07:15 GMT
Mariia, thanks for the insightful note.

It's hard to decide ( :( ) since both waiting until the failures are fixed and
releasing without further delay make sense.




On Wed, Apr 12, 2017 at 12:39 PM, Mariia Mykhailova <
mamykhai@microsoft.com.invalid> wrote:

> First, we should note that 0.16 is not the first release preceded by a
> rather long bug hunt in our code; we spent at least a month fixing bugs for
> each of 0.15 and 0.14. We tend to prioritize work on actual features between
> releases, and we either start investigating known test failures only before
> the release, or find bugs only when we start testing a release candidate on
> various platforms/machines/environments. During the previous release
> discussions we used to agree that we shouldn’t release code with known
> reproducible test failures.
>
> Second, the fact that the test failure only happens in CI (as opposed to
> our dev boxes) doesn't imply that this isn't an actual bug in our code,
> that it can't happen in production or that it won't be bad if it happens.
> We have a history of actual nasty bugs in code which we discovered only
> using CI servers, and they were very much not obvious (which is why it
> takes so long to track them down). Two of the bugs which are the freshest
> in my memory were related to a REEF job never terminating; I don't think
> doing a release with a known issue "REEF job sometimes never terminates"
> (that will stay a known issue for at least half a year) is doing a service
> to our users.
>
> On the other hand, the nature of our CI servers is such that occasional
> test failures are inevitable. I've observed a lot of one-time transient
> test failures which I didn't consider JIRA-worthy because they were
> obviously caused by some resource issue on the CI server. Besides, our .NET
> tests, especially IMRU ones, deal with a lot of concurrency and may
> sometimes not account for certain valid sequences of events.
>
> Right now we have a whole list of problems:
> 1. Test failures don't necessarily indicate a bug in the system, but they
> might.
> 2. We don't have sufficient understanding of when the failures indicate a
> serious bug and when they don't.
> 3. The tests we run in CI don't reflect the actual usage patterns of REEF
> in production (we only run tests on the local runtime, not on YARN).
> 4. We don't have sufficient motivation to debug until we approach a
> release.
> 5. When we approach a release, we have motivation to mark any failures as
> known issues and proceed :-)
>
> I guess we should balance our belief that the test failures are benign/not
> frequent enough to cause problems if used in production vs our need to have
> some motivation for investigating them (if we always dismiss them as
> transient CI failures, we'll be missing actual bugs).
>
> -Mariia
>
> -----Original Message-----
> From: Taegeon Um [mailto:taegeonum@gmail.com]
> Sent: Tuesday, April 11, 2017 6:37 PM
> To: dev@reef.apache.org
> Subject: Re: [Discuss] Timely releases with known issues vs. rare
> issue-free releases
>
> Hi,
>
> I totally agree with Markus on timely releases and making a release
> with a `known issues` section.
> If the test failures are rare, it would be good to note down the known
> issues and move forward to the release.
>
> My main concern is how long we keep them as known issues. We cannot keep
> them as known issues forever.
> I think we need at least a cut-off date. For example, if we find transient
> failures in the 0.16 snapshot, should we resolve them by the next release
> (0.17)?
>
> Thanks,
> Taegeon
>
>
> On Apr 12, 2017 at 6:09 AM, "Byung-Gon Chun" <bgchun@gmail.com> wrote:
>
> I understand this concern. It's been about a month since we started to
> talk about release 0.16. Also, our last release occurred a long time ago.
>
> Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure
> issues for release 0.16. Since we heard from Julia, I hope to hear from
> Mariia, Taegeon, and Sergiy as well. What're your thoughts?
>
>
>
> On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <
> Qiuhe.Wang@microsoft.com.invalid> wrote:
>
> > I totally agree with Markus's comments.
> >
> > If we have test failures that show some bugs in the system and impact
> > the quality of the code, we should resolve before a release.
> >
> > The few current transient test failures only happen on AppVeyor. They
> > could be test issues that hit some edge scenarios, most likely related
> > to the timing of the events. I fixed some of them a couple of weeks
> > ago: we didn't dispose of the active context in one of the tests (it
> > was a bug in the test), the validation condition in a test was too strict
> > because events may be received in a different order, we depended on log
> > messages in test handlers, which may receive events later than the driver
> > receives them, etc.
> >
> > The failing tests may be just test issues or may imply some system
> > issues; we don't know yet. But as long as there is no obvious defect
> > in the current code base, I would think it should not block the release.
> >
> > Thanks,
> > Julia
> >
> > -----Original Message-----
> > From: Markus Weimer [mailto:markus@weimo.de]
> > Sent: Tuesday, April 11, 2017 9:26 AM
> > To: REEF Developers Mailinglist <dev@reef.apache.org>
> > Subject: [Discuss] Timely releases with known issues vs. rare
> > issue-free releases
> >
> > Hi,
> >
> > the current saga of integration tests that never fully complete
> > reminded me that we never actually had a discussion about what our bar
> > for a release is.
> >
> >
> > Our informal agreement right now seems to be that we want all the
> > integration tests to finish all the time on our CI servers. I admire
> > our dedication to high-quality releases, and don't want to distract from
> > it. Over time, we must fix all of those issues and strive to provide the
> > most stable software we are capable of.
> >
> > At the same time, I haven't had a test failure on any of my own
> > machines when reviewing pull requests in a very long time. Which makes
> > me wonder whether our CI servers just set us up for failure. I believe
> > the failures on the CI servers are real, and point to interesting edge
> > cases in REEF we haven't fully solved. Hence, we absolutely must
> > investigate and fix them.
> >
> > However, I am not sure whether this needs to happen in a way that
> > blocks the next release, because there is a competing interest in
> > timely releases.
> > Our last actual release, 0.15, was in May of 2016*.
> > At that time, we did not have IMRU, we did not have working group
> > communications, and a bunch of bug fixes were not in yet. Our current
> > `master` is all around better than that release. Hence, we'd do our
> > users a service by making a release with a `known issues` section in the
> > release notes. This also would help us get feedback on the current
> > code from actual users, as many won't (and shouldn't) use a developer
> > version.
> >
> > In summary, we are faced with two opposing goals: (1) Fixing all the
> > known issues before a release to make that release the best it can be
> > and (2) releasing frequently to get our latest fixes and features out.
> >
> > What do you think? Which approach would you like us to follow?
> >
> > Thanks!
> >
> > Markus
> >
> >
> > *: Let's just not talk about the disaster that is 0.15.1
> >
>
>
>
> --
> Byung-Gon Chun
>



-- 
Byung-Gon Chun
