reef-dev mailing list archives

From Mariia Mykhailova <mamyk...@microsoft.com.INVALID>
Subject RE: [Discuss] Timely releases with known issues vs. rare issue-free releases
Date Wed, 12 Apr 2017 03:39:52 GMT
First, we should note that 0.16 is not the first release preceded by a rather long bug hunt
in our code; we spent at least a month fixing bugs for each of 0.15 and 0.14. We tend to prioritize
work on actual features between releases, and we either start investigating known test failures
only right before the release, or find bugs only when we start testing a release candidate on various
platforms/machines/environments. During the previous release discussions we used to agree
that we shouldn’t release code with known reproducible test failures.

Second, the fact that the test failure only happens in CI (as opposed to our dev boxes) doesn't
imply that this isn't an actual bug in our code, that it can't happen in production or that
it won't be bad if it happens. We have a history of actual nasty bugs in code which we discovered
only using CI servers, and they were very much not obvious (which is why it takes so long
to track them down). Two of the bugs which are the freshest in my memory were related to a REEF
job never terminating; I don't think doing a release with a known issue "REEF job sometimes
never terminates" (that will stay a known issue for at least half a year) is doing a service
to our users.

On the other hand, the nature of our CI servers is such that occasional test failures are
inevitable. I've observed a lot of one-time transient test failures which I didn't consider
JIRA-worthy because they were obviously caused by some resource issue on the CI server. Besides,
our .NET tests, especially IMRU ones, deal with a lot of concurrency and may sometimes not
account for certain valid sequences of events.

Right now we have a whole list of problems:
1. Test failures don't necessarily indicate a bug in the system, but they might. 
2. We don't have sufficient understanding of when the failures indicate a serious bug and
when they don't.
3. The tests we run in CI don't reflect the actual usage patterns of REEF in production (we
only run tests on the local runtime, not on YARN).
4. We don't have sufficient motivation to debug until we approach a release.
5. When we approach a release, we have motivation to mark any failures as known issues and
proceed :-)

I guess we should balance our belief that the test failures are benign, or too infrequent
to cause problems in production, against our need to have some motivation for investigating
them (if we always dismiss them as transient CI failures, we'll be missing actual bugs).

-Mariia

-----Original Message-----
From: Taegeon Um [mailto:taegeonum@gmail.com] 
Sent: Tuesday, April 11, 2017 6:37 PM
To: dev@reef.apache.org
Subject: Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Hi,

I totally agree with Markus on timely releases and making a release with a `known issues`
section.
If the test failures are rare, it would be good to note down the known issues and move forward
to the release.

My main concern is: until when do we keep them as known issues? We cannot keep
them as known issues forever.
I think we need at least a cut-off date. For example, if we find transient failures in the 0.16
snapshot, should we resolve them by the next release (0.17)?

Thanks,
Taegeon


On Apr 12, 2017, at 6:09 AM, "Byung-Gon Chun" <bgchun@gmail.com> wrote:

I understand this concern. It's been about a month since we started to talk about release
0.16. Also, our last release occurred a long time ago.

Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure issues for release
0.16. Since we heard from Julia, I hope to hear from Mariia, Taegeon, and Sergiy as well.
What're your thoughts?



On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <Qiuhe.Wang@microsoft.com.invalid>
wrote:

> I totally agree with Markus's comments.
>
> If we have test failures that show some bugs in the system and impact 
> the quality of the code, we should resolve them before a release.
>
> The few current transient test failures only happen on AppVeyor. They could be 
> test issues that hit some edge scenarios, most likely related to 
> the timing of events. I fixed some of them a couple of weeks 
> ago: we didn't dispose the active context in one of the tests (it 
> was a bug in the test), the validation condition in a test was too strict 
> because events may be received in a different sequence, we depended on log 
> messages in test handlers which may receive events later than the driver does, etc.
>
> The failing tests may be just test issues, or they may imply some system 
> issues; we don't know yet. But as long as there is no obvious defect 
> in the current code base, I would think it should not block the release.
>
> Thanks,
> Julia
>
> -----Original Message-----
> From: Markus Weimer [mailto:markus@weimo.de]
> Sent: Tuesday, April 11, 2017 9:26 AM
> To: REEF Developers Mailinglist <dev@reef.apache.org>
> Subject: [Discuss] Timely releases with known issues vs. rare 
> issue-free releases
>
> Hi,
>
> the current saga of integration tests never fully completing 
> reminded me that we never actually had a discussion about what our bar 
> for a release is.
>
>
> Our informal agreement right now seems to be that we want all the 
> integration tests to finish all the time on our CI servers. I admire 
> our dedication to high-quality releases, and don't want to distract from it.
> Over time, we must fix all of those issues and strive to provide the 
> most stable software we are capable of.
>
> At the same time, I haven't had a test failure on any of my own 
> machines when reviewing pull requests in a very long time. Which makes 
> me wonder whether our CI servers just set us up for failure. I believe 
> the failures on the CI servers are real, and point to interesting edge 
> cases in REEF we haven't fully solved. Hence, we absolutely must 
> investigate and fix them.
>
> However, I am not sure whether this needs to happen in a way that 
> blocks the next release, because there is a competing interest in 
> timely releases.
> Our last actual release, 0.15, was in May of 2016*.
> At that time, we did not have IMRU, we did not have working group 
> communications, and a bunch of bug fixes were not in yet. Our current 
> `master` is all around better than that release. Hence, we'd do our 
> users a service by making a release with a `known issues` section in the 
> release notes. This also would help us get feedback on the current 
> code from actual users, as many won't (and shouldn't) use a developer 
> version.
>
> In summary, we are faced with two opposing goals: (1) Fixing all the 
> known issues before a release to make that release the best it can be 
> and (2) releasing frequently to get our latest fixes and features out.
>
> What do you think? Which approach would you like us to follow?
>
> Thanks!
>
> Markus
>
>
> *: Let's just not talk about the disaster that is 0.15.1
>



--
Byung-Gon Chun