cassandra-dev mailing list archives

From Robert Stupp <sn...@snazy.de>
Subject Re: 3.0 and the Cassandra release process
Date Wed, 18 Mar 2015 08:49:49 GMT
+1

I also appreciate Ariel’s effort. The improved CI integration is great: being able to run
a large number of tests on different platforms against one's development branch is a huge improvement.


> On 17.03.2015 at 22:06, Jonathan Ellis <jbellis@gmail.com> wrote:
> 
> Cassandra 2.1 was released in September, which means that if we were on
> track with our stated goal of six month releases, 3.0 would be done about
> now.  Instead, we haven't even delivered a beta.  The immediate cause this
> time is blocking for 8099
> <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the reality is
> that nobody should really be surprised.  Something always comes up -- we've
> averaged about nine months since 1.0, with 2.1 taking an entire year.
> 
> We could make theory align with reality by acknowledging, "if nine months
> is our 'natural' release schedule, then so be it."  But I think we can do
> better.
> 
> Broadly speaking, we have two constituencies with Cassandra releases:
> 
> First, we have the users who are building or porting an application on
> Cassandra.  These users want the newest features to make their job easier.
> If 2.1.0 has a few bugs, it's not the end of the world.  They have time to
> wait for 2.1.x to stabilize while they write their code.  They would like
> to see us deliver on our six month schedule or even faster.
> 
> Second, we have the users who have an application in production.  These
> users, or their bosses, want Cassandra to be as stable as possible.
> Assuming they deploy on a stable release like 2.0.12, they don't want to
> touch it.  They would like to see us release *less* often.  (Because that
> means they have to do fewer upgrades while remaining in our backwards
> compatibility window.)
> 
> With our current "big release every X months" model, these users' needs are
> in tension.
> 
> We discussed this six months ago, and ended up with this:
> 
>> What if we tried a [four month] release cycle, BUT we would guarantee that
>> you could do a rolling upgrade until we bump the supermajor version? So 2.0
>> could upgrade to 3.0 without having to go through 2.1.  (But to go to 3.1
>> or 4.0 you would have to go through 3.0.)
>> 
> 
> Crucially, I added
> 
>> Whether this is reasonable depends on how fast we can stabilize releases.
>> 2.1.0 will be a good test of this.
>> 
> 
> Unfortunately, even after DataStax hired half a dozen full-time test
> engineers, 2.1.0 continued the proud tradition of being unready for
> production use, with "wait for .5 before upgrading" once again looking like
> a good guideline.
> 
> I’m starting to think that the entire model of “write a bunch of new
> features all at once and then try to stabilize it for release” is broken.
> We’ve been trying that for years and empirically speaking the evidence is
> that it just doesn’t work, either from a stability standpoint or even just
> shipping on time.
> 
> A big reason that it takes us so long to stabilize new releases now is
> that, because our major release cycle is so long, it’s super tempting to
> slip in “just one” new feature into bugfix releases, and I’m as guilty of
> that as anyone.
> 
> For similar reasons, it’s difficult to do a meaningful freeze with big
> feature releases.  A look at 3.0 shows why: we have 8099 coming, but we
> also have significant work done (but not finished) on 6230, 7970, 6696, and
> 6477, all of which are meaningful improvements that address demonstrated
> user pain.  So if we keep doing what we’ve been doing, our choices are to
> either delay 3.0 further while we finish and stabilize these, or we wait
> nine months to a year for the next release.  Either way, one of our
> constituencies gets disappointed.
> 
> So, I’d like to try something different.  I think we were on the right
> track with shorter releases with more compatibility.  But I’d like to throw
> in a twist.  Intel cuts down on risk with a “tick-tock” schedule for new
> architectures and process shrinks instead of trying to do both at once.  We
> can do something similar here:
> 
> One month releases.  Period.  If it’s not done, it can wait.
> *Every other release only accepts bug fixes.*
> 
> By themselves, one-month releases are going to dramatically reduce the
> complexity of testing and debugging new releases -- and bugs that do slip
> past us will only affect a smaller percentage of users, avoiding the “big
> release has a bunch of bugs no one has seen before and pretty much everyone
> is hit by something” scenario.  But by adding in the second rule, I think
> we have a real chance to make a quantum leap here: stable, production-ready
> releases every two months.
> 
> So here is my proposal for 3.0:
> 
> We’re just about ready to start serious review of 8099.  When that’s done,
> we branch 3.0 and cut a beta and then release candidates.  Whatever isn’t
> done by then has to wait; unlike prior betas, we will only accept bug
> fixes into 3.0 after branching.
> 
> One month after 3.0, we will ship 3.1 (with new features).  At the same
> time, we will branch 3.2.  New features in trunk will go into 3.3.  The 3.2
> branch will only get bug fixes.  We will maintain backwards compatibility
> for all of 3.x; eventually (no less than a year) we will pick a release to
> be 4.0, and drop deprecated features and old backwards compatibilities.
> Otherwise there will be nothing special about the 4.0 designation.  (Note
> that with an “odd releases have new features, even releases only have bug
> fixes” policy, 4.0 will actually be *more* stable than 3.11.)
> 
> Larger features can continue to be developed in separate branches, the way
> 8099 is being worked on today, and committed to trunk when ready.  So this
> is not saying that we are limited only to features we can build in a single
> month.
> 
> Some things will have to change with our dev process, for the better.  In
> particular, with one month to commit new features, we don’t have room for
> committing sloppy work and stabilizing it later.  Trunk has to be stable at
> all times.  I asked Ariel Weisberg to put together his thoughts separately
> on what worked for his team at VoltDB, and how we can apply that to
> Cassandra -- see his email from Friday <http://bit.ly/1MHaOKX>.  (TLDR:
> Redefine “done” to include automated tests.  Infrastructure to run tests
> against github branches before merging to trunk.  A new test harness for
> long-running regression tests.)
> 
> I’m optimistic that as we improve our process this way, our even releases
> will become increasingly stable.  If so, we can skip sub-minor releases
> (3.2.x) entirely, and focus on keeping the release train moving.  In the
> meantime, we will continue delivering 2.1.x stability releases.
> 
> This won’t be an entirely smooth transition.  In particular, you will have
> noticed that 3.1 will get more than a month’s worth of new features while
> we stabilize 3.0 as the last of the old way of doing things, so some
> patience is in order as we try this out.  By 3.4 and 3.6 later this year we
> should have a good idea if this is working, and we can make adjustments as
> warranted.
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced

—
Robert Stupp
@snazy

