cassandra-dev mailing list archives

From Jonathan Haddad <...@jonhaddad.com>
Subject Re: Proposal - 3.5.1
Date Thu, 15 Sep 2016 18:59:07 GMT
I don't think it's binary - we don't have to do year-long insanity or
bleeding-edge craziness.

How about a release every 3 months, with each release accepting 6 months of
patches?  (oldstable & newstable)  Also provide nightly builds & stick to
the idea of stable trunk.

The issue is the number of bug fixes a given release gets.  One bug fix
release for a new feature is just terrible.  The community as a whole
despises this system, and it is lowering confidence in the project.

Jon


On Thu, Sep 15, 2016 at 11:48 AM Jake Luciani <jakers@gmail.com> wrote:

> I'm pretty sure everyone will agree Tick-Tock didn't go well and needs to
> change.
>
> The problem for me is going back to the old way doesn't sound great. There
> are parts of tick-tock I really like,
> for example, the cadence and limited scope per release.
>
> I know at the summit there were a lot of ideas thrown around I can
> regurgitate but perhaps people
> who have been thinking about this would like to chime in and present ideas?
>
> -Jake
>
> On Thu, Sep 15, 2016 at 2:28 PM, Benedict Elliott Smith
> <benedict@apache.org> wrote:
>
> > I agree tick-tock is a failure.  But for two reasons IMO:
> >
> > 1) Ultimately, the users are the real testers and it takes a while for a
> > release to percolate into the wild for feedback.  The reality is that a
> > release doesn't have its tires properly kicked for at least three months
> > after it's cut.  So if we are to have any tocks, they should be completely
> > unwed from the ticks, and should probably happen on a ~3M cadence to keep
> > the labour down but the utility up (and there should probably still be more
> > than one tock per tick)
> >
> > 2) Those promised resources to improve process never happened.  We haven't
> > even reached parity with the 2.1 release until very recently, i.e. no
> > failing u/dtests.
> >
> >
> > On 15 September 2016 at 19:08, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
> > wrote:
> >
> > > I know we’ve got a lot of folks following the dev list without a lot of
> > > background, so let’s make sure we get some context here so everyone can be
> > > on the same page.
> > >
> > > Going to preface this wall of text by saying I’m +1 on a 3.5.1 (and 3.3.1,
> > > etc) if it’s done AFTER 3.9 (I think we need to get 3.9 out first before
> > > the RE manpower is spent on backporting fixes, even critical fixes, because
> > > 3.9 has multiple critical fixes for people running 3.7).
> > >
> > > Now some background:
> > >
> > > For many years, Cassandra used to have a dev process that kept 3 active
> > > branches - “bleeding edge”, a “stable”, and an “old stable” branch, where
> > > developers would be committing ALL new contributions to the bleeding edge,
> > > non-api-breaking changes to stable, and bugfixes only to old stable. While
> > > the api changed and major features were added, that bleeding edge would
> > > just be ‘trunk’, and it’d get cut into a major version when it was ready to
> > > ship. We saw that with 2.2 / 2.1 / 2.0 (and before that, 2.1 / 2.0 / 1.2,
> > > and before that 2.0 / 1.2 / 1.1 ). When that bleeding edge got released as
> > > a major x.y.0, the third, oldest, most stable branch went EOL, and new
> > > features would go into trunk for the next major version.
> > >
> > > There were two big negatives observed with this:
> > >
> > > The first big negative is that if multiple major new features were in
> > > flight, releases were prone to delay. Nobody wants to break an API on a
> > > x.y.1 release, and nobody wants to add a new feature to a x.y.2 release, so
> > > the project would delay the x.y releases if major features were close, and
> > > then there’d be pressure to slip them in before they were fully tested, or
> > > cut features to avoid delaying the release. This pressure was observed to
> > > be bad for the project – it forced technical compromises.
> > >
> > > The second downside that was observed was that nobody would try to run the
> > > new versions when they launched, because they were buggy because they were
> > > filled with new features. 2.2, for example, introduced RBAC, commitlog
> > > compression, and user defined functions – major features that needed to be
> > > tested. Unfortunately, because there were few real-world testers, there
> > > were still major bugs being found for months – the first production-ready
> > > version of 2.2 is probably in the 2.2.5 or 2.2.6 range.
> > >
> > > For version 3, we moved to an alternate release model, based on Intel’s
> > > tick/tock https://en.wikipedia.org/wiki/Tick-Tock_model
> > >
> > > The intention was to allow new features into 3.even releases (3.0, 3.2,
> > > 3.4, 3.6, and so on), with bugfixes in 3.odd releases (3.1, … ). The hope
> > > was to allow more frequent releases to address the first big negative
> > > (flood of new features that blocked releases), while also helping to
> > > address the second – with fewer major features in a release, they get
> > > more/better test coverage.
> > >
> > > In the tick/tock model, anyone running 3.odd (like 3.5) should be looking
> > > for bugfixes in 3.7. It’s certainly true that 3.5 is horribly broken (as is
> > > 3.3, and 3.4, etc), but with this release model, the bugfix SHOULD BE in
> > > 3.7. As I mentioned previously, we have precedent for backporting critical
> > > fixes, but we don’t have a well defined bar (that I see) for what’s
> > > critical enough for a backport.
> > >
> > > What Jon is noting (and what many of us who run Cassandra in production have
> > > really known for a very long time) is that nobody wants to run 3.newest
> > > (even or odd), because 3.newest is likely broken (because it’s a complex
> > > distributed database, and testing is hard, and it takes time and complex
> > > workloads to find bugs). In the tick/tock model, because new features went
> > > into 3.6, 3.7 contains new features that may not be adequately
> > > tested/validated; a user of 3.5 doesn’t want them, and isn’t willing to
> > > accept the risk.
> > >
> > > The bottom line here is that tick/tock is probably a well intentioned but
> > > failed attempt to bring stability to Cassandra’s releases. The problems
> > > tick/tock was meant to solve are real problems, but tick/tock doesn’t seem
> > > to be addressing them – new features invalidate old testing, which makes it
> > > difficult/impossible for real users to sit on the 3.odd versions.
> > >
> > > We’re due for cutting 3.9 and 3.0.9, and we have limited RE manpower to
> > > get those out. Only after those are out would I be +1 on a 3.5.1, and then
> > > only because if I were running 3.5, and I hit this bug, I wouldn’t want to
> > > spend the ~$100k it would cost my organization to validate 3.7 prior to
> > > upgrading, and I don’t think it’s reasonable to ask users to recompile a
> > > release for a ~10 line fix for a very nasty bug.
> > >
> > > I also very strongly recommend we (committers/PMC) reconsider tick/tock
> > > for 4.x releases, because this is exactly the type of problem that will
> > > continue to happen as we move forward. I suggest that we either need to go
> > > back to the old model and do a better job of dealing with feature creep and
> > > testing, or we need to better define what gets backported, because the
> > > community needs a stable version to run, and running the latest odd release
> > > of tick/tock isn’t it.
> > >
> > > - Jeff
> > >
> > >
> > > On 9/15/16, 10:31 AM, "dave_lester@apple.com on behalf of Dave Lester"
> > > <dave_lester@apple.com> wrote:
> > >
> > > >How would cutting a 3.5.1 release possibly confuse users of the software?
> > > >It would be easy to document the change and to send release notes.
> > > >
> > > >Given the bug’s critical nature and that it's a minor fix, I’m +1
> > > >(non-binding) to a new release.
> > > >
> > > >Dave
> > > >
> > > >> On Sep 15, 2016, at 7:18 AM, Jeremiah D Jordan
> > > >> <jeremiah.jordan@gmail.com> wrote:
> > > >>
> > > >> I’m with Jeff on this, 3.7 (bug fixes on 3.6) has already been released
> > > >> with the fix.  Since the fix applies cleanly anyone is free to put it on
> > > >> top of 3.5 on their own if they like, but I see no reason to put out a
> > > >> 3.5.1 right now and confuse people further.
> > > >>
> > > >> -Jeremiah
> > > >>
> > > >>
> > > >>> On Sep 15, 2016, at 9:07 AM, Jonathan Haddad <jon@jonhaddad.com>
> > > >>> wrote:
> > > >>>
> > > >>> As a follow up, I suppose I'm only advocating for a fix to the odd
> > > >>> releases.  Sadly, Tick Tock versioning is misleading.
> > > >>>
> > > >>> If tick tock were to continue (and I'm very much against how it currently
> > > >>> works) the whole even-features odd-fixes thing needs to stop ASAP, all it
> > > >>> does is confuse people.
> > > >>>
> > > >>> The follow up to 3.4 (3.5) should have been 3.4.1, following semver, so
> > > >>> people know it's bug fixes only to 3.4.
> > > >>>
> > > >>> Jon
> > > >>>
> > > >>> On Wed, Sep 14, 2016 at 10:37 PM Jonathan Haddad <jon@jonhaddad.com>
> > > >>> wrote:
> > > >>>
> > > >>>> In this particular case, I'd say adding a bug fix release for every
> > > >>>> version that's affected would be the right thing.  The issue is so easily
> > > >>>> reproducible and will likely result in massive data loss for anyone on 3.X
> > > >>>> WHERE X < 6 and uses the "date" type.
> > > >>>>
> > > >>>> This is how easy it is to reproduce:
> > > >>>>
> > > >>>> 1. Start Cassandra 3.5
> > > >>>> 2. create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> > > >>>> 3. use test;
> > > >>>> 4. create table fail (id int primary key, d date);
> > > >>>> 5. delete d from fail where id = 1;
> > > >>>> 6. Stop Cassandra
> > > >>>> 7. Start Cassandra
> > > >>>>
> > > >>>> You will get this, and startup will fail:
> > > >>>>
> > > >>>> ERROR 05:32:09 Exiting due to error while processing commit log during
> > > >>>> initialization.
> > > >>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> > > >>>> Unexpected error deserializing mutation; saved to
> > > >>>> /var/folders/0l/g2p6cnyd5kx_1wkl83nd3y4r0000gn/T/mutation6313332720566971713dat.
> > > >>>> This may be caused by replaying a mutation against a table with the same
> > > >>>> name but incompatible schema.  Exception follows:
> > > >>>> org.apache.cassandra.serializers.MarshalException: Expected 4 byte long for
> > > >>>> date (0)
> > > >>>>
> > > >>>> I mean.. come on.  It's an easy fix.  It cleanly merges against 3.5 (and
> > > >>>> probably the other releases) and requires very little investment from
> > > >>>> anyone.
> > > >>>>
> > > >>>>
> > > >>>> On Wed, Sep 14, 2016 at 9:40 PM Jeff Jirsa <jeff.jirsa@crowdstrike.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> We did 3.1.1 and 3.2.1, so there’s SOME precedent for emergency fixes,
> > > >>>>> but we certainly didn’t/won’t go back and cut new releases from every
> > > >>>>> branch for every critical bug in future releases, so I think we need to
> > > >>>>> draw the line somewhere. If it’s fixed in 3.7 and 3.0.x (x >= 6), it seems
> > > >>>>> like you’ve got options (either stay on the tick and go up to 3.7, or bail
> > > >>>>> down to 3.0.x)
> > > >>>>>
> > > >>>>> Perhaps, though, this highlights the fact that tick/tock may not be the
> > > >>>>> best option long term. We’ve tried it for a year, perhaps we should instead
> > > >>>>> discuss whether or not it should continue, or if there’s another process
> > > >>>>> that gives us a better way to get useful patches into versions people are
> > > >>>>> willing to run in production.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On 9/14/16, 8:55 PM, "Jonathan Haddad" <jon@jonhaddad.com> wrote:
> > > >>>>>
> > > >>>>>> Common sense is what prevents someone from upgrading to yet another
> > > >>>>>> completely unknown version with new features which have probably broken
> > > >>>>>> even more stuff that nobody is aware of.  The folks I'm helping right now
> > > >>>>>> deployed 3.5 when they got started because http://cassandra.apache.org
> > > >>>>>> suggests it's acceptable for production.  It turns out using 4 of the
> > > >>>>>> built in datatypes of the database results in the server being unable to
> > > >>>>>> restart without clearing out the commit logs and running a repair.  That
> > > >>>>>> screams critical to me.  You shouldn't even be able to install 3.5 without
> > > >>>>>> the patch I've supplied - that bug is a ticking time bomb for anyone that
> > > >>>>>> installs it.
> > > >>>>>>
> > > >>>>>> On Wed, Sep 14, 2016 at 8:12 PM Michael Shuler <michael@pbandjelly.org>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> What's preventing the use of the 3.6 or 3.7 releases where this bug is
> > > >>>>>>> already fixed? This is also fixed in the 3.0.6/7/8 releases.
> > > >>>>>>>
> > > >>>>>>> Michael
> > > >>>>>>>
> > > >>>>>>> On 09/14/2016 08:30 PM, Jonathan Haddad wrote:
> > > >>>>>>>> Unfortunately CASSANDRA-11618 was fixed in 3.6 but was not back ported to
> > > >>>>>>>> 3.5 as well, and it makes Cassandra effectively unusable if someone is
> > > >>>>>>>> using any of the 4 types affected in any of their schema.
> > > >>>>>>>>
> > > >>>>>>>> I have cherry picked & merged the patch back to here and will put it in a
> > > >>>>>>>> JIRA as well tonight, I just wanted to get the ball rolling asap on this.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> https://github.com/rustyrazorblade/cassandra/tree/fix_commitlog_exception
> > > >>>>>>>>
> > > >>>>>>>> Jon
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>
> > > >
> > >
> >
>
>
>
> --
> http://twitter.com/tjake
>
