incubator-general mailing list archives

From Henry Robinson <he...@apache.org>
Subject Re: Impala commit policy
Date Thu, 03 Dec 2015 18:20:23 GMT
On 2 December 2015 at 23:04, Greg Stein <gstein@gmail.com> wrote:

> On Wed, Dec 2, 2015 at 8:50 PM, Julian Hyde <jhyde@apache.org> wrote:
>
> > Thanks, Roman. For the record, I don’t plan to contribute to Impala or
> > Kudu, and I don’t like strict commit policies such as RTC. But I wanted
> > to stand up for “states' rights”, the right of podlings and projects to
> > determine their own processes and cultures.
> >
>
> LOL ... being a Texan, I can certainly get on board with the notion of
> states' rights :-P
>
> But I caution: as I said else-thread, we use the Incubation process because
> we believe the podling needs to *learn* how we like communities to operate.
> Peer respect, inclusivity, open dialog, consensus, etc. By definition, the
> podling is unable to make these decisions within the guides and desires of
> the Foundation. If we trusted them to do so, then we'd just make them a TLP
> and skip incubation.
>
> Josh puts it well:
>
> On Thu, Dec 3, 2015 at 12:26 AM, Josh Elser <elserj@apache.org> wrote:
> >...
>
> > +1 I'm not entirely sold on saying they have no explicit policy up
> > front (I'd be worried about that causing confusion -- the project will
> > operate how they're comfortable operating), but I'd definitely want to
> > see that a _real_ discussion is had after the podling gets on its feet
> > and grows beyond the initial membership.
> >
>
> I'd like to see podlings have enough diversity and independence from the
> initial PPMC, to have such a discussion. My fear is that RTC holds back
> growing the diversity of opinion, and that status quo will not allow for
> moving away from Gerrit.
>
> ...
>
> I will also note that one of the primary reasons explained for RTC is "the
> code is too complex to allow for unreviewed changes to be applied". Has
> that basis been justified for Impala? Are we talking data loss? Nope. It's
> a layer over the data substrate. Synchronization across a cluster? Nah.
> Where is the complexity?
>

I'm happy to field technical questions about Impala. You seem to be
conflating 'complexity' with 'severity of potential bugs' - I see the two
as separate.

Under the 'severity' heading, Impala both reads data from and writes data
to a variety of data stores. So if there's a bug in Impala's write path,
data can be lost. And because Impala also returns results to client
applications, there's a significant risk of business impact if the *wrong*
results are returned. I know, because I have dealt with situations where
this has happened, and no-one is very happy about it. Our customers
typically run business-critical analytic workloads through Impala; if it
stops working correctly that's usually a big problem.

As far as 'complexity' goes, I make no comparative claims about Impala's
complexity vs any other project. But to give some indication of the moving
parts inside Impala: there's a component which compiles highly optimised
versions of each query operator at run time; there's a query planner which
parses and plans a large portion of the SQL standard; there's the
complexity of being a 'massively' distributed system (many deployments run
in the high 100s of nodes), with the coordination and consistency
guarantees that entails; and there's the complexity of running highly
concurrent workloads in a single process, with all the concurrency
headaches that can bring. That's not to mention implementations of
'standard' SQL operators like joins, sorts and so on that are still the
subject of active research in academia and industry.

All this is in the context of Impala's main differentiator, which is that
it is amongst the very fastest SQL engines for data stored in HDFS and
friends. That means that small changes can have large unexpected
consequences, since efficiency is a subtle and capricious thing. It has
therefore always helped us to have more than one set of eyes on every
change, to reduce the probability of introducing subtle performance and
functional regressions. Automated testing plays a huge role here as well,
but for us it's been most effective in concert with code review.

(There are other reasons I vastly prefer RTC as well, but I'm addressing
your specific points here so as not to kick off another RTCvsCTR thread :)).



>
> In this case, the RTC seems to stem from the choice of Gerrit, rather than
> some innate complexity.
>
>

Gerrit does not mandate RTC, since you can just push to refs/heads/<branch>
and bypass the review creation step.
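
To make the distinction concrete, here is a minimal sketch of the two push
targets. It assumes a remote named "origin" and a branch named "master",
both placeholders; the helper push_command is hypothetical and only builds
the command line. A direct push succeeds only if the project's Gerrit
access controls grant Push on refs/heads/*.

```python
from typing import List


def push_command(branch: str, review: bool, remote: str = "origin") -> List[str]:
    """Build the git push command for a Gerrit remote (hypothetical helper)."""
    # refs/for/<branch> asks Gerrit to create a change for review;
    # refs/heads/<branch> updates the branch directly, with no review step.
    namespace = "refs/for" if review else "refs/heads"
    return ["git", "push", remote, f"HEAD:{namespace}/{branch}"]


# Print both variants so the difference in the target ref is visible.
print(" ".join(push_command("master", review=True)))
print(" ".join(push_command("master", review=False)))
```

So whether a project is RTC or CTR comes down to which of these two targets
its permissions allow, not to the tool itself.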

Historically, the Impala team at Cloudera has used at least three different
review tools (including Review Board, which is used elsewhere at the ASF).
The choice of review tool stems completely from pragmatism - we really did
not like Review Board, and briefly used Rietveld before moving to Gerrit,
which we have preferred. At every step, we used RTC.

Henry



> I *do* note that possibly committers could choose to commit directly, or
> choose to use Gerrit when they are unsure. Will the (P)PMC allow those
> direct commits? Or mandate Gerrit for every commit?
>
> Cheers,
> -g
>
