hive-dev mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: [DISCUSS] Supporting Hadoop-1 and experimental features
Date Mon, 25 May 2015 05:50:23 GMT
"Same goes for stuff like MR; supporting it, esp. for perf work, becomes a
burden, and it’s outdated with 2 alternatives, one of which has been
around for 2 releases."

I am not trying to pick on your words here but I want to acknowledge
something.

"Been around for 2 releases" means less to people than you would think.
Many users are locked in by when their distribution chooses to cut a
release. As it turns out, there are two major distributions, and one of
them does pretty much nothing to support Tez. Here is what "around
for two releases" means for a CDH user:

http://search-hadoop.com/m/8er9RFVSf2&subj=Re+Getting+Tez+working+against+cdh+5+3

After much hacking with a rather new CDH version I was actually unable to
get the alternative running.

The other alternative, which I presume to mean Hive on Spark, probably
has not shipped in many distributions either. I do not think either
"alternative" has much real-world battlefield experience.

The reality is a normal user has to test a series of processes before they
can pull the trigger on an upgrade. For example, I used to work at an adtech
company. Hive added a feature called "Exchange partitions". This actually
broke a number of our processes because we use the word "exchange" all the
time. Once it became a keyword, many of our scripts broke. This is not a
fault of hive or the feature, but it is just a fact that no one wants to
touch and test big lumbering ETL processes (even with lightning fast sexy
engines) five times a year.

I mentioned this before but I want to repeat it. Hive was "releasable trunk"
for a long time and it served users well. We never had 2-4 feature
branches. One binary dropped on top of hadoop 17, 20, 21, 203 and 2.0. If we
get in a situation where all the "old users" "don't care about new
features" we can easily land in a situation where our actual users are
running the "old" hadoop, unable to upgrade to the "hive with the new
features" because it requires dependencies < 2 months old not ported to
their distribution yet. As a user I am already starting to see this, where
the distributions fall behind hive because a point upgrade is not compelling
for the distributor.

On Fri, May 22, 2015 at 4:19 PM, Alan Gates <alanfgates@gmail.com> wrote:

> I agree with *All* features with the exception that some features might be
> branch-1 specific (if it's a feature on something no longer supported in
> master, like hadoop-1).  Without this we prevent new features for older
> technology, which doesn't strike me as reasonable.
>
> I see your point on saying the contributor may not understand where best
> to put the patch, and thus the committer decides.  However, it would be
> very disappointing for a contributor who uses branch-1 to build a new
> feature only to have the committer put it only in master.  So I would
> modify your modification to say "at the discretion of the contributor and
> Hive committers".
>
> Alan.
>
>   kulkarni.swarnim@gmail.com
>  May 22, 2015 at 11:41
> +1 on the new proposal. Feedback below:
>
> > New features must be put into master.  Whether to put them into
> branch-1 is at the discretion of the developer.
>
> How about we change this to "*All* features must be put into master.
> Whether to put them into branch-1 is at the discretion of the *committer*."
> The reason, I think, is that going forward, for us to sustain a happy and
> healthy community, it's imperative to make things easy not only for the
> users, but also for developers and committers to contribute/commit patches.
> As a hive contributor, I would find it hard to determine which branch my
> code belongs in. Also IMO (and I might be wrong) many committers have their
> own areas of expertise and it's also very hard for them to immediately
> determine what branch a patch should go to unless very well documented
> somewhere. Putting all code into the master would be an easy approach to
> follow and then cherry picking to other branches can be done. So even if
> people forget to do that, we can always go back to master and port the
> patches out to these branches. So we have a master branch, a branch-1 for
> stable code, branch-2 for experimental and "bleeding edge" code and so on.
> Once branch-2 is stable, we deprecate branch-1, create branch-3 and move on.
>
> Another reason I say this is because in my experience, a pretty
> significant amount of work in hive is still bug fixes and I think that is
> what the user cares most about (correctness above anything else). So with
> this approach, it might be very obvious which branches to commit this to.
>
>
>
>
> --
> Swarnim
>    Chris Drome <cdrome@yahoo-inc.com.INVALID>
>  May 22, 2015 at 0:49
> I understand the motivation and benefits of creating a branch-2 where more
> disruptive work can go on without affecting branch-1. While not necessarily
> against this approach, from Yahoo's standpoint, I do have some questions
> (concerns).
> Upgrading to a new version of Hive requires a significant commitment of
> time and resources to stabilize and certify a build for deployment to our
> clusters. Given the size of our clusters and scale of datasets, we have to
> be particularly careful about adopting new functionality. However, at the
> same time we are interested in testing and making available new
> features and functionality. That said, we would have to rely on branch-1
> for the immediate future.
> One concern is that branch-1 would be left to stagnate, at which point
> there would be no option but for users to move to branch-2 as branch-1
> would be effectively end-of-lifed. I'm not sure how long this would take,
> but it would eventually happen as a direct result of the very reason for
> creating branch-2.
> A related concern is how disruptive the code changes will be in branch-2.
> I imagine that changes early in branch-2 will be easy to backport to
> branch-1, while this effort will become more difficult, if not impractical,
> as time goes on. If the code bases diverge too much then this could lead to
> more pressure for users of branch-1 to add features just to branch-1, which
> has been mentioned as undesirable. By the same token, backporting any code
> in branch-2 will require an increasing amount of effort, which contributors
> to branch-2 may not be interested in committing to.
> These questions affect us directly because, while we require a certain
> amount of stability, we also like to pull in new functionality that will be
> of value to our users. For example, our current 0.13 release is probably
> closer to 0.14 at this point. Given the lifespan of a release, it is often
> more palatable to backport features and bugfixes than to jump to a new
> version.
>
> The good thing about this proposal is the opportunity to evaluate and
> clean up a lot of the old code.
> Thanks,
> chris
>
>
>
> On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin
> <sergey@hortonworks.com> wrote:
>
>
> Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
> people are set in their ways or have practical considerations and don’t
> care for new shiny stuff.
>
>
>
>
>
>   Sergey Shelukhin <sergey@hortonworks.com>
>  May 18, 2015 at 11:46
> I think we need some path for deprecating old Hadoop versions, the same
> way we deprecate old Java version support or old RDBMS version support.
> At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
> goes for stuff like MR; supporting it, esp. for perf work, becomes a
> burden, and it’s outdated with 2 alternatives, one of which has been
> around for 2 releases.
> The branches are a graceful way to get rid of the legacy burden.
>
> Alternatively, when sweeping changes are made, we can do what Hbase did
> (which is not pretty imho), where 0.94 version had ~30 dot releases
> because people cannot upgrade to 0.96 “singularity” release.
>
>
> I posit that people who run Hadoop 1 and MR in this day and age (and more
> so as time passes) are people who don’t care about perf and new
> features, only stability; so, a stability-focused branch would be perfect to
> support them.
>
>
>
>   Edward Capriolo <edlinuxguru@gmail.com>
>  May 18, 2015 at 10:04
> Up until recently Hive supported numerous versions of Hadoop code base with
> a simple shim layer. I would rather we stick to the shim layer. I think
> easily the best part about hive was that a single release worked
> well regardless of your hadoop version. It was also a key element to hive's
> success. I do not want to see us have multiple branches.
>
>
>
