hive-dev mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Tez branch and tez based patches
Date Tue, 30 Jul 2013 04:02:42 GMT
At ~25:00

"There is a working prototype of hive which is using tez as the targeted
runtime"

Can I get a look at that code? Is it on github?

Edward


On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <gates@hortonworks.com> wrote:

> Answers to some of your questions inlined.
>
> Alan.
>
> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
>
> > There are some points I want to bring up. First, I am on the PMC. Here is
> > something I find relevant:
> >
> > http://www.apache.org/foundation/how-it-works.html
> >
> > ------------------------------
> >
> > The role of the PMC from a Foundation perspective is oversight. The main
> > role of the PMC is not code and not coding - but to ensure that all legal
> > issues are addressed, that procedure is followed, and that each and every
> > release is the product of the community as a whole. That is key to our
> > litigation protection mechanisms.
> >
> > Secondly the role of the PMC is to further the long term development and
> > health of the community as a whole, and to ensure that balanced and wide
> > scale peer review and collaboration does happen. Within the ASF we worry
> > about any community which centers around a few individuals who are working
> > virtually uncontested. We believe that this is detrimental to quality,
> > stability, and robustness of both code and long term social structures.
> >
> > --------------------------------
> >
> >
> > https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> >
> > -------------------------------------
> >
> > All other decisions happen on the dev list, discussions on the private list
> > are kept to a minimum.
> >
> > "If it didn't happen on the dev list, it didn't happen" - which leads to:
> >
> > a) Elections of committers and PMC members are published on the dev list
> > once finalized.
> >
> > b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
> > soon as they have impact on the project, code or community.
> > ---------------------------------
> >
> > https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let
> > there be Tez" has not been +1'ed by any committer. It was never discussed on
> > the dev or the user list (as far as I can tell).
>
> As all JIRA creations and updates are sent to dev@hive, creating a JIRA
> is de facto posting to the list.
>
> >
> > As a PMC member I feel we need more discussion on Tez on the dev list along
> > with a wiki-fied design document. Topics of discussion should include:
>
> I talked with Gunther and he's working on posting a design doc on the
> wiki.  He has a PDF on the JIRA but he doesn't have write permissions yet
> on the wiki.
>
> >
> > 1) What is tez?
> In Hadoop 2.0, YARN opens up the ability to have multiple execution
> frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as the
> only execution option.  Tez is an effort to build an execution engine that
> is optimized for relational data processing, such as Hive and Pig.
>
> The biggest change here is to move away from only Map and Reduce as
> processing options and to allow alternate combinations of processing, such
> as map -> reduce -> reduce or tasks that take multiple inputs or shuffles
> that avoid sorting when it isn't needed.
>
> For a good intro to Tez, see Arun's presentation on it at the recent
> Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
> >
> > 2) How is tez different from oozie, http://code.google.com/p/hop/,
> > http://cs.brown.edu/~backman/cmr.html , and other DAG and/or streaming map
> > reduce tools/frameworks? Why should we use this and not those?
>
> Oozie is a completely different thing.  Oozie is a workflow engine and a
> scheduler.  Its core competencies are the ability to coordinate workflows
> of disparate job types (MR, Pig, Hive, etc.) and to schedule them.  It is
> not intended as an execution engine for apps such as Pig and Hive.
>
> I am not familiar with these other engines, but the short answer is that
> Tez is built to work on YARN, which works well for Hive since it is tied to
> Hadoop.
> >
> > 3) When can we expect the first tez release?
> I don't know, but I hope sometime this fall.
>
> >
> > 4) How much effort is involved in integrating hive and tez?
> Covered in the design doc.
>
> >
> > 5) Who is ready to commit to this effort?
> I'll let people speak for themselves on that one.
>
> >
> > 6) can we expect this work to be done in one hive release?
> Unlikely.  Initial integration will be done in one release, but as Tez is
> a new project I expect it will be adding features in the future that Hive
> will want to take advantage of.
>
> >
> > In my opinion we should not start any work on this tez-hive until these
> > questions are answered to the satisfaction of the hive developers.
>
> Can we change this to "not commit patches"?  We can't tell willing people
> not to work on it.
> >
> > On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> >
> >>
> >> >> The Hive bylaws,
> >> >> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what
> >> >> votes are needed for what.  I don't see anything there about needing 3 +1s
> >> >> for a branch.  Branching would seem to fall under code change, which
> >> >> requires one vote and a minimum length of 1 day.
> >>
> >> You could argue that all you need is one +1 to create a branch, but this
> >> is more than a branch. If you are talking about something that is:
> >> 1) going to cause major re-factoring of critical pieces of hive like
> >> ExecDriver and MapRedTask
> >> 2) going to be very disruptive to the efforts of other committers
> >> 3) something that may be a major architectural change
> >>
> >> Getting the project on board with the idea is a good idea.
> >>
> >> Now I want to point something out. Here are some recent initiatives in
> >> hive:
> >>
> >> 1) At one point there was a big initiative to "support oracle". After the
> >> initial work, there are patches in Jira and no one seems to care about
> >> oracle support.
> >> 2) Another such decision was this "support windows" one; there are
> >> probably 4 windows patches waiting for review.
> >> 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
> >> support perspective is, but every couple of weeks we get another jira about
> >> something not working/testing on one of those versions; it seems like
> >> several builds are broken.
> >> 4) Hive storage handler: after the initial implementation no one cares to
> >> review any other storage handler implementation. There are 3 patches there
> >> or more, and I could not even find anyone willing to review the cassandra
> >> storage handler I spent months on.
> >> 5) ORC, Vectorization
> >> 6) Windowing: committed, numerous check-style violations.
> >>
> >> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
> >> are spread very thin, and embarking on another side project not involved
> >> with core hive seems like the wrong direction at the moment.
> >>
> >> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <gates@hortonworks.com> wrote:
> >>
> >>>
> >>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> >>>
> >>>> I have started to see several refactoring patches around tez.
> >>>> https://issues.apache.org/jira/browse/HIVE-4843
> >>>>
> >>>> This is the only mention on the hive list I can find with tez:
> >>>> "Makes sense. I will create the branch soon.
> >>>>
> >>>> Thanks,
> >>>> Ashutosh
> >>>>
> >>>>
> >>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
> >>>> ghagleitner@hortonworks.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660, design
> >>>>> doc has already been uploaded - any feedback will be much appreciated).
> >>>>> This will be a fair amount of work that will take time to stabilize/test.
> >>>>> I'd like to propose creating a branch in order to be able to do this
> >>>>> incrementally and collaboratively. In order to progress rapidly with this,
> >>>>> I would also like to go "commit-then-review".
> >>>>>
> >>>>> Thanks,
> >>>>> Gunther.
> >>>>> "
> >>>>
> >>>> These refactorings are largely destructive to a number of bug fixes and
> >>>> language improvements in hive. Those language improvements and bug fixes
> >>>> have been sitting in Jira for quite some time now, marked patch-available
> >>>> and waiting for review.
> >>>>
> >>>> There are a few things I want to point out:
> >>>> 1) Normally we create design docs in our wiki (which this one is not)
> >>>> 2) Normally when the change is significantly complex we get multiple
> >>>> committers to comment on it (which we did not)
> >>>> On point 2 no one -1'ed the branch, but this is really something that
> >>>> should have required a +1 from 3 committers.
> >>>
> >>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws,
> >>> lay out what votes are needed for what.  I don't see anything there about
> >>> needing 3 +1s for a branch.  Branching would seem to fall under code
> >>> change, which requires one vote and a minimum length of 1 day.
> >>>
> >>>>
> >>>> I for one am not completely sold on Tez.
> >>>> http://incubator.apache.org/projects/tez.html.
> >>>> "directed-acyclic-graph of tasks for processing data": this description
> >>>> sounds like many things which have never become popular. One that comes to
> >>>> mind is oozie: "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
> >>>> actions.". I am sure I can find a number of libraries/frameworks that make
> >>>> this same claim. In general I do not feel like we have done our homework
> >>>> and pre-requisites to justify all this work. If we have done the homework,
> >>>> I am sure that it has not been communicated and accepted by hive developers
> >>>> at large.
> >>>
> >>> A request for better documentation on Tez and a project road map seems
> >>> totally reasonable.
> >>>
> >>>>
> >>>> If we have a branch, why are we also committing on trunk? Scanning through
> >>>> the tez doc I keep finding language like "minimal changes
> >>>> to the planner", yet there are ALREADY lots of large changes going on!
> >>>>
> >>>> Really none of the above would bother me except for the fact that these
> >>>> "minimal changes" are causing many "patch available" ready-for-review bugs
> >>>> and core hive features to need to be rebased.
> >>>>
> >>>> I am sure I have mentioned this before, but I have to spend 12+ hours to
> >>>> test a single patch on my laptop. A few days ago I was testing a new core
> >>>> hive feature. After all the tests passed and before I was able to commit,
> >>>> someone unleashed a tez patch on trunk which caused the thing I was testing
> >>>> for 12 hours to need to be rebased.
> >>>>
> >>>>
> >>>> I'm not cool with this. Next time that happens to me I will seriously
> >>>> consider reverting the patch. Bug fixes and new hive features are more
> >>>> important to me than integrating with incubator projects.
> >>>
> >>> (With my Apache member hat on)  Reverting patches that aren't breaking
> >>> the build is considered very bad form in Apache.  It does make sense to
> >>> request that when people are going to commit a patch that will break many
> >>> other patches they first give a few hours of notice so people can say
> >>> something if they're about to commit another patch and avoid your fate of
> >>> needing to rerun the tests.  The other thing is we need to get the
> >>> automated build of patches working on Hive so committers are not forced to
> >>> run all of the tests themselves.  We are working on it, but we're not
> >>> there yet.
> >>>
> >>> Alan.
> >>>
> >>>
> >>
>
>
