bigtop-user mailing list archives

From Konstantin Boudnik <...@apache.org>
Subject Re: What will the next generation of bigtop look like?
Date Tue, 23 Dec 2014 08:25:47 GMT
I want to agree with Andrew. While Spark is a huge step forward compared to
basic Hadoop, it isn't a solution for everything, and it definitely isn't a
solution for fast processing of data sets that don't fit in memory. Oh, and
by the way, let's not forget that ML/analytics on Hadoop isn't the whole
world of data processing. Say, OLTP workloads command a far larger market
share than just ML. That's why I am very optimistic about Ignite
(incubating).

Cos

On Thu, Dec 11, 2014 at 02:04PM, Andrew Purtell wrote:
> The problem I see with a Spark-only stack is, in my experience, Spark falls
> apart as soon as the working set exceeds all available RAM on the cluster.
> (One is presented with a sea of exceptions.) We need Hadoop anyway for HDFS
> and Common (required by many many components), we get YARN and the MR
> runtime as part of this package, and Hadoop MR is still eminently useful
> when data sets and storage requirements are far beyond agg RAM.
> 
> We have an open JIRA for adding Kafka, it would be fantastic if someone
> picks it up and brings it over the finish line.
> 
> 
> On Thu, Dec 11, 2014 at 10:14 AM, RJ Nowling <rnowling@gmail.com> wrote:
> 
> > GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> > included in BigTop if Spark is included. They're also pretty well
> > integrated with each other.
> >
> > I'd like to throw out a radical idea, based on Andrew's comments: focus on
> > the vertical rather than the horizontal with a slimmed down, Spark-oriented
> > stack.  (This could be a subset of the current stack.)  Strat.io's work
> > provides a nice example of a pure Spark stack.
> >
> > Spark offers a smaller footprint, far less maintenance, functionality of
> > many Hadoop components in one (and better integration!), and is better
> > suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
> >
> > A few other complementary components would be needed: Kafka would be
> > needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> > similar as an alternative storage option.    Combine this with dashboards
> > and visualization and high quality deployment options (Puppet, Docker,
> > etc.).  With the data generator and Spark implementation of BigPetStore, my
> > > goal is to expand BPS to provide high quality analytics examples,
> > oriented more towards data scientists.
> >
> > Just a thought...
> >
> > On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <apurtell@apache.org>
> > wrote:
> >
> >> This is a really great post and I was nodding along with most of it.
> >>
> >> My personal view is Bigtop starts as a deployable stack of Apache
> >> ecosystem components for Big Data. Commodification of (Linux) deployable
> >> packages and basic install integration is the baseline.
> >>
> >> Bigtop packaging Spark components first is an unfortunately little known
> >> win of this community, but it's still a win. Although replicating that
> >> success with choice of the 'next big thing' is going to be a hit or miss
> >> proposition unless one of us can figure out time travel, definitely we can
> >> make some observations and scour and/or influence the Apache project
> >> landscape to pick up coverage in the space:
> >>
> >> - Storage is commoditized. Nearly everyone bases the storage stack on
> >> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
> >>
> >> - Packaging is commoditized. It's a shame that vendors pursue misguided
> >> lock-in strategies but we have no control over that. It's still true that
> >> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
> >> changing package management tools or strategy. As a user of Apache stack
> >> technologies I want long term sustainable package management so will vote
> >> with my feet for the commodity option, and won't be alone. Bigtop should
> >> provide this, and does, and it's mostly a solved problem.
> >>
> >> - Deployment is also a "solved" problem but unfortunately everyone solves
> >> it differently. :-) This is an area where Bigtop can provide real value,
> >> and does, with the Puppet scripts, with the containerization work. One
> >> function Bigtop can serve is as repository and example of Hadoop-ish
> >> production tooling.
> >>
> >> - YARN is a reasonably generic grid resource manager. We don't have the
> >> resources to stand up an alternate RM and all the tooling necessary with
> >> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
> >> it. From the Bigtop perspective I think computation framework options are
> >> well handled, in that I don't see Bigtop or anyone else developing credible
> >> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
> >> And we have Giraph (and is GraphX packaged with Spark?). To the extent
> >> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
> >> contributors can produce value. Related, support for Hive on Spark, Pig on
> >> Spark (spork).
> >>
> >> - The Apache stack includes three streaming computation frameworks -
> >> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
> >> Spark streaming is included in the spark package (I think) but how well is
> >> it integrated? Samza is well integrated with YARN but we don't package it.
> >> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
> >> upstreamed or might be available. Anyway, integration of stream computation
> >> frameworks into Bigtop's packaging and deployment/management scripts can
> >> produce value, especially if we provide multiple options, because vendors
> >> are choosing favorites.
> >>
> >> - Data access. We do have players differentiating themselves here. Bigtop
> >> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
> >> someone's proposed Presto packaging. I'm not sure from the Bigtop
> >> perspective we need to pursue additional alternatives, but if there were
> >> contributions, we might well take them. "Enterprise friendly API" (SQL) is
> >> half of the data access picture I think, the other half is access control.
> >> There are competing projects in incubation, Sentry and Ranger, with no
> >> shared purpose, which is a real shame. To the extent that Bigtop adopts a
> >> cross-component full-stack access control technology, or helps bring
> >> another alternative into incubation and adopts that, we can move the needle
> >> in this space. We'd offer a vendor neutral access control option devoid of
> >> lock-in risk, this would be a big deal for big-E enterprises.
> >>
> >> - Data management and provenance. Now we're moving up the value chain
> >> from storage and data access to the next layer. This is mostly greenfield /
> >> blue ocean space in the Apache stack. We have interesting options in
> >> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
> >> comprehensive.) All of these are higher level data management and
> >> processing workflows which include aspects of management and provenance.
> >> One or more could be adopted and refined. There are a lot of relevant
> >> integration opportunities up and down the stack that could be undertaken
> >> with shared effort of the Bigtop, framework, and component communities.
> >>
> >> - Machine learning. Moving further up the value chain, we have data and
> >> computation and workflow, now how do we derive the competitive advantage
> >> that all of the lower layer technologies are in place for? The new hotness
> >> is surfacing of insights out of scaled parallel statistical inference.
> >> Unfortunately this space doesn't present itself well to the toolbox
> >> approach. Bigtop provides Mahout and MLlib as part of Spark (right?), they
> >> themselves are toolkits with components of varying utility and maturity
> >> (and relevance). I think Bigtop could provide some value by curating ML
> >> frameworks that tie in with other Apache stack technologies. ML toolkits
> >> leave would-be users in the cold. One has to know what one is doing, and
> >> what to do is highly use case specific, this is why "data scientists" can
> >> command obscene salaries and only commercial vendors have the resources to
> >> focus on specific verticals.
> >>
> >> - Visualization and preparation. Moving further up, now we are almost
> >> touching directly the use case. We have data but we need to clean it,
> >> normalize, regularize, filter, slice and dice. Where there are reasonably
> >> generic open source tools, preferably at Apache, for data preparation and
> >> cleaning Bigtop could provide baseline value by packaging it, and
> >> additional value with deeper integration with Apache stack components. Data
> >> preparation is a concern hand in hand with data ingest, so we have an
> >> interesting feedback loop from the top back down to ingest tools/building
> >> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
> >> workflow frameworks too. If there's a friendly licensed open source
> >> graphical front end to the data cleaning/munging/exploration process that
> >> is generic enough that would be a really interesting "acquisition".
> >> - We can also package visualization libraries and toolkits for building
> >> dashboards. Like with ML algorithms, a complete integration is probably out
> >> of scope because every instance would be use case and user specific.
> >>
> >>
> >>
> >> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <cos@apache.org>
> >> wrote:
> >>
> >>> First I want to address RJ's question:
> >>>
> >>> The most prominent downstream Bigtop dependency would be any commercial
> >>> Hadoop distribution like HDP and CDH. The former is trying to disguise
> >>> their affiliation by pushing Ambari forward, and Cloudera is seemingly
> >>> shifting its focus to compressed tarball media (aka parcels), which
> >>> require a closed-source solution like Cloudera Manager to deploy and
> >>> control your cluster, effectively rendering it useless if you ever decide
> >>> to uninstall the control software. In the interest of full disclosure, I
> >>> don't think parcels have any chance of shifting the industry consensus
> >>> from Linux packaging towards something as obscure and proprietary as
> >>> parcels are.
> >>>
> >>>
> >>> And now to my actual points:
> >>>
> >>> I do strongly believe that Bigtop was and is the only completely
> >>> transparent, vendor-friendly way of building your stack from the ground
> >>> up, sticking 100% to official ASF product releases, and of deploying and
> >>> controlling it any way you want to. I agree with Roman's presentation on
> >>> how this project can move forward. However, I somewhat disagree with his
> >>> view of the prospects. It might be a hard road to drive the opinion of
> >>> the community. But it is a high road.
> >>>
> >>> We are definitely small and mostly unsupported by the commercial groups
> >>> that are using the framework. Being a box of LEGO won't win us anything.
> >>> If anything, the empirical evidence is against it, as commercial distros
> >>> have decided to move towards their own means of "vendor lock-in" (yes,
> >>> you heard me right, that's exactly what I said: all so-called open-source
> >>> companies have invented a way to lock in their customers, either with
> >>> fancy "enterprise features" that aren't adding to but amending the
> >>> underlying stack, or with a custom set of patches that oftentimes renders
> >>> clusters incompatible between different vendors).
> >>>
> >>> By all means, my money is on the second way, yet slightly modified (as
> >>> use cases come from users, not developers):
> >>>   #2 start driving adoption of software stacks for the particular kind
> >>>      of data workloads
> >>>
> >>> This community has enough day-to-day practitioners on board to
> >>> accumulate a near-complete insight into where the technology is moving.
> >>> And instead of wobbling in a backwash, let's see if we can be smart and
> >>> define this landscape. After all, Bigtop adopted Spark well before any of
> >>> the commercial vendors officially accepted it. We are seemingly moving
> >>> more and more into the in-memory realm of data processing: Apache Ignite
> >>> (GridGain), Tachyon, Spark. I don't know how many legs Hive has left, but
> >>> I am doubtful that it can walk for much longer... Maybe it's just me.
> >>>
> >>> In this thread http://is.gd/MV2BH9 we already discussed some of the
> >>> aspects influencing the future of this project. And we are de facto
> >>> working on the implementation. In my opinion, Hadoop has been more or
> >>> less commoditized already. And that isn't a bad thing, but it means that
> >>> the innovations are elsewhere. E.g. Spark is moving beyond its ties with
> >>> the storage layer via the Tachyon abstraction; GridGain simply doesn't
> >>> care what the underlying storage is. However, data needs to be stored
> >>> somewhere before it can be processed. And HCFS seems to fit the bill OK.
> >>> But, as I said already, I see the real action elsewhere. If I were to
> >>> define the shape of our mid- to longish-term roadmap, it'd be something
> >>> like this:
> >>>
> >>>             ^   Dashboard/Visualization  ^
> >>>             |     OLTP/ML processing     |
> >>>             |    Caching/Acceleration    |
> >>>             |         Storage            |
> >>>
> >>> And around this we can add/improve on deployment (R8???),
> >>> virtualization/containers/clouds.  In other words - let's focus on the
> >>> vertical part of the stack, instead of simply supporting the status quo.
> >>>
> >>> Does Cassandra fit the Storage layer in that model? I don't know and,
> >>> most importantly, I don't care. If there's interest and manpower to have
> >>> a Cassandra-based stack, sure, but perhaps let's do it as a separate
> >>> branch or something, so we aren't over-complicating things. As Roman said
> >>> earlier, in this case it'd be great to engage the Cassandra/DataStax
> >>> people in this project. But something tells me they won't be eager to
> >>> jump on board.
> >>>
> >>> And finally, all of the above leads to the "how": how can we start
> >>> reshaping the stack into its next incarnation? Perhaps the Ubuntu model
> >>> might be an answer for that, but we have discussed that elsewhere and
> >>> dropped the idea as it wasn't feasible back in the day. Perhaps its time
> >>> has just come?
> >>>
> >>> Apologies for a long post.
> >>>   Cos
> >>>
> >>>
> >>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> >>> > Which other projects depend on BigTop?  How will the questions about
> >>> > the direction of BigTop affect those projects?
> >>> >
> >>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <roman@shaposhnik.org>
> >>> > wrote:
> >>> >
> >>> > > Hi!
> >>> > >
> >>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <jayunit100.apache@gmail.com>
> >>> > > wrote:
> >>> > > > hi bigtop !
> >>> > > >
> >>> > > > I thought I'd start a thread on a few vaguely related thoughts I
> >>> > > > have around the next couple of iterations of bigtop.
> >>> > >
> >>> > > I think in general I see two major ways for something like
> >>> > > Bigtop to evolve:
> >>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
> >>> > >         how these pieces need to be integrated
> >>> > >    #2 start driving opinionated use-cases for the particular kind of
> >>> > >         bigdata workloads
> >>> > >
> >>> > > #1 is sort of what all of the Linux distros have been doing for
> >>> > > the majority of time they existed. #2 is close to what CentOS
> >>> > > is doing with SIGs.
> >>> > >
> >>> > > Honestly, given the size of our community so far and a total
> >>> > > lack of corporate backing (with a small exception of Cloudera
> >>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
> >>> > > love to be wrong, though.
> >>> > >
> >>> > > > 1) Hive:  How will bigtop evolve to support it, now that it is much
> >>> > > > more than a mapreduce query wrapper?
> >>> > >
> >>> > > I think Hive will remain a big part of Hadoop workloads for the
> >>> > > foreseeable future. What I'd love to see more of is rationalizing
> >>> > > things like how HCatalog, etc. need to be deployed.
> >>> > >
> >>> > > > 2) I wonder whether we should confirm cassandra interoperability of
> >>> > > > spark in bigtop distros,
> >>> > >
> >>> > > Only if there's significant interest from the cassandra community,
> >>> > > and even then my biggest fear is that with cassandra we're totally
> >>> > > changing the requirements for the underlying storage subsystem
> >>> > > (nothing wrong with that, it's just that in the Hadoop ecosystem
> >>> > > everything assumes very HDFS'ish requirements for the scale-out
> >>> > > storage).
> >>> > >
> >>> > > > 4) in general, i think bigtop can move in one of 3 directions.
> >>> > > >
> >>> > > >   EXPAND ? : Expanding to include new components, with just basic
> >>> > > > interop, and let folks evolve their own stacks on top of bigtop on
> >>> > > > their own.
> >>> > > >
> >>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> >>> > > > components, with super high quality.
> >>> > > >
> >>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> >>> > > > hadoop's direct ecosystem.
> >>> > > >
> >>> > > > I am intrigued by the idea that A and B both have clear benefits and
> >>> > > > costs... would like to see the opinions of folks --- do we lean in
> >>> > > > one direction or another? What are the criteria for adding a new
> >>> > > > feature, package, stack to bigtop?
> >>> > > >
> >>> > > > ... Or maybe I'm just overthinking it and should be spending this
> >>> > > > time testing spark for 0.9 release....
> >>> > >
> >>> > > I'd love to know what others think, but for 0.9 I'd rather stay the
> >>> > > course.
> >>> > >
> >>> > > Thanks,
> >>> > > Roman.
> >>> > >
> >>> > > P.S. There are also market forces at play that may fundamentally
> >>> > > change the focus of what we're all working on in the next year or so.
> >>> > >
> >>>
> >>
> >>
> >>
> >> --
> >> Best regards,
> >>
> >>    - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
> >>
> >
> >
> 
> 
> -- 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
