bigtop-user mailing list archives

From Andrew Purtell <>
Subject Re: What will the next generation of bigtop look like?
Date Thu, 11 Dec 2014 22:04:20 GMT
The problem I see with a Spark-only stack is, in my experience, Spark falls
apart as soon as the working set exceeds all available RAM on the cluster.
(One is presented with a sea of exceptions.) We need Hadoop anyway for HDFS
and Common (required by many many components), we get YARN and the MR
runtime as part of this package, and Hadoop MR is still eminently useful
when data sets and storage requirements are far beyond aggregate RAM.

We have an open JIRA for adding Kafka; it would be fantastic if someone
picks it up and brings it over the finish line.

On Thu, Dec 11, 2014 at 10:14 AM, RJ Nowling <> wrote:

> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> included in BigTop if Spark is included. They're also pretty well
> integrated with each other.
> I'd like to throw out a radical idea, based on Andrew's comments: focus on
> the vertical rather than the horizontal with a slimmed down, Spark-oriented
> stack.  (This could be a subset of the current stack.)'s work
> provides a nice example of a pure Spark stack.
> Spark offers a smaller footprint, far less maintenance, functionality of
> many Hadoop components in one (and better integration!), and is better
> suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
> A few other complementary components would be needed: Kafka would be
> needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> similar as an alternative storage option.    Combine this with dashboards
> and visualization and high quality deployment options (Puppet, Docker,
> etc.).  With the data generator and Spark implementation of BigPetStore, my
> goal is to expand BPS to provide high quality analytics examples,
> oriented more towards data scientists.
> Just a thought...
> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <>
> wrote:
>> This is a really great post and I was nodding along with most of it.
>> My personal view is Bigtop starts as a deployable stack of Apache
>> ecosystem components for Big Data. Commodification of (Linux) deployable
>> packages and basic install integration is the baseline.
>> Bigtop packaging Spark components first is an unfortunately little known
>> win of this community, but it's still a win. Although replicating that
>> success with choice of the 'next big thing' is going to be a hit or miss
>> proposition unless one of us can figure out time travel, definitely we can
>> make some observations and scour and/or influence the Apache project
>> landscape to pick up coverage in the space:
>> - Storage is commoditized. Nearly everyone bases the storage stack on
>> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>> - Packaging is commoditized. It's a shame that vendors pursue misguided
>> lock-in strategies but we have no control over that. It's still true that
>> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
>> changing package management tools or strategy. As a user of Apache stack
>> technologies I want long term sustainable package management so will vote
>> with my feet for the commodity option, and won't be alone. Bigtop should
>> provide this, and does, and it's mostly a solved problem.
>> - Deployment is also a "solved" problem but unfortunately everyone solves
>> it differently. :-) This is an area where Bigtop can provide real value,
>> and does, with the Puppet scripts, with the containerization work. One
>> function Bigtop can serve is as repository and example of Hadoop-ish
>> production tooling.
>> - YARN is a reasonably generic grid resource manager. We don't have the
>> resources to stand up an alternate RM and all the tooling necessary with
>> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
>> it. From the Bigtop perspective I think computation framework options are
>> well handled, in that I don't see Bigtop or anyone else developing credible
>> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
>> And we have Giraph (and is GraphX packaged with Spark?). To the extent
>> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
>> contributors can produce value. Related, support for Hive on Spark, Pig on
>> Spark (spork).
>> - The Apache stack includes three streaming computation frameworks -
>> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
>> Spark streaming is included in the spark package (I think) but how well is
>> it integrated? Samza is well integrated with YARN but we don't package it.
>> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
>> upstreamed or might be available. Anyway, integration of stream computation
>> frameworks into Bigtop's packaging and deployment/management scripts can
>> produce value, especially if we provide multiple options, because vendors
>> are choosing favorites.
>> - Data access. We do have players differentiating themselves here. Bigtop
>> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
>> someone's proposed Presto packaging. I'm not sure from the Bigtop
>> perspective we need to pursue additional alternatives, but if there were
>> contributions, we might well take them. "Enterprise friendly API" (SQL) is
>> half of the data access picture I think, the other half is access control.
>> There are competing projects in incubation, Sentry and Ranger, with no
>> shared purpose, which is a real shame. To the extent that Bigtop adopts a
>> cross-component full-stack access control technology, or helps bring
>> another alternative into incubation and adopts that, we can move the needle
>> in this space. We'd offer a vendor neutral access control option devoid of
>> lock-in risk, this would be a big deal for big-E enterprises.
>> - Data management and provenance. Now we're moving up the value chain
>> from storage and data access to the next layer. This is mostly greenfield /
>> blue ocean space in the Apache stack. We have interesting options in
>> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
>> comprehensive.) All of these are higher level data management and
>> processing workflows which include aspects of management and provenance.
>> One or more could be adopted and refined. There are a lot of relevant
>> integration opportunities up and down the stack that could be undertaken
>> with shared effort of the Bigtop, framework, and component communities.
>> - Machine learning. Moving further up the value chain, we have data and
>> computation and workflow, now how do we derive the competitive advantage
>> that all of the lower layer technologies are in place for? The new hotness
>> is surfacing of insights out of scaled parallel statistical inference.
>> Unfortunately this space doesn't present itself well to the toolbox
>> approach. Bigtop provides Mahout and MLlib as part of Spark (right?), they
>> themselves are toolkits with components of varying utility and maturity
>> (and relevance). I think Bigtop could provide some value by curating ML
>> frameworks that tie in with other Apache stack technologies. ML toolkits
>> leave would-be users in the cold. One has to know what one is doing, and
>> what to do is highly use case specific, this is why "data scientists" can
>> command obscene salaries and only commercial vendors have the resources to
>> focus on specific verticals.
>> - Visualization and preparation. Moving further up, now we are almost
>> touching directly the use case. We have data but we need to clean it,
>> normalize, regularize, filter, slice and dice. Where there are reasonably
>> generic open source tools, preferably at Apache, for data preparation and
>> cleaning Bigtop could provide baseline value by packaging it, and
>> additional value with deeper integration with Apache stack components. Data
>> preparation is a concern hand in hand with data ingest, so we have an
>> interesting feedback loop from the top back down to ingest tools/building
>> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
>> workflow frameworks too. If there's a friendly licensed open source
>> graphical front end to the data cleaning/munging/exploration process that
>> is generic enough that would be a really interesting "acquisition".
>> - We can also package visualization libraries and toolkits for building
>> dashboards. Like with ML algorithms, a complete integration is probably out
>> of scope because every instance would be use case and user specific.
>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <>
>> wrote:
>>> First I want to address RJ's question:
>>> The most prominent downstream Bigtop dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to disguise
>>> its affiliation by pushing Ambari forward, and Cloudera is seemingly
>>> shifting its focus to compressed tarball media (aka parcels), which
>>> require a closed-source solution like Cloudera Manager to deploy and
>>> control your cluster, effectively rendering it useless if you ever
>>> decide to uninstall the control software. In the interest of full
>>> disclosure, I don't think parcels have any chance of shifting the
>>> industry consensus from Linux packaging towards something as obscure
>>> and proprietary as parcels are.
>>> And now to my actual points:
>>> I do strongly believe that Bigtop was and is the only completely
>>> transparent, vendor-friendly way, sticking 100% to official ASF product
>>> releases, of building your stack from the ground up and deploying and
>>> controlling it any way you want to. I agree with Roman's presentation
>>> on how this project can move forward. However, I somewhat disagree with
>>> his view on the prospects. It might be a hard road to drive the opinion
>>> of the community. But it is a high road.
>>> We are definitely small and mostly unsupported by the commercial groups
>>> that are using the framework. Being a box of LEGO won't win us anything.
>>> If anything, the empirical evidence is against it, as commercial distros
>>> have decided to move towards their own means of "vendor lock-in" (yes,
>>> you heard me right - that's exactly what I said: all so-called
>>> open-source companies have invented a way to lock in their customers,
>>> either with fancy "enterprise features" that aren't adding to but
>>> amending the underlying stack, or with a custom set of patches that
>>> oftentimes renders clusters incompatible between different vendors).
>>> By all means, my money is on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind
>>>      of data workloads
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete picture of where the technology is moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and
>>> define this landscape. After all, Bigtop adopted Spark well before any
>>> of the commercial vendors officially accepted it. We seemingly are
>>> moving more and more into the in-memory realm of data processing: Apache
>>> Ignite (GridGain), Tachyon, Spark. I don't know how much legs Hive has
>>> got in it, but I am doubtful that it can walk for much longer... Maybe
>>> it's just me.
>>> In this thread we already discussed some of the aspects influencing the
>>> future of this project. And we are de facto working on the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark is moving beyond its ties with the storage layer
>>> via the Tachyon abstraction; GridGain simply doesn't care what the
>>> underlying storage is. However, data needs to be stored somewhere before
>>> it can be processed. And HCFS seems to fit the bill OK. But, as I said
>>> already, I see the real action elsewhere. If I were to define the shape
>>> of our mid- to longish-term roadmap it'd be something like this:
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>> Does Cassandra fit the Storage layer in that model? I don't know and,
>>> most importantly, I don't care. If there's interest and manpower to have
>>> a Cassandra-based stack - sure, but perhaps let's do it as a separate
>>> branch or something, so we aren't over-complicating things. As Roman
>>> said earlier, in this case it'd be great to engage Cassandra/DataStax
>>> people in this project. But something tells me they won't be eager to
>>> jump on board.
>>> And finally, all of the above leads to "how": how can we start
>>> reshaping the stack into its next incarnation? Perhaps the Ubuntu model
>>> might be an answer for that, but we have discussed that elsewhere and
>>> dropped the idea as it wasn't feasible back in the day. Perhaps its time
>>> has just come?
>>> Apologies for a long post.
>>>   Cos
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about
>>> > the direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <> wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <> wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought I'd start a thread with a few vaguely related thoughts
>>> > > > I have around the next couple of iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving opinionated use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop evolve to support it, now that it is
>>> > > > much more than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for the
>>> > > foreseeable future. What I'd love to see more of is rationalizing
>>> > > things like how HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder whether we should confirm cassandra interoperability
>>> > > > with spark in bigtop distros,
>>> > >
>>> > > Only if there's significant interest from the cassandra community,
>>> > > and even then my biggest fear is that with cassandra we're totally
>>> > > changing the requirements for the underlying storage subsystem
>>> > > (nothing wrong with that, it's just that in the Hadoop ecosystem
>>> > > everything assumes very HDFS'ish requirements for the scale-out
>>> > > storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > > interop, and let folks evolve their own stacks on top of bigtop on
>>> > > > their own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > > components, with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for
>>> > > > just hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B; both have clear benefits
>>> > > > and costs... would like to see the opinions of folks --- do we
>>> > > > lean in one direction or another? What are the criteria for adding
>>> > > > a new feature, package, stack to bigtop?
>>> > > >
>>> > > > ... Or maybe I'm just overthinking it and should be spending this
>>> > > > time testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what others think, but for 0.9 I'd rather stay the
>>> > > course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally
>>> > > change the focus of what we're all working on in the next year or so.
>>> > >
>> --
>> Best regards,
>>    - Andy
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
