incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "CloudbreakProposal" by SteveLoughran
Date Tue, 10 Nov 2015 14:18:13 GMT

The "CloudbreakProposal" page has been changed by SteveLoughran:

New page:
= Cloudbreak Proposal =

== Abstract ==

Apache Cloudbreak will be a Docker-based tool for provisioning and managing Apache Hadoop
clusters in cloud infrastructures.

== Proposal ==

Cloudbreak will automate Hadoop cluster deployment in the cloud. Its autoscaling functionality
will enable more efficient usage of cloud platforms by expanding and contracting the cluster
based on Hadoop usage metrics and defined policies. It will offer centralized and secure interaction
with the Hadoop cluster through a Web UI, REST API and CLI shell —across all supported cloud

Cloudbreak will be built using Docker containers, deployed across multiple cloud providers:
Microsoft Azure, Amazon AWS, Google Cloud Platform, OpenStack. It will use Apache Ambari,
Docker containers, Swarm and Consul as underlying technologies.

== Background ==

Apache Hadoop is now a near-universal foundation for large scale data analytics platforms.
What was a few years ago a complex system used by only a few web companies is now a base component
for all at-scale data storage and processing systems. A metric of its success is that even
companies which have built their businesses around RDBMS products are now selling hardware explicitly
designed for Hadoop.

Where Hadoop has been weaker is in in-cloud deployment; there's an implicit assumption in
the code that it is deployed on physical clusters: node failures are independent and they
usually return with the same address and data. Most critically: that the size of the Hadoop
cluster is relatively constant, or slowly expanding: any dramatic shrinking of cluster size
is a failure, not a common event the system must adapt to.

Furthermore, there is no current open source tool for dynamic Hadoop clusters across multiple
cloud infrastructures. Apache Whirr is retired; OpenStack Savanna targets OpenStack alone,
while Apache Provisionr was retired early, during incubation.

As a result, in-cloud deployment is primarily offered by the cloud infrastructure providers
themselves. Those offerings lack the benefits of an open source development model, and they tie
users into specific infrastructures.

Cloudbreak was initially developed by SequenceIQ to allow users to automate Hadoop cluster
deployment in the cloud. It is designed to automate and simplify deployment and infrastructure
management of Hadoop clusters in many cloud infrastructures and Docker-based systems, so
providing truly platform-agnostic agile deployment options for Hadoop.

Cloudbreak is already used in production Hadoop clusters and in test infrastructure where
lightweight docker-based virtual clusters have aided the product release process.

== Rationale ==

Cloudbreak will address usability gaps in Hadoop cluster deployment in different cloud environments
by providing an easy and consistent mechanism to do so. 

The project aims to grow a community to help build a widely adopted deployment and management
toolset for Hadoop and related services for any cloud or Docker-based deployment environment,
which will advance the interests of the whole community.

We also hope to drive better support for cloud deployments within Hadoop itself. That includes
better support for dynamic clusters, through data and placement strategies, the metrics and
monitoring needed for better dynamic cluster sizing decisions, and more agile client/server

== Initial Goals ==

A key initial goal is converting an in-house project into an ASF open source project with
a community around it. The strategy there will be: regular releases with engagement with as
many interested projects as possible. We will view all users as potential developers and encourage
them to get involved —even if it is just filing bug reports and feature requests.

The current code is functional; making sure anyone can build it will be achieved by encouraging
people to try it, and by setting up Apache Jenkins to build and test the code and submitted

All cloud infrastructure projects have the perennial problem of billing: you need to pay to
test, and those credentials cannot be exposed to Jenkins. The Cloudbreak test suites will
have to be designed to be executable by all, and to support local OpenStack hosts/VMs for zero-cost
test runs. We will also encourage developers to aid in developing the test infrastructure
as much as the production code. 

== Current Status ==

An initial version with the core set of features has been developed by the initial committers
and is hosted on GitHub.

It supports:

 1. Automated Hadoop provisioning in the supported environments.
 1. Declarative clusters through Blueprints.
 1. Automated Kerberization. 
 1. Advanced security with network, security groups, and other mechanisms.
 1. Auto-scaling based on more than 400 metrics, time and triggers.
 1. An SPI to bring new providers under management (both API and template based).
 1. Extensibility with custom recipes: hooks into the lifecycle of the provisioning.
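
To make the auto-scaling item concrete, the following is a minimal sketch of a threshold-based scaling decision with a cooldown period. It is illustrative only: the `ScalingPolicy` fields, the `desired_node_count` function and the `pending_containers` metric name are hypothetical and are not taken from the Cloudbreak code or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy record; not Cloudbreak's actual data model."""
    metric: str              # name of the Hadoop usage metric to watch
    scale_up_above: float    # add nodes when the metric exceeds this value
    scale_down_below: float  # remove nodes when the metric drops below this value
    step: int                # number of nodes to add or remove per decision
    min_nodes: int           # never shrink the cluster below this size
    max_nodes: int           # never grow the cluster above this size
    cooldown_s: int          # minimum seconds between two scaling actions


def desired_node_count(policy, current_nodes, metric_value, now_s, last_scaled_s):
    """Return the node count the cluster should move to (may equal current_nodes)."""
    # Inside the cooldown window: hold steady to avoid flapping.
    if now_s - last_scaled_s < policy.cooldown_s:
        return current_nodes
    if metric_value > policy.scale_up_above:
        target = current_nodes + policy.step
    elif metric_value < policy.scale_down_below:
        target = current_nodes - policy.step
    else:
        target = current_nodes
    # Clamp the result to the bounds defined by the policy.
    return max(policy.min_nodes, min(policy.max_nodes, target))


# Example policy with made-up thresholds.
policy = ScalingPolicy(
    metric="pending_containers",
    scale_up_above=50,
    scale_down_below=5,
    step=2,
    min_nodes=3,
    max_nodes=10,
    cooldown_s=600,
)
```

With this policy, a four-node cluster reporting a metric value of 80 would be asked to grow to six nodes, unless a scaling action happened within the last ten minutes, in which case the size is left unchanged. Time-based and trigger-based rules would need additional inputs that this sketch leaves out.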

=== Meritocracy ===

We recognise how critical it is to have a broad user and developer community: today's users
are tomorrow's developers.

Our intent with this proposal is to start building a diverse developer community around Cloudbreak
following the Apache meritocracy model. We have wanted to make the project open source and
encourage contributors from multiple organizations from the start.

We plan to provide substantial support to new developers and to quickly recruit those who
make solid contributions to committer status.

=== Community ===

While the project has started within a single organisation, we are already receiving contributions
on GitHub from others (most recently, Symantec).

We hope to extend the user and developer base further in the future and build a solid open
source community around Cloudbreak. As mentioned, Cloudbreak is already in use for generating
test clusters for Hadoop itself; this should motivate interest in and use of the system.

=== Core Developers ===

Cloudbreak development is currently being led by engineers from Hortonworks – Janos Matyas,
Lajos Papp and Attila Kanto. All the engineers have deep expertise in cloud, Docker and Hadoop
cluster management, and are familiar with the Hadoop ecosystem.

=== Alignment ===

The ASF is a natural host for Cloudbreak given that it is already the home of Hadoop, HBase,
Hive, Falcon, Pig, Oozie, Spark, Knox, Ranger, and other emerging “big data” and cloud-related
software projects.

Cloudbreak fills a gap in the Hadoop ecosystem in the areas of cloud
and Docker-based deployment options, and can provide a focus for adding cloud-first features
into Hadoop itself.

== Known Risks ==

=== Orphaned Products ===

The core developers plan to work full time on the project. There is very little risk of Cloudbreak
getting orphaned. Cloudbreak is in production use already in some enterprises.

=== Inexperience with Open Source ===

The original author, Janos Matyas, is a committer on Apache Ambari and so has experience in OSS
development. Many of the others have worked on OSS projects, and worked closely with developers
working full time on other Apache projects, in particular Apache Hadoop, YARN and Ambari.

=== Homogeneous Developers ===

The current core developers are from a diverse set of organizations, such as Hortonworks, HP and
Symantec. We expect to quickly establish a developer community that includes contributions
from additional organizations during and after the incubation process.

=== Reliance on Salaried Developers ===

Currently, most developers are paid to do work on Cloudbreak but a few are contributing in
their spare time. However, once the project has a community built around it post incubation,
we expect to get additional committers and developers from outside the current core developers.

=== Relationships with Other Apache Products ===

Cloudbreak is going to be used by the users of Apache Hadoop and the Hadoop ecosystem in general
– particularly with Apache Ambari for rationalizing deployment in Cloud or cloud-like deployment

We plan to use Apache Yetus (incubating) to qualify patches and so aid patch submission.

With our goal of providing a broadly usable platform for cloud-hosting of the Hadoop stack,
we hope to work with interested ASF projects for testing their code in these environments,
and, ideally, adapting their code to work better in cloud infrastructure —improvements which
will be generic to any cloud deployment platform.

=== An Excessive Fascination with the Apache Brand ===

While we respect the reputation of the Apache brand and have no doubts that it will attract
contributors and users, our interest is primarily to give Cloudbreak a solid home as an open
source project following an established development model.

== Documentation ==

The current documentation is hosted at [[]]
Importing and unbranding this data to the ASF repositories will be an early step in the incubation

== Initial Source ==
The source is currently hosted at: [[|github/cloudbreak]]

== Source and Intellectual Property Submission Plan ==

The complete Cloudbreak code is already licensed under the Apache License, Version 2.0. This covers
submissions from multiple parties to the GitHub-hosted code as it stands today.

== External Dependencies ==

The dependencies all have Apache compatible licenses. These include BSD, MIT and other licensed

== Cryptography ==

== Required Resources ==

Git hosting of source code, JIRA, Jenkins, Mailing lists.

=== Proposed Mailing lists ===

 * cloudbreak-dev AT incubator DOT apache DOT org
 * cloudbreak-commits AT incubator DOT apache DOT org
 * cloudbreak-private AT incubator DOT apache DOT org

=== Subversion Directory ===

Git is the preferred source control system. We propose the repository and a mirror on GitHub.

=== Issue Tracking ===

JIRA, with the project name CLOUDBREAK.

== Initial Committers ==
 * Oliver Szabo
 * Ferenc Schneider
 * Marton Sereg
 * Lajos Papp
 * Janos Matyas
 * Richard Doktorics
 * Attila Kanto
 * Krisztian Horvath
 * Laszlo Puskas
 * Tamas Bihari
 * Balazs Bihari
 * Richard Kovacs

== Affiliations ==

 * Oliver Szabo (Hortonworks)
 * Ferenc Schneider (Hortonworks)
 * Marton Sereg (Hortonworks)
 * Lajos Papp (Hortonworks)
 * Janos Matyas (Hortonworks)
 * Richard Doktorics (Hortonworks)
 * Attila Kanto (Hortonworks)
 * Krisztian Horvath (Hortonworks)
 * Laszlo Puskas (Hortonworks)
 * Tamas Bihari (Hortonworks)
 * Balazs Bihari (Hortonworks)
 * Richard Kovacs (Hortonworks)

== Sponsors ==

=== Champion ===
 * Steve Loughran (stevel AT apache DOT org)

=== Nominated Mentors ===
 * Steve Loughran (stevel AT apache DOT org)
 * Arun Murthy (acmurthy AT apache DOT org)
 * Vinod Kumar Vavilapalli (vinodkv AT apache DOT org)
 * Sid Wagle (swagle AT apache DOT org)
 * Enis Soztutar (enis AT apache DOT org)
 * Jakob Homan (jghoman AT apache DOT org)

=== Sponsoring Entity ===
Incubator PMC
