incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "DrElephantProposal" by CarlSteinbach
Date Tue, 27 Feb 2018 20:42:56 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DrElephantProposal" page has been changed by CarlSteinbach:
https://wiki.apache.org/incubator/DrElephantProposal

Comment:
Initial version, needs formatting

New page:
== ABSTRACT ==
Dr. Elephant is a performance monitoring and tuning service for jobs and workflows that run
on Apache Hadoop and Apache Spark. While the system is primarily aimed at developers, we have
discovered that it is also popular with cluster operators who find it useful for monitoring
the health of the workloads running on their clusters.

== PROPOSAL ==
Dr. Elephant is a service that helps users of Apache Hadoop and Apache Spark understand, analyze,
and improve the performance of the jobs and workflows running on their clusters. It automatically
gathers metrics, performs analysis, and presents the results along with actionable advice.
The goal of the project is to improve developer productivity and increase cluster efficiency
by reducing the time and domain expertise required to diagnose and treat sick jobs. It analyzes
Hadoop and Spark jobs using a set of configurable, extensible, rule-based heuristics that
provide insights on job performance, and then uses this information to provide recommendations
about how to tune jobs to make them run more efficiently.
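
To make the idea of configurable, rule-based heuristics more concrete, here is a minimal
sketch in Python. It is not Dr. Elephant's actual code or API: the class names, the severity
labels, the metric name mapper_skew_ratio, and the threshold values are all hypothetical and
chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical severity scale, ordered from healthy to worst.
SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]


@dataclass
class Finding:
    heuristic: str
    severity: str
    advice: str


@dataclass
class ThresholdHeuristic:
    """One rule: compare a single job metric against ascending thresholds."""
    name: str
    metric: str
    thresholds: List[float]  # ascending, at most one per severity above NONE
    advice: str

    def apply(self, metrics: Dict[str, float]) -> Finding:
        value = metrics.get(self.metric, 0.0)
        # The severity level is the number of thresholds the value has reached.
        level = sum(1 for t in self.thresholds if value >= t)
        # Advice is only attached when the rule actually fires.
        return Finding(self.name, SEVERITIES[level], self.advice if level else "")


def analyze(metrics: Dict[str, float],
            heuristics: List[ThresholdHeuristic]) -> List[Finding]:
    """Run every configured rule against the metrics gathered for one job."""
    return [h.apply(metrics) for h in heuristics]


# Hypothetical rule and job metrics, for illustration only.
skew_rule = ThresholdHeuristic(
    name="Mapper data skew",
    metric="mapper_skew_ratio",
    thresholds=[2.0, 4.0, 8.0, 16.0],
    advice="Input is unevenly split across mappers; consider repartitioning.",
)
findings = analyze({"mapper_skew_ratio": 5.0}, [skew_rule])
```

In this sketch a rule is plain data (a metric name, thresholds, and advice), so new rules can
be added or retuned through configuration without changing the analysis loop.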

=== BACKGROUND ===
The Hadoop ecosystem at LinkedIn is very diverse. Backend metrics systems, experimentation
systems, data products, and over a dozen data processing frameworks run on our Hadoop infrastructure.
Everything from business analyst reporting to the systems our members interact with on a daily
basis (e.g., People You May Know) uses Hadoop. Close to a thousand users interact with this
infrastructure, and hundreds of thousands of data flows run on it every month. These numbers
continue to grow.

The efficient operation of a Hadoop cluster requires careful tuning of both the cluster infrastructure
and the jobs that run on it. Tuning Hadoop and Spark jobs is a nontrivial task for several
reasons. While we do have a team of Hadoop experts at LinkedIn, we realized that it was a
very inefficient use of their time to make them responsible for assisting all users with optimally
tuning their own Hadoop jobs. At the same time, it would be equally inefficient to try to
train the thousands of Hadoop users at the company on the intricacies of the tuning process.

While there were a variety of existing operator tools, like Ganglia and Nagios, that can help
tune clusters at the global level, we soon realized that there were no good solutions aimed
at users at the workflow and job level. The difficulty of building such a tool is compounded
by the diversity and velocity of the Hadoop ecosystem, as well as by the challenge of
making any solution accessible to users with a wide variety of backgrounds and skill levels.

Over the years of working with Hadoop, these separate factors—users with varying levels
of Hadoop experience, a large number of systems using the Hadoop infrastructure, and a smaller
core team of experts—led to recurring issues at LinkedIn. We found that sub-optimized jobs
were wasting the time of our users, using our hardware in an inefficient manner, and making
it difficult for us to scale the efforts of the core Hadoop team.

Dr. Elephant was developed at LinkedIn to address these issues by spreading best practices
and codifying the rules and tuning tips that experienced Hadoop developers have discovered over time.
Dr. Elephant was open sourced in 2016 after running successfully for 2 years at LinkedIn.

A lot has happened in the time since. Activity on GitHub and the Dr. Elephant mailing list
has been strong since day one, and the Dr. Elephant developers at LinkedIn have made it a
priority to answer questions and handle pull requests. Most of the development goals listed
in the original Dr. Elephant blog post have been accomplished, and many of these — including
support for the Oozie and Airflow workflow schedulers, improved metrics, and enhancements
to the Spark history fetcher and Spark heuristics — were contributed by developers outside
of LinkedIn. We have also been happy to see that many people have been able to benefit from
running Dr. Elephant including companies like Airbnb, Foursquare, Hulu, Pinterest, and more.
Many of these new users have already contributed back to Dr. Elephant, and we’ve even gotten
interest from companies who wish to integrate Dr. Elephant into their commercial product offerings,
including Pepperdata and their new Application Profiler product.

=== RATIONALE ===
Dr. Elephant's entry to the Apache umbrella is beneficial to both the Dr. Elephant and the
Apache communities. Dr. Elephant has greatly benefited from its open source roots. Its community
and adoption have grown greatly as a result. More importantly, the feedback from the community,
whether through interactions at meetups or through the mailing list, has allowed for a rich
exchange of ideas. We believe a partnership with the Apache Software Foundation is the logical
next step. The Apache process has served many other open source projects well, and we believe
that the Dr. Elephant community will greatly benefit from its established development and
consensus processes as well.

=== INITIAL GOALS ===
* Migrate the existing codebase to Apache
* Study and integrate with the Apache development process
* Ensure all dependencies are compliant with Apache License version 2.0
* Incremental development and releases per Apache guidelines
* Diversify the set of core developers and committers

=== CURRENT STATUS ===
Dr. Elephant has been in active development in the open source community since April 2016.
Currently we are aware of at least 10 organizations that are running Dr. Elephant and they
have been proactive in contributing back to open source. Dr. Elephant has also been integrated
into commercial products like Pepperdata Application Profiler.

The Dr. Elephant codebase is currently hosted at github.com, which will seed the Apache Git
repository.

=== MERITOCRACY ===
We plan to invest in supporting a meritocracy. We will discuss the requirements in an open
forum. Several companies have already expressed interest in this project, and we intend to
invite additional developers to participate. We will encourage and monitor community participation
so that privileges can be extended to those that contribute.

=== COMMUNITY ===
The need for a simple and understandable performance monitoring and tuning service for Hadoop
and Spark is tremendous. Dr. Elephant is currently being used by at least 10 organizations
worldwide (some examples are listed here). By bringing Dr. Elephant into Apache, we believe
that the community will grow even bigger.

=== CORE DEVELOPERS ===
Dr. Elephant was started by engineers at LinkedIn in the US and India offices and continues
to be developed this way. We have received contributions from developers across the globe
but haven’t explicitly called out anyone as core developers yet.

=== ALIGNMENT ===
Dr. Elephant aligns exceedingly well with the Apache ecosystem. Dr. Elephant has a clean interface
and can be easily extended and integrated with other open source projects. Our hope is that
users of Apache Hadoop and Spark will quickly adopt Dr. Elephant because of the value it adds.

=== KNOWN RISKS ===

Orphaned products
The risk of the Dr. Elephant project being abandoned is minimal. As noted earlier, there are
many organizations that have already invested in Dr. Elephant significantly and are thus incentivized
to continue development. Companies like Pepperdata have already integrated Dr. Elephant into
their commercial products.

Moreover, Dr. Elephant aims to optimize an organization's valuable developer and cluster
resources. Contributing developer resources to this project is therefore a cost-saving
effort.

Inexperience with Open Source
Dr. Elephant has existed as a healthy open source project for the last year. During that time,
we have curated an open-source community successfully. Any risks that we foresee are ones
associated with scaling our open source communication and operation process rather than with
inherent inexperience in operating an open source project. 

Homogeneous Developers
Apart from LinkedIn’s developers, Dr. Elephant has developers from Airbnb, Pepperdata, Flipkart,
Hulu, Foursquare, Altiscale, PayPal, Evariant, Didi, Trivago, Cardlytics, and many other companies
across the globe.

A lot of effort has been put into efficient communication between all the developers. We have
set up different forums for communication, including GitHub issues, a Google Groups mailing list,
Gitter chat, weekly hangouts, and frequent meetups.

In addition, Dr. Elephant has a close relationship with Apache Hadoop and Apache Spark, especially
in tuning their jobs, so we expect these three separate developer communities to overlap.

Reliance on Salaried Developers
It is expected that Dr. Elephant development will occur on both salaried time and on volunteer
time, after hours. Many of the initial committers are paid by their employer to contribute
to this project. However, they are all passionate about the project, and we are confident
that the project will continue even if no salaried developers contribute to the project. We
are committed to recruiting additional committers including non-salaried developers.

Relationships with Other Apache Products
Dr. Elephant supports Apache Hadoop and Apache Spark, and may support more computation frameworks
in the future. Anything that is tunable and episodic is a potential target.

An Excessive Fascination with the Apache Brand
Dr. Elephant is already a healthy and familiar open source project, and it is backed by a
well-known company with extensive reach through its technical brand and engineering blog. This
proposal is not for the purpose of generating publicity. Rather, the primary benefits to joining
the Apache Software Foundation are already outlined in the Rationale section.

Documentation
https://github.com/linkedin/dr-elephant/wiki

Initial Source
https://github.com/linkedin/dr-elephant 

Source and Intellectual Property Submission Plan
The Dr. Elephant codebase is currently hosted on GitHub. This is the exact codebase that we
would migrate to the Apache Software Foundation. The Dr. Elephant source code is already licensed
under Apache License Version 2.0. Going forward, we will continue to have all the contributions
licensed directly to the Apache Software Foundation through our signed Individual Contributor
License Agreements for all of the committers on the project.

External Dependencies
To the best of our knowledge, all of Dr. Elephant’s dependencies are distributed under Apache
Software Foundation compatible licenses. Upon acceptance to the incubator, we will begin a
thorough analysis of all transitive dependencies to verify this fact and introduce license
checking into the build and release process (for instance, by integrating Apache Rat).


Cryptography
We do not expect Dr. Elephant to be a controlled export item due to the use of encryption.

Required Resources
Mailing lists
dr-elephant-user (Our existing dr-elephant-users@googlegroups will be migrated to this)
dr-elephant-dev
dr-elephant-commit
dr-elephant-private for private PMC discussions (with moderated subscriptions)

Git Repository
Git is the preferred source control system: git://git.apache.org/dr-elephant

Issue Tracking
JIRA  Dr. Elephant (DOCTOR)

Other Resources
The existing code already has unit and integration tests, so we would like a Jenkins instance
to run them whenever a new patch is submitted. This can be added after project creation.

Initial Committers
Akshay Rai <akshayrai09 at gmail dot com>
Anant Nag <nntnag17 at gmail dot com>
Chetna Chaudhari <chetnachaudhari at gmail dot com>
Clemens Valiente <clemens dot valiente at gmail dot com>
Fangshi Li <shengzhixia at gmail dot com>
George Wu <georgieewuu at gmail dot com>
Krishna Puttaswamy <krishnaprasad dot pn at gmail dot com>
Maxime Kestemont <maxkestemont at hotmail dot com>
Noam Shaish <noamshaish at gmail dot com>
Paul Reed Bramsen <prb at paulbramsen dot com>
Ragesh K R <ragesh dot rajagopalan at gmail dot com>
Shankar Manian <shankar37 at gmail dot com>
Shahrukh Khan <shahrukhkhan489 at gmail dot com>
Shekhar Gupta <shkhrgpt at gmail dot com>
Shida Li <lishid at gmail dot com>

Affiliations
Akshay Rai - LinkedIn
Anant Nag - LinkedIn
Chetna Chaudhari - SkyTv New Zealand
Clemens Valiente - trivago GmbH
Fangshi Li - LinkedIn
George Wu - Pinterest
Krishna Puttaswamy - Airbnb
Mark Wagner - LinkedIn
Maxime Kestemont - Criteo
Noam Shaish - Nordea Bank
Ragesh K R - LinkedIn
Shankar Manian - LinkedIn
Shahrukh Khan - Hortonworks
Shekhar Gupta - Pepperdata
Shida Li - Dynalist Inc.

Sponsors
Champion
Carl Steinbach - cws@apache.org
Nominated Mentors
Sponsoring Entity
The Apache Incubator


---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

