incubator-general mailing list archives

From Byung-Gon Chun <bgc...@gmail.com>
Subject [PROPOSAL] Onyx - proposal for Apache Incubation
Date Thu, 25 Jan 2018 23:28:00 GMT
Dear Apache Incubator Community,

Please accept the following proposal for presentation and discussion:
https://wiki.apache.org/incubator/OnyxProposal

Onyx is a data processing system that aims to flexibly control the runtime
behaviors of a job to adapt to varying deployment characteristics (e.g.,
harnessing transient resources in datacenters, cross-datacenter deployment,
changing runtime based on job characteristics, etc.). Onyx provides ways to
extend the system’s capabilities and incorporate the extensions into
flexible job execution.
Onyx translates a user program (e.g., an Apache Beam or Apache Spark program) into an
Intermediate Representation (IR) DAG, which Onyx optimizes and deploys
based on a deployment policy.

I've attached the proposal below.

Best regards,
Byung-Gon Chun

= OnyxProposal =

== Abstract ==
Onyx is a data processing system that can be flexibly deployed under
different execution scenarios to suit the varying deployment
characteristics of clusters.

== Proposal ==
Today, there is a wide variety of data processing systems with
different designs for better performance and datacenter efficiency.
They include processing data on specific resource environments and
running jobs with specific attributes. Although each system
successfully solves the problems it targets, most systems are designed
so that runtime behaviors are built tightly into the system
core to hide the complexity of distributed computing. This makes it
hard for a single system to support different deployment
characteristics with different runtime behaviors without substantial
effort.

Onyx is a data processing system that aims to flexibly control the
runtime behaviors of a job to adapt to varying deployment
characteristics. Moreover, it provides a means of extending the
system’s capabilities and incorporating the extensions into flexible
job execution.

In order to be able to easily modify runtime behaviors to adapt to
varying deployment characteristics, Onyx exposes runtime behaviors to
be flexibly configured and modified at both compile-time and runtime
through a set of high-level graph pass interfaces.

We hope to contribute to the big data processing community by enabling
more flexibility and extensibility in job executions. Furthermore, we
can all benefit more when we work together as a community to mature
the system with more use cases and a better understanding
of diverse deployment characteristics. The Apache Software Foundation
is the perfect place to achieve these aspirations.

== Background ==
Many data processing systems have distinctive runtime behaviors
optimized and configured for specific deployment characteristics like
different resource environments and for handling special job
attributes.

For example, much research has been conducted to overcome the
challenge of running data processing jobs on cheap, unreliable
transient resources. Likewise, techniques for disaggregating different
types of resources, like memory, CPU and GPU, are being actively
developed to use datacenter resources more efficiently. Many
researchers are also working to run data processing jobs in even more
diverse environments, such as across distant datacenters. Similarly,
for special job attributes, many works take different approaches, such
as runtime optimization, to solve problems like data skew, and to
optimize systems for data processing jobs with small-scale input data.

Although each of these systems performs well with the jobs and in the
environments it targets, it performs poorly in cases it was not
designed for, and its design does not consider supporting multiple
deployment characteristics on a single system.

For an application writer, optimizing an application to perform well
on a system whose underlying behaviors are built in requires a deep
understanding of the system itself, which often takes a lot of time
and effort. Moreover, for a developer, modifying such system behaviors
means modifying the system core, which requires an even deeper
understanding of the system.

With this background, Onyx is designed to represent all of its jobs as
an Intermediate Representation (IR) DAG. In the Onyx compiler, user
applications from various programming models (e.g., Apache Beam) are
submitted, transformed to an IR DAG, and optimized/customized for the
deployment characteristics. In the IR DAG optimization phase, the DAG
is modified through a series of compiler “passes” which reshape or
annotate the DAG with an expression of the underlying runtime
behaviors. The IR DAG is then submitted as an execution plan for the
Onyx runtime. The runtime keeps the fixed parts of data processing in
a backbone that is transparently integrated with configurable
components exposed for further extension.
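
As an illustration only, the pass pipeline described above might be
sketched as follows. This is not the Onyx API: the names IRDag,
annotate_transient_placement, insert_relay_before_sinks and optimize,
as well as the "placement" property, are hypothetical and chosen just
to show one pass that reshapes the DAG and one that annotates it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class IRDag:
    """Toy IR DAG (hypothetical): vertices, directed edges, annotations."""
    vertices: List[str] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)
    # Execution properties per vertex, filled in by annotating passes.
    properties: Dict[str, Dict[str, str]] = field(default_factory=dict)


# A pass takes a DAG and returns the transformed DAG.
Pass = Callable[[IRDag], IRDag]


def insert_relay_before_sinks(dag: IRDag) -> IRDag:
    """Reshaping pass: place a relay vertex on every edge ending at a sink."""
    has_outgoing = {src for src, _ in dag.edges}
    new_edges: List[Tuple[str, str]] = []
    for src, dst in dag.edges:
        if dst not in has_outgoing:  # dst is a sink
            relay = f"relay:{src}->{dst}"
            dag.vertices.append(relay)
            new_edges += [(src, relay), (relay, dst)]
        else:
            new_edges.append((src, dst))
    dag.edges = new_edges
    return dag


def annotate_transient_placement(dag: IRDag) -> IRDag:
    """Annotating pass: sources go to transient resources, the rest to reserved."""
    has_incoming = {dst for _, dst in dag.edges}
    for v in dag.vertices:
        placement = "reserved" if v in has_incoming else "transient"
        dag.properties.setdefault(v, {})["placement"] = placement
    return dag


def optimize(dag: IRDag, passes: List[Pass]) -> IRDag:
    """Apply the passes in order; the result stands in for the execution plan."""
    for p in passes:
        dag = p(dag)
    return dag


dag = IRDag(vertices=["read", "map", "reduce"],
            edges=[("read", "map"), ("map", "reduce")])
plan = optimize(dag, [insert_relay_before_sinks, annotate_transient_placement])
```

In this sketch a deployment policy is just an ordered list of passes,
so supporting a new deployment characteristic means writing another
pass rather than changing the code that runs the plan.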

== Rationale ==
Onyx’s vision lies in providing means for flexibly supporting a wide
variety of job execution scenarios for users while making it easy for
system developers to extend the execution framework with various
functionalities. The capabilities of the system can
be extended as it grows to meet a wider variety of execution scenarios.
We need input from users and developers from diverse domains in
order to make it a more thriving and useful project. The Apache
Software Foundation provides the best tools and community to support
this vision.

== Initial Goals ==
Initial goals will be to move the existing codebase to Apache and
integrate with the Apache development process. We further plan to
develop our system to meet the needs of more execution scenarios
across a wider variety of deployment characteristics.

== Current Status ==
The Onyx codebase is currently hosted in a repository on GitHub. The
current version has been developed by system developers at Seoul
National University, Viva Republica, Samsung, and LG.

== Meritocracy ==
We plan to strongly support meritocracy. We will discuss the
requirements in an open forum, and those who continuously contribute
to Onyx with the passion to strengthen the system will be invited as
committers. Contributors who enrich Onyx by providing various use
cases and various implementations of the configurable components,
including ideas for optimization techniques, will be especially
welcome. Committers with a deep understanding of the system’s
technical aspects as a whole and its philosophy will be
voted into the PMC. We will monitor community participation so that
privileges can be extended to those who contribute.

== Community ==
We hope to expand our contributor community by becoming an Apache
Incubator project. The contributions will come from both users and
system developers interested in the flexibility and extensibility of
job execution that Onyx can support. We expect users to contribute
mainly by diversifying the use cases and deployment characteristics,
and developers to contribute by implementing them.

== Alignment ==
Apache Spark is one of many popular data processing frameworks. The
system is designed to optimize jobs using in-memory RDDs, along with
many other optimizations built tightly into the framework. In
contrast to Spark, Onyx aims to provide more flexibility for job
execution in an easy manner.

Apache Tez enables developers to build complex task DAGs with control
over the control plane of job execution. In Onyx, a program written
for a high-level programming layer (e.g., Apache Beam) is automatically
converted to a basic IR DAG and can be converted to any IR DAG through
a series of easy-to-write user passes that can both reshape the DAG
and modify its annotations (of execution properties). Moreover, Onyx leaves
more parts of the job execution configurable, such as the scheduler
and the data plane. As opposed to providing a set of properties for
solid optimization, Onyx’s configurable parts can be easily extended
and explored by implementing the pre-defined interfaces. For example,
an arbitrary intermediate data store can be added.
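
To make the last point concrete, here is a minimal sketch of how a
pre-defined interface could let a new intermediate data store be
plugged in. It is an assumption-laden illustration, not Onyx code: the
IntermediateDataStore interface, the STORES registry, the "data_store"
edge property and the store names are all hypothetical.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict


class IntermediateDataStore(ABC):
    """Hypothetical pre-defined interface for intermediate data stores."""

    @abstractmethod
    def put(self, block_id: str, data: bytes) -> None:
        """Store one block of intermediate data."""

    @abstractmethod
    def get(self, block_id: str) -> bytes:
        """Fetch a previously stored block."""


class InMemoryStore(IntermediateDataStore):
    """Keeps blocks in a dictionary."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = data

    def get(self, block_id: str) -> bytes:
        return self._blocks[block_id]


class LocalFileStore(IntermediateDataStore):
    """An added store: writes each block to a file under root_dir."""

    def __init__(self, root_dir: str) -> None:
        self._root = Path(root_dir)

    def put(self, block_id: str, data: bytes) -> None:
        (self._root / block_id).write_bytes(data)

    def get(self, block_id: str) -> bytes:
        return (self._root / block_id).read_bytes()


# Registry from the store name used in an edge annotation to an implementation.
STORES: Dict[str, IntermediateDataStore] = {"memory": InMemoryStore()}


def register_store(name: str, store: IntermediateDataStore) -> None:
    """Make a new store selectable without touching the code that uses it."""
    STORES[name] = store


def write_edge_output(edge_properties: Dict[str, str],
                      block_id: str, data: bytes) -> None:
    """Write a block to whichever store the edge annotation names."""
    STORES[edge_properties["data_store"]].put(block_id, data)


register_store("local_file", LocalFileStore(tempfile.mkdtemp()))
write_edge_output({"data_store": "local_file"}, "block-0", b"partial result")
write_edge_output({"data_store": "memory"}, "block-1", b"other result")
```

The code that writes edge output only looks up a name from the
annotation, so adding a store is a matter of implementing the
interface and registering it.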

Onyx currently supports Apache Beam programs and we are working on
supporting Apache Spark programs as well. Onyx also utilizes Apache
REEF for container management, which allows Onyx to run in Apache YARN
and Apache Mesos clusters. If necessary, we plan to contribute to and
collaborate with these other Apache projects for the benefit of all.
We plan to extend such integrations to more Apache software. The Apache
Software Foundation already hosts many major big-data systems, and we
expect to help further the growth of the big-data community by having
Onyx within the Apache Software Foundation.

== Known Risks ==
=== Orphaned Products ===
The risk of the Onyx project being orphaned is minimal. There is
already plenty of work that arduously supports different deployment
characteristics, and we propose a general way to implement them with
flexible and extensible configuration knobs. The domain of data
processing is already of high interest, and this domain is expected to
evolve continuously with various other purposes, such as resource
disaggregation and using transient resources for better datacenter
resource utilization.

=== Inexperience with Open Source ===
The initial committers include PMC members and committers of other
Apache projects. They have experience with open source projects,
from incubation through to top-level status. They have been
involved in the open source development process, and are familiar with
releasing code under an open source license.

=== Homogeneous Developers ===
The initial set of committers is from a limited set of organizations,
but we expect to attract new contributors from diverse organizations
and will thus grow organically once approved for incubation. Our prior
experience with other open source projects will help various
contributors to actively participate in our project.

=== Reliance on Salaried Developers ===
Many developers are from Seoul National University, so this risk is not applicable.

=== Relationships with Other Apache Products ===
Onyx positions itself among multiple Apache products. It runs on
Apache REEF for container management. It also utilizes many useful
development tools including Apache Maven, Apache Log4J, and multiple
Apache Commons components. Onyx supports the Apache Beam programming
model for user applications. We are currently working on supporting
the Apache Spark programming APIs as well.

=== An Excessive Fascination with the Apache Brand ===
We hope to make Onyx a powerful system for data processing, meeting
various needs for different deployment characteristics across a wider
variety of environments. We see the limitations of simply putting code
on GitHub, and we believe the Apache community will help Onyx grow
into an impactful and innovative
open source project. We believe Onyx is a great fit for the Apache
Software Foundation due to the collaboration it aims to achieve from
the big data processing community.

== Documentation ==
The current documentation for Onyx is at https://snuspl.github.io/onyx/.

== Initial Source ==
The Onyx codebase is currently hosted at https://github.com/snuspl/onyx.

== External Dependencies ==
To the best of our knowledge, all Onyx dependencies are distributed
under Apache compatible licenses. Upon acceptance to the incubator, we
would begin a thorough analysis of all transitive dependencies to
verify this fact and further introduce license checking into the build
and release process.

== Cryptography ==
Not applicable.

== Required Resources ==
=== Mailing Lists ===
We will operate two mailing lists as follows:
   * Onyx PMC discussions: private@onyx.incubator.apache.org
   * Onyx developers: dev@onyx.incubator.apache.org

=== Git Repositories ===
Upon incubation: https://github.com/apache/incubator-onyx.
After the incubation, we would like to move the existing repo
https://github.com/snuspl/onyx to the Apache infrastructure.

=== Issue Tracking ===
Onyx currently tracks its issues using the GitHub issue tracker:
https://github.com/snuspl/onyx/issues. We plan to migrate to Apache
JIRA.

== Initial Committers ==
  * Byung-Gon Chun
  * Jeongyoon Eo
  * Geon-Woo Kim
  * Joo Yeon Kim
  * Gyewon Lee
  * Jung-Gil Lee
  * Sanha Lee
  * Wooyeon Lee
  * Yunseong Lee
  * JangHo Seo
  * Won Wook Song
  * Taegeon Um
  * Youngseok Yang

== Affiliations ==
  * SNU (Seoul National University)
    * Byung-Gon Chun
    * Jeongyoon Eo
    * Geon-Woo Kim
    * Gyewon Lee
    * Sanha Lee
    * Wooyeon Lee
    * Yunseong Lee
    * JangHo Seo
    * Won Wook Song
    * Taegeon Um
    * Youngseok Yang

  * LG
    * Jung-Gil Lee

  * Samsung
    * Joo Yeon Kim

  * Viva Republica
    * Geon-Woo Kim

== Sponsors ==
=== Champions ===
Byung-Gon Chun

=== Mentors ===
  * Hyunsik Choi
  * Byung-Gon Chun
  * Markus Weimer
  * Reynold Xin

=== Sponsoring Entity ===
The Apache Incubator



-- 
Byung-Gon Chun
