incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "ApexProposal" by P. Taylor Goetz
Date Wed, 05 Aug 2015 17:29:38 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "ApexProposal" page has been changed by P. Taylor Goetz:
https://wiki.apache.org/incubator/ApexProposal?action=diff&rev1=20&rev2=21

Comment:
minor grammatical corrections

  == Abstract ==
  Apex is an enterprise-grade, native YARN, big data-in-motion platform that unifies stream
processing and batch processing. Apex processes big data in motion in a highly scalable,
highly performant, fault-tolerant, stateful, secure, distributed, and easily operable way.
It provides a simple API that enables users to write or re-use generic Java code, thereby
lowering the expertise needed to write big data applications.
  
- Functional and operational specifications are separated. Apex is designed in a way to enable
users to write their own code (aka user defined functions) as is and leave all operability
to the platform. The API is very simple and is designed to allow users to drop in their code
as is. The platform mainly deals with operability and treats functional code as a black box.
Operability includes fault tolerance, scalability, security, ease of use, metrics api, webservices
etc. In other words there is no separation of UDF (user defined functions), as all functional
code is UDF. This frees users to focus on functional development, and lets platform provide
operability support. The same code runs as is with different operability attributes. The data-in-motion
architecture of Apex unifies stream as well as batch processing in a single platform. Since
Apex is a native YARN application, it leverages all the components of YARN without duplication.
Apex was developed with YARN in mind and has no overlapping components/functionality with
YARN.
+ Functional and operational specifications are separated. Apex is designed in a way to enable
users to write their own code (aka user defined functions) as is and leave all operability
to the platform. The API is very simple and is designed to allow users to drop in their code
as is. The platform mainly deals with operability and treats functional code as a black box.
Operability includes fault tolerance, scalability, security, ease of use, metrics api, webservices,
etc. In other words there is no separation of UDF (user defined functions), as all functional
code is UDF. This frees users to focus on functional development, and lets platform provide
operability support. The same code runs as is with different operability attributes. The data-in-motion
architecture of Apex unifies stream as well as batch processing in a single platform. Since
Apex is a native YARN application, it leverages all the components of YARN without duplication.
Apex was developed with YARN in mind and has no overlapping components/functionality with
YARN.
  
  The Apex platform is supplemented by project Malhar, which is a library of operators that
implement common business logic functions needed by customers who want to quickly develop
applications. These operators provide access to HDFS, S3, NFS, FTP, and other file systems;
Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySQL, Cassandra, MongoDB, Redis,
HBase, CouchDB and other databases along with JDBC connectors. The Malhar library also includes
a host of other common business logic patterns that help users to significantly reduce the
time it takes to go into production. Ease of integration with all other big data technologies
is one of the primary missions of Malhar.
  
  == Proposal ==
- The goal of this proposal is to establish the core engine of DataTorrent RTS product as
a Apache Software Foundation (ASF) project in order to build a vibrant, diverse, and self-governed
open source community around the technology. DataTorrent will continue to sell management
tools, application building tools, easy to use big data applications, and custom high end
business logic operators. This proposal covers the Apex source code (written in Java), Apex
documentation and other materials currently available on https://github.com/DataTorrent/Apex.
This proposal also covers the Malhar source code (written in Java), Malhar documentation,
and other materials currently available on https://github.com/DataTorrent/Malhar. We have
done a trademark check on the name Apex, and have concluded that the Apex name is likely to
be a suitable project name. 
+ The goal of this proposal is to establish the core engine of DataTorrent RTS product as
an Apache Software Foundation (ASF) project in order to build a vibrant, diverse, and self-governed
open source community around the technology. DataTorrent will continue to sell management
tools, application building tools, easy to use big data applications, and custom high end
business logic operators. This proposal covers the Apex source code (written in Java), Apex
documentation and other materials currently available on https://github.com/DataTorrent/Apex.
This proposal also covers the Malhar source code (written in Java), Malhar documentation,
and other materials currently available on https://github.com/DataTorrent/Malhar. We have
done a trademark check on the name Apex, and have concluded that the Apex name is likely to
be a suitable project name. 
  
  == Background ==
- DataTorrent RTS is a mature and robust product developed as a native YARN application. RTS
1.0 was launched in summer of 2014; RTS 2.0 was launched in Jan 2015. Both were well received
by customers. RTS is among the first enterprise grade platform that was developed from ground
up as native YARN application. DataTorrent RTS is currently maintained by engineers as a closed
source project. Even though the engineers behind RTS are experienced software engineers and
are knowledge leaders in data-in-motion platforms, they have had little exposure to the open
source governance process. Customers are currently running applications based on DataTorrent
RTS in production.
+ DataTorrent RTS is a mature and robust product developed as a native YARN application. RTS
1.0 was launched in summer of 2014; RTS 2.0 was launched in Jan 2015. Both were well received
by customers. RTS is among the first enterprise grade platform that was developed from the
ground up as native YARN application. DataTorrent RTS is currently maintained by engineers
as a closed source project. Even though the engineers behind RTS are experienced software
engineers and are knowledge leaders in data-in-motion platforms, they have had little exposure
to the open source governance process. Customers are currently running applications based
on DataTorrent RTS in production.
  
  == Rationale ==
- Big data applications written for non-Hadoop platforms typically require major rewrites
 to get them to work with Hadoop. This rewriting creates a significant bottleneck in terms
of resources (expertise) which in turn jeopardizes the viability of such an endeavour. It
is hard enough to acquire big data expertise, demanding additional expertise to do a major
code conversion makes it a very hard problem for projects to successfully migrate to Hadoop.
Also due to batch processing nature of Hadoop’s MapReduce paradigm, users often have to
wait tens of minutes to see results and act on them due to various delays in data flow. DataTorrent’s
RTS data-in-motion architecture is designed to address this problem. It enables even the non
big data developer to write code and operate it in a scalable, fault tolerant manner. The
big data-in-motion architecture of DataTorrent’s RTS enables ease of integration into current
enterprise infrastructure. This goal was achieved by keeping the API simple and empowering
users to put in the connector code as is (or with minimal changes). 
+ Big data applications written for non-Hadoop platforms typically require major rewrites
 to get them to work with Hadoop. This rewriting creates a significant bottleneck in terms
of resources (expertise) which in turn jeopardizes the viability of such an endeavour. It
is hard enough to acquire big data expertise, demanding additional expertise to do a major
code conversion makes it a very hard problem for projects to successfully migrate to Hadoop.
Also, due to the batch processing nature of Hadoop’s MapReduce paradigm, users often have
to wait tens of minutes to see results and act on them due to various delays in data flow.
DataTorrent’s RTS data-in-motion architecture is designed to address this problem. It enables
even the non big data developer to write code and operate it in a scalable, fault tolerant
manner. The big data-in-motion architecture of DataTorrent’s RTS enables ease of integration
into current enterprise infrastructure. This goal was achieved by keeping the API simple and
empowering users to put in the connector code as is (or with minimal changes). 
  
- Malhar is a manifestation of this reality, and we or the customer engineers were able to
create these connectors within a day or so if not within a week. Connectors include those
to message bus(es), file systems, databases, other protocols, and more continue to be added.
Over a period of time we expect users to simply pick a connector that already exists in Malhar
and get going on integration into their current enterprise infrastructure. Within the data-in-motion
architecture a stream application is one with connector(s) to say Kafka, JMS, or Flume; while
batch application is one with connector(s) to HDFS, HBase, FTP, NFS, S3n etc. This allows
usage of the platform for both stream as well as batch processing with same business logic
as is. A complete separation of user written application code from all operational aspects
of the system significantly, and support code for YARN significantly expands the potential
use cases that can migrate to use Hadoop.
+ Malhar is a manifestation of this reality, and we or the customer engineers were able to
create these connectors within a day or so if not within a week. Connectors include those
to integrate with message bus(es), file systems, databases, other protocols, and more continue
to be added. Over a period of time we expect users to simply pick a connector that already
exists in Malhar and quickly begin integrating with their current enterprise infrastructure.
Within the data-in-motion architecture a stream application is one with connector(s) to say
Kafka, JMS, or Flume; while a batch application is one with connector(s) to HDFS, HBase, FTP,
NFS, S3n etc. This allows usage of the platform for both stream as well as batch processing
with same business logic. Complete separation of user written application code from all operational
aspects of the system, as well as support code for YARN, significantly expands the potential
use cases that can migrate to use Hadoop.
  
- Apex will enable Hadoop eco-system to migrate a lot more use cases. It will enable Hadoop
eco-system to deliver on a promise to rapidly transform current IT infrastructure. Apex will
help in significantly increasing productization of big data projects. One of the main barometer
of success in the Hadoop eco-system is significant reduction of time to market for big data
applications migrating to Hadoop. We believe that Apex will be one of the platforms that will
enable users to extract value from big data, by reducing time to market. This rapid innovation
can be optimally achieved through a vibrant, diverse, self-governed community collectively
innovating around Apex and Malhar library, while at the same time cross-pollinating with various
other big data platforms. ASF is an ideal place to meet this goal.
+ Apex will enable Hadoop eco-system to migrate a lot more use cases. It will enable the Hadoop
eco-system to deliver on a promise to rapidly transform current IT infrastructure. Apex will
help in significantly increasing productization of big data projects. One of the main barometers
of success in the Hadoop eco-system is significant reduction of time to market for big data
applications migrating to Hadoop. We believe that Apex will be one of the platforms that will
enable users to extract value from big data, by reducing time to market. This rapid innovation
can be optimally achieved through a vibrant, diverse, self-governed community collectively
innovating around Apex and the Malhar library, while at the same time cross-pollinating with
various other big data platforms. ASF is an ideal place to meet this goal.
  
  == Initial Goals ==
- Our initial goals are to bring Apex and Malhar repositories into the ASF, adapt internal
engineering processes to the open development, and foster a collaborative development model
the "Apache Way." DataTorrent plans to develop new functionality in an open, community-driven
way. To get there, the existing internal build, test and release processes will be refactored
to support open development. We already have an active user community on google groups that
we intend to migrate to Apache.
+ Our initial goals are to bring Apex and Malhar repositories into the ASF, adapt internal
engineering processes to open development, and foster a collaborative development model in
accordance with the "Apache Way." DataTorrent plans to develop new functionality in an open,
community-driven way. To get there, the existing internal build, test and release processes
will be refactored to support open development. We already have an active user community on
google groups that we intend to migrate to Apache.
  
  == Current Status ==
- Currently, the project Apex code base is available under Apache 2.0 license (https://github.com/DataTorrent/Apex).
Project Malhar code base is available under Apache 2.0 license (https://github.com/DataTorrent/Malhar).
Project Malhar was open sourced 2 years ago which should make it easy for the project Malhar
team to adapt to an  open, collaborative, and meritocratic environment. Contributors of Malhar
are employees of DataTorrent or have agreed to the shift to Apache. Project Apex, in contrast,
was developed as a proprietary, closed-source product, but the internal engineering practices
adopted by the development team were common to Malhar, and will should lend themselves well
to an open  environment. DataTorrent plans to execute a software grant agreement as part of
the launch of the incubation of Apex as an Apache project.
+ Currently, the project Apex code base is available under Apache 2.0 license (https://github.com/DataTorrent/Apex).
Project Malhar code base is available under Apache 2.0 license (https://github.com/DataTorrent/Malhar).
Project Malhar was open sourced 2 years ago which should make it easy for the project Malhar
team to adapt to an  open, collaborative, and meritocratic environment. Contributors of Malhar
are employees of DataTorrent or have agreed to the shift to Apache. Project Apex, in contrast,
was developed as a proprietary, closed-source product, but the internal engineering practices
adopted by the development team were common to Malhar, and should lend themselves well to
an open  environment. DataTorrent plans to execute a software grant agreement as part of the
launch of the incubation of Apex as an Apache project.
  
  The DataTorrent team has always focused on building a robust end user community of paying
and non-paying customers. We think that the existing community centered around the existing
Google Groups mailing list should be relatively easy to transform into an Apache-style community
including both users and developers. 
  
@@ -80, +80 @@

  The tools and development practices in place for the DataTorrent RTS and Malhar products
are compatible with the ASF infrastructure and thus we do not anticipate any on-boarding pains.
Migration from the current GitHub repository is also expected to be straightforward.
  
  === Orphaned products ===
- DataTorrent is fully committed to DataTorrent Apex and Malhar and the product will continue
to be based on the Apex project. Moreover, DataTorrent has a vested interest in making Apex
succeed by driving its close integration with sister ASF projects. We expect this to further
reduces the risk of orphaning the product.
+ DataTorrent is fully committed to DataTorrent Apex and Malhar and the product will continue
to be based on the Apex project. Moreover, DataTorrent has a vested interest in making Apex
succeed by driving its close integration with sister ASF projects. We expect this to further
reduce the risk of orphaning the product.
  
  === Inexperience with Open Source ===
- DataTorrent has embraced open source software by open sourcing Malhar project under Apache
2.0 license. DataTorrent team includes veterans from Yahoo! Hadoop team. Although some of
the initial committers have not been developers on an entirely open source, community-driven
project, we expect to bring to bear the open development practices of Malhar to Apex project.
Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal.
The project will rely on their guidance and collective wisdom to quickly transition the entire
team of initial committers towards practicing the Apache Way. DataTorrent is also driving
Kafka on YARN (KOYA) initiative.
+ DataTorrent has embraced open source software by open sourcing Malhar project under Apache
2.0 license. The DataTorrent team includes veterans from the Yahoo! Hadoop team. Although
some of the initial committers have not been developers on an entirely open source, community-driven
project, we expect to bring to bear the open development practices of Malhar to the Apex project.
Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal.
The project will rely on their guidance and collective wisdom to quickly transition the entire
team of initial committers towards practicing the Apache Way. DataTorrent is also driving
the Kafka on YARN (KOYA) initiative.
  
  === Homogeneous Developers ===
  While most of the initial committers are employed by DataTorrent, we have already seen a
healthy level of interest from our existing customers and partners. We intend to convert that
interest directly into participation and will be investing in activities to recruit additional
committers from other companies.
@@ -99, +99 @@

  
  
  == Documentation ==
- See documentation for the current state of the project documentation available as part of
the GitHub repositories - https://github.com/DataTorrent/Apex; https://github.com/DataTorrent/Malhar.
In addition a list of demos that serve as a how to guide is available at https://github.com/DataTorrent/Malhar/tree/master/demos
+ See documentation for the current state of the project documentation available as part of
the GitHub repositories - https://github.com/DataTorrent/Apex; https://github.com/DataTorrent/Malhar.
In addition a list of demos that serve as a how to guide are available at https://github.com/DataTorrent/Malhar/tree/master/demos
  
  == Initial Source ==
  DataTorrent has released the source code for Apex under Apache 2.0 License at https://github.com/DataTorrent/Apex,
and that of Malhar under Apache 2.0 License at https://github.com/DataTorrent/Malhar. We encourage
ASF community members interested in this proposal to download the source code, review it and
try out the software.
@@ -108, +108 @@

  As soon as Apex is approved to join Apache Incubator, DataTorrent will execute a Software
Grant Agreement and the source code will be transitioned onto ASF infrastructure. The code
is already licensed under the Apache License, version 2.0. We know of no legal encumbrances
that would inhibit the transfer of source code to the ASF.
  
  == External Dependencies ==
- All dependencies fall under the permissive licenses categories, or weak copy left (http://www.apache.org/legal/resolved.html#category-b).
We intend to remove dependencies GPL licensed technologies on which APex or Malhar depend
on. These technologies are optional and have been marked as such.
+ All dependencies fall under the permissive licenses categories, or weak copy left (http://www.apache.org/legal/resolved.html#category-b).
We intend to remove the dependencies on GPL licensed technologies on which APex or Malhar
depend. These technologies are optional and have been marked as such.
  
  Embedded dependencies (relocated):
     * None
@@ -228, +228 @@

     * Means of setting up regular builds for Apex on builds.apache.org
     * Means of setting up regular builds for Malhar on builds.apache.org
  
+ 
+ ---- /!\ '''Edit conflict - other version:''' ----
  === Rationale for Malhar and Apex having separate git and jira ===
  The decision to propose a single community was taken after a lot of thought; our proposal
is for the ASF Incubator to enable us to try this out.
  
  So far we have managed Malhar and Apex as two repos and two jiras. Both code bases are released
under Apache 2.0 and are proposed for incubation. In terms of our vision to enable innovation
around a native YARN data-in-motion platform that unifies stream processing and batch processing,
Malhar and Apex go hand in hand. Apex has a base API that consists of a Java API (functional)
and attributes (operability). Malhar is a manifestation of this API, but from the user's perspective,
Malhar is itself an API to leverage business logic. Over the past three years we have found that
the cadence of releases and API changes in Malhar is much more rapid than in Apex, and it was operationally
much easier to separate them into their own repos. For example, Malhar developers often locked
into the previous stable build of Apex. We do not, however, believe in two levels of committers.
We believe there should be one community that works across both and innovates with the idea that
Malhar and Apex combined provide the value proposition. We are proposing that the Apache incubation
process help us foster development of one community (mailing list, committers) and yet
be ok with two repos. We are proposing that this be taken up during incubation. The community
will learn if this works. The decision on whether to split them into two projects can be taken
after the learning curve during incubation.
  
+ 
+ ---- /!\ '''Edit conflict - your version:''' ----
+ 
+ ---- /!\ '''End of edit conflict''' ----
  == Initial Committers ==
     * Roma Ahuja (rahuja at directv dot com)
     * Isha Arkatkar (isha at datatorrent dot com)

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

