incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "ApexProposal" by AmolKekre
Date Tue, 04 Aug 2015 06:44:02 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "ApexProposal" page has been changed by AmolKekre:
https://wiki.apache.org/incubator/ApexProposal?action=diff&rev1=13&rev2=14

  == Abstract ==
  Apex is an enterprise grade native YARN big data-in-motion platform that unifies stream
processing as well as batch processing. Apex processes big data in-motion in a highly scalable,
highly performant, fault tolerant, stateful, secure, distributed, and easily operable way.
It provides a simple API that enables users to write or re-use generic Java code, thereby
lowering the expertise needed to write big data applications.
  
- Functional and operational specifications are separated. Apex is designed in a way to enable
users to write their own code (aka user defined functions) as is and leave all operability
to the platform. The API is very simple and is designed to allow users to drop in their code
as is. The platform mainly deals with operability and treats functional code as a black box.
Operability includes fault tolerance, scalability, security, ease of use, metrics api, webservices
etc. In other words there is no separation of UDF (user defined functions), as all functional
code is UDF. This frees users to focus on functional development, and lets platform provide
operability support. The same code runs as is with different operability attributes. The data-in-motion
architecture of Apex unifies stream as well as batch processing in a single platform. Since
Apex is a native Yarn application, it leverages all the components of Yarn without duplication.
Apex was developed with Yarn in mind and has no overlapping components/functionality with
Yarn. 
+ Functional and operational specifications are separated. Apex is designed in a way to enable
users to write their own code (aka user defined functions) as is and leave all operability
to the platform. The API is very simple and is designed to allow users to drop in their code
as is. The platform mainly deals with operability and treats functional code as a black box.
Operability includes fault tolerance, scalability, security, ease of use, metrics api, webservices
etc. In other words there is no separation of UDF (user defined functions), as all functional
code is UDF. This frees users to focus on functional development, and lets platform provide
operability support. The same code runs as is with different operability attributes. The data-in-motion
architecture of Apex unifies stream as well as batch processing in a single platform. Since
Apex is a native YARN application, it leverages all the components of YARN without duplication.
Apex was developed with YARN in mind and has no overlapping components/functionality with
YARN. 
  
  The Apex platform is supplemented by project Malhar which is a library of operators that
implement common business logic functions needed by customers who want to quickly develop
applications. These operators provide access to HDFS, S3, NFS, FTP, and other file systems;
Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySQL, Cassandra, MongoDB, Redis,
HBase, CouchDB and other databases along with JDBC connectors. The Malhar library also includes
a host of other common business logic patterns that help users to significantly reduce the
time it takes to go into production. Ease of integration with all other big data technologies
is one of the primary missions of Malhar.
  
@@ -11, +11 @@

  The goal of this proposal is to establish the core engine of DataTorrent RTS product as
an Apache Software Foundation (ASF) project in order to build a vibrant, diverse, and self-governed
open source community around the technology. DataTorrent will continue to sell management
tools, application building tools, easy to use big data applications, and custom high end
business logic operators. This proposal covers the Apex source code (written in Java), Apex
documentation and other materials currently available on https://github.com/DataTorrent/Apex.
This proposal also covers the Malhar source code (written in Java), Malhar documentation,
and other materials currently available on https://github.com/DataTorrent/Malhar. We have
done a trademark check on the name Apex, and have concluded that the Apex name is likely to
be a suitable project name. 
  
  == Background ==
- DataTorrent RTS is a mature and robust product developed as a native Yarn application. RTS
1.0 was launched in summer of 2014; RTS 2.0 was launched in Jan 2015. Both were well received
by customers. RTS is among the first enterprise grade platform that was developed from ground
up as native Yarn application. DataTorrent RTS is currently maintained by engineers as a closed
source project. Even though the engineers behind RTS are experienced software engineers and
are knowledge leaders in data-in-motion platforms, they have had little exposure to the open
source governance process. Customers are currently running applications based on DataTorrent
RTS in production.
+ DataTorrent RTS is a mature and robust product developed as a native YARN application. RTS
1.0 was launched in summer of 2014; RTS 2.0 was launched in Jan 2015. Both were well received
by customers. RTS is among the first enterprise grade platform that was developed from ground
up as native YARN application. DataTorrent RTS is currently maintained by engineers as a closed
source project. Even though the engineers behind RTS are experienced software engineers and
are knowledge leaders in data-in-motion platforms, they have had little exposure to the open
source governance process. Customers are currently running applications based on DataTorrent
RTS in production.
  
  == Rationale ==
  Big data applications written for non-Hadoop platforms typically require major rewrites
to get them to work with Hadoop. This rewriting creates a significant bottleneck in terms
of resources (expertise), which in turn jeopardizes the viability of such an endeavour. It
is hard enough to acquire big data expertise; demanding additional expertise to do a major
code conversion makes it a very hard problem for projects to successfully migrate to Hadoop.
Also, due to the batch processing nature of Hadoop’s MapReduce paradigm, users often have to
wait tens of minutes to see results and act on them due to various delays in data flow. DataTorrent’s
RTS data-in-motion architecture is designed to address this problem. It enables even the non
big data developer to write code and operate it in a scalable, fault tolerant manner. The
big data-in-motion architecture of DataTorrent’s RTS enables ease of integration into current
enterprise infrastructure. This goal was achieved by keeping the API simple and empowering
users to put in the connector code as is (or with minimal changes). 
@@ -83, +83 @@

  DataTorrent is fully committed to DataTorrent Apex and Malhar and the product will continue
to be based on the Apex project. Moreover, DataTorrent has a vested interest in making Apex
succeed by driving its close integration with sister ASF projects. We expect this to further
reduce the risk of orphaning the product.
  
  === Inexperience with Open Source ===
- DataTorrent has embraced open source software by open sourcing Malhar project under Apache
2.0 license. DataTorrent team includes veterans from Yahoo! Hadoop team. Although some of
the initial committers have not been developers on an entirely open source, community-driven
project, we expect to bring to bear the open development practices of Malhar to Apex project.
Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal.
The project will rely on their guidance and collective wisdom to quickly transition the entire
team of initial committers towards practicing the Apache Way. DataTorrent is also driving
Kafka on Yarn (KOYA) initiative.
+ DataTorrent has embraced open source software by open sourcing Malhar project under Apache
2.0 license. DataTorrent team includes veterans from Yahoo! Hadoop team. Although some of
the initial committers have not been developers on an entirely open source, community-driven
project, we expect to bring to bear the open development practices of Malhar to Apex project.
Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal.
The project will rely on their guidance and collective wisdom to quickly transition the entire
team of initial committers towards practicing the Apache Way. DataTorrent is also driving
Kafka on YARN (KOYA) initiative.
  
  === Homogeneous Developers ===
  While most of the initial committers are employed by DataTorrent, we have already seen a
healthy level of interest from our existing customers and partners. We intend to convert that
interest directly into participation and will be investing in activities to recruit additional
committers from other companies.
@@ -92, +92 @@

  Most of the contributors are paid to work in the Big Data space. While they might wander
from their current employers, they are unlikely to venture far from their core expertise
and thus will continue to be engaged with the project regardless of their current employers.
  
  === Relationships with Other Apache Products ===
- As mentioned in the Alignment section, Apex may consider various degrees of integration
and code exchange with Apache Hadoop (Yarn and HDFS), Apache Kafka, Apache HBase, Apache Flume,
Apache Cassandra, Apache Accumulo, Apache Tez, Apache Hive, Apache Pig, Apache Storm, Apache
Samza, Apache Spark, Apache Slider. Given the success that the DataTorrent RTS product enjoyed,
we expect integration points to be inside and outside the project. We look forward to collaborating
with these communities as well as other communities under the Apache umbrella.
+ As mentioned in the Alignment section, Apex may consider various degrees of integration
and code exchange with Apache Hadoop (YARN and HDFS), Apache Kafka, Apache HBase, Apache Flume,
Apache Cassandra, Apache Accumulo, Apache Tez, Apache Hive, Apache Pig, Apache Storm, Apache
Samza, Apache Spark, Apache Slider. Given the success that the DataTorrent RTS product enjoyed,
we expect integration points to be inside and outside the project. We look forward to collaborating
with these communities as well as other communities under the Apache umbrella.
  
  === An Excessive Fascination with the Apache Brand ===
  While we intend to leverage the Apache ‘branding’ when talking to other projects as
testament of our project’s ‘neutrality’, we have no plans for making use of Apache brand
in press releases nor posting billboards advertising acceptance of Apex into Apache Incubator.

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

