incubator-cvs mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "RheemProposal" by venkatesh
Date Sun, 11 Jun 2017 01:21:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "RheemProposal" page has been changed by venkatesh:
https://wiki.apache.org/incubator/RheemProposal

Comment:
Initial version

New page:
= Apache Rheem Proposal =
 
= Abstract =
Rheem is a cross-platform data analytics tool. Its goal is to decouple the business logic
of data analytics applications from concrete data processing platforms, such as Apache Hadoop
or Apache Spark. Hence, it tames the complexity that arises from the “Cambrian explosion”
of novel data analytics tools that we currently witness. 
 
= Proposal =
Rheem is a cross-platform system that provides an abstraction over data processing systems
to free users from the burdens of: performing tedious and costly data migration and integration
tasks to run their applications; and choosing the right data processing platforms for their
applications. To achieve this, Rheem: (1) provides an abstraction on top of existing data
processing platforms that allows users to specify their data analytics tasks in a form of
a DAG of operators; (2) comes with a cross-platform optimizer for automating the selection
of suitable/efficient platforms; and (3) and finally takes care of executing the optimized
plan, including communication across platforms. In summary, Rheem has the following salient
features:
 * Flexible Data Model -- It considers a flexible and simple data model based on data quanta.
A data quantum is an atomic processing unit in the system, that can represent a large spectrum
of data formats, such as data points for a machine learning application, tuples for a database
application, or RDF triples. Hence, Rheem is able to express a wide range of data analytics
tasks.
 * Platform independence -- It provides a simple interface (currently Java and Scala) that
is inspired by established programming models, such as that of Apache Spark and Apache Flink.
Users represent their data analytic tasks as a DAG (Rheem plan), where vertices correspond
to Rheem operators and edges represent data flows (data quanta flowing) among these operators.
A Rheem operator defines a particular kind of data transformation over an input data quantum,
ranging from basic functionality (e.g., transformations, filters, joins) to complex, extensible
tasks (e.g., PageRank).
 * Cross-platform execution -- Besides running a data analytic task on any data processing
platform, it also comes with an optimizer that can decide to execute a single data analytic
task using multiple data processing platforms. This allows for exploiting the capabilities
of different data processing platforms to perform complex data analytic tasks more efficiently.
 * Self-tuning UDF-based cost model -- Its optimizer uses a cost model fully based on UDFs.
This not only enables Rheem to learn the cost functions of newly added data processing platforms,
but also allows developers to tune the optimizer at will.
 * Extensibility -- It treats data processing platforms as plugins to allow users (developers)
to easily incorporate new data processing platform into the system. This is achieved by exposing
the functionalities of data processing platforms as operators (execution operators). The same
approach is followed at the Rheem interface, where users can also extend Rheem capabilities,
i.e., the operators, easily.
 
We plan to work on the stability of all these features as well as extending Rheem with more
advanced features. Furthermore, Rheem currently supports Apache Spark, Standalone Java, GraphChi,
and Relational databases via JDBC. We plan to incorporate more data processing platforms,
such as Apache Flink and Apache Hadoop.
 
= Background =
Many organizations and companies collect or produce large variety of data to apply data analytics
over them. This is because insights from data rapidly allow them to make better decisions.
Thus, the pursuit for efficient and scalable data analytics as well as the one-size-does-not-fit-all
philosophy has given rise to a plethora of data processing platforms. Examples of these specialized
processing platforms range from DBMSs to MapReduce-like platforms.
 
However, today's data analytics are moving beyond the limits of a single data processing platform.
More and more applications need to perform complex data analytics over several data processing
platforms. For example, (i) IBM reported that North York hospital needs to process 50 diverse
datasets, which are on a dozen different internal systems, (ii) oil & gas companies stated
they need to process large amounts of data they produce everyday, e.g., a single oil company
can produce more than 1.5TB of diverse (structured and unstructured) data per day, (iii) Fortune
magazine stated that airlines need to analyze large datasets, which are produced by different
departments, are of different data formats, and reside on multiple data sources, to produce
global reports for decision makers, and (iv) Hewlett Packard has claimed that, according to
its customer portfolio, business intelligence typically require a single analytics pipeline
using different processing platforms at different parts of the pipeline. These are just few
examples of emerging applications that require a diversity of data processing platforms.
 
Today, developers have to deal with this myriad of data processing platforms. That is, they
have to choose the right data processing platform for their applications (or data analytic
tasks) and to familiarize with the intricacies of the different platforms to achieve high
efficiency and scalability. Several systems have also appeared with the goal of helping users
to easily glue several platforms together, such as Apache Drill, PrestoDB, and Luigi. Nevertheless,
all these systems still require quite good expertise from users to decide which data processing
platforms to use for the data analytic task at hand. In consequence, great engineering effort
is required to unify the data from various sources, to combine the processing capabilities
of different platforms, and to maintain those applications, so as to unleash the full potential
of the data. In the worst case, such applications are not built in the first place, as it
seems too much of a daunting endeavor.
 
It is evident that there is an urgent need to release developers from the burden of knowing
all the intricacies of choosings and glueing together data processing platforms for supporting
their applications (data analytic tasks). Developers must focus only on the logics of their
applications.
 
= Rationale =
It is evident that there is an urgent need to release developers from the burden of knowing
all the intricacies of choosing and glueing together data processing platforms for supporting
their applications (data analytic tasks). Developers must focus only on the logics of their
applications. Surprisingly, there is no open source system trying to satisfy this urgent need.
Rheem aims at filling this gap. It copes with this urgent need by providing both a common
interface over data processing platforms and an optimizer to execute a data analytic tasks
on the right data processing platform(s) seamlessly. As Apache is the place where most of
the important big data systems are, we then consider Apache as the right place for Rheem.
 
= Current Status =
The current version of Rheem (v0.2.1) was co-developed by staff, students, and interns at
the Qatar Computing Research Institute (QCRI) and the Hasso-Plattner Institute (HPI). The
projects was initiated at and sponsored by QCRI in 2015 with the goal of freeing data scientists
and developers from the intricacies of data processing platforms to support their analytic
tasks. The first open source release of Rheem was made only one year and a half later, in
June 13th of 2016, under the Apache Software License 2.0. Since then we have been very active
and have made two more releases of Rheem, the latest release was done on November 30th of
2016. We plan to have our next release by June 2017.
 
= Meritocracy =

All current Rheem developers are familiar with this development process at Apache and are
currently trying to follow this meritocracy process as much as possible. For example, Rheem
already follows a committer principle where any pull request is analyzed by at least one Rheem
core developer. This was one of the reasons of choosing Apache for Rheem as we all want to
encourage and keep this style of development for Rheem. 
 
= Community =

Rheem started as a pure research project at QCRI, but it quickly started developing into a
community. People from HPI quickly joined our efforts almost from the very beginning to make
this project a reality. Recently, the Pontifical Catholic University of Valparaiso (PUCV)
in Chile has also joined our efforts for developing Rheem. Currently, we are intensively seeking
to further develop both developer and user communities. To keep broaden the community, we
plan to also exploit our ongoing academic collaborations with the MIT. For instance, Rheem
is already being utilized for accessing multiple data sources in the context of a large data
analytics project led by MIT and QCRI. We also believe that Rheem's extensible architecture
(i.e., adding new operators and platforms) will further encourage community participation.
Furthermore, we are currently collaborating with a large airline company in the middle-east
in the context of Rheem. During incubation we plan to have Rheem adopted by this airline company
and will explicitly seek more industrial participation.
 
== Core Developers ==

The initial developers of the project are diverse, they are from three different institutions
(QCRI, HPI, and PUCV). Although 50% of the developers are from QCRI, due to the origins of
the project, we will work aggressively to change this during the incubation by recruiting
more developers from other institutions.
 
== Alignment ==
We believe Apache is the most natural home for taking Rheem to the next level. Apache is currently
hosting the most important big data systems. Hadoop, Spark, Flink, HBase, Hive, Tez, Reef,
Storm, Drill, and Ignite are just some examples of these technologies. Rheem fills a significant
gap -- it provides a common abstraction for all these platforms and decides on which platforms
to run a single data analytic task -- that exist  in the big data open source world. Rheem
is now being developed following the Apache-style development model. Also, it is well-aligned
with the Apache principle of building a community to impact the big data community.
 
= Known Risks =

== Orphaned Products ==

Currently, Rheem is one of the important projects at QCRI at long term. As a result, a team
of research scientists and engineers are working on full time basis on this project. Thus,
the risk of Rheem orphaned is relatively very low. Still, people outside QCRI (from HPI) has
also joined the project, which makes the risk of abandoning the project even lower. The PUCV
in Chile is also beginning to contribute to the code base and to develop a declarative query
language on top of Rheem. The project is constantly being monitored by email and frequent
Skype meeting as well as by weekly meetings physically at QCRI for QCRI people. Additionally,
at the end of each year, we meet to discuss the status of the project as well as to plan the
most important aspects we should work on during the year after.
 
== Inexperience with Open Source ==

Rheem quickly started being developed in open source between QCRI and HPI under the Apache
Software License 2.0. The source code is available on Github. Also few of the initial committers
have contributed to other open source projects:TODO
 
== Homogenous Developers ==

The initial committers are already geographically distributed among Qatar (QCRI), Germany
(HPI), and Chile (PUCV). During incubation, one of our main goals is to increase the heterogeneity
of the current community and we will work hard to achieve it.
 
== Reliance on Salaried Developers ==

Rheem is already being developed by a mix of full time and volunteer time. Only 3 of the initial
committers are working full time on this project. Although these three committers are paid
to contribute to this project, they are all passionate about the project. The rest of initial
committers are faculty, staff, and students working on the project on volunteer basis. So,
we are confident that the project will not decrease its development pace. Furthermore, we
are committed to recruit additional committers to significantly increase the development pace
of the project.
 
== Relationships with Other Apache Products ==

Rheem is somehow related to Apache Spark as its developing interface is inspired from Spark.
In contrast to Spark, Rheem is not a data processing platform, but a mediator between user
applications and data processing platforms. In this sense, Rheem is similar to the Apache
Drill project, and Apache Beam. However, Rheem significantly differs from Apache Drill in
two main aspects. First, Apache Drill provides only a common interface to query multiple data
storages and hence users have to specify in their query the data to fetch. Then, Apache Drill
translates the query to the processing platforms where the data is stored, e.g. into mongoDB
query representation. In contrast, in Rheem, users only specify the data path and Rheem decides
which is the best (performance-wise) data processing platforms to use to process such data.
Second, the query interface in Apache Drill is SQL. Rheem uses an interface based on operators
forming DAGs. In this latter point, we are currently developing a PIGLatin-like query language
for Rheem. In addition, in contrast to Apache Beam, Rheem not only allows users to use multiple
data processing platforms at the same time, but also it provides an optimizer to choose the
most efficient platform for the task at hand. In Apache Beam, users have to specify an appropriate
runner (platform). Apache Beam simply aims at abstracting the pipeline of the data processing.

 
Given these similarities with the two Apache projects mentioned above, we are looking forward
to collaborating with those communities. Still, we are open and would also love to collaborate
with other Apache communities as well.
 
== An Excessive Fascination with the Apache Brand ==

Rheem solves a real problem that currently users and developers have to deal with at a high
cost: monetary cost, high design and development efforts, and very time consuming. Therefore,
we believe that Rheem can be successful in building a large community around it. We are convinced
that the Apache brand and community process will significantly help us in building such a
community and to establish the project in the long-term. We simply believe that ASF is the
right home for Rheem to achieve this.
 
== Documentation ==

Further details, documentation, and publications related to Rheem can be found at http://da.qcri.org/rheem/
 
== Initial Source ==

The current source code of Rheem resides in Github: https://github.com/rheem-ecosystem/rheem

 
== External Dependencies ==

Rheem depends on the following Apache projects:
 * Maven
 * HDFS
 * Hadoop Common
 * Spark
 
Rheem depends on the following other open source projects organized by license:
 * slf4j
 * JSON
 * Snake Yaml
 * GraphChi
 * PostgreSQL
 * SQLite3
 
= Required Resources =

== Mailing lists ==
 * private@rheem.incubator.apache.org
 * dev@rheem.incubator.apache.org
 * commits@rheem.incubator.apache.org
 
== Subversion Directory ==

Git is the preferred source control system:
git://git.apache.org/repos/asf/incubator/rheem
 
== Issue Tracking ==

https://issues.apache.org/jira/browse/RHEEM
 
== Initial Committers ==

The following list gives the planned initial committers (in alphabetical order):
 * Bertty Contreras-Rojas <bertty.contreras.r@mail.pucv.cl>
 * Yasser Idris <yidris@hbku.edu.qa>
 * Zoi Kaoudi <zkaoudi@hbku.edu.qa>
 * Sebastian Kruse <Sebastian.Kruse@hpi.de>
 * Essam Mansour <emansour@hbku.edu.qa>
 * Wenceslao Palma-Muñoz <wenceslao.palma@pucv.cl>
 * Jorge-Arnulfo Quiane-Ruiz <jquianeruiz@hbku.edu.qa>
 * Anis Troudi <atroudi@hbku.edu.qa>
 
== Affiliations ==

 * Qatar Computing Research Institute, HBKU, Qatar (QCRI)
   * Yasser Idris
   * Zoi Kaoudi
   * Essam Mansour
   * Jorge-Arnulfo Quiane-Ruiz
   * Anis Troudi
 
 * Hasso-Plattner Institute, Germany (HPI)
   * Sebastian Kruse
 
 * Pontifical Catholic University of Valparaiso, Chile (PUCV)
   * Bertty Contreras-Rojas
   * Wenceslao Palma-Muñoz
 
= Sponsors =
 
== Champion ==
 
 * Venkatesh Seetharam (venkatesh)
 
== Nominated Mentors ==
 
 * Venkatesh Seetharam (venkatesh)
 
== Sponsoring Entity ==
 
The Apache Incubator
 
== Footnotes ==

 * 1 - Papers detailing Rheem are available at 

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org


Mime
View raw message