incubator-general mailing list archives

From Atri Sharma <a...@apache.org>
Subject [DISCUSS] Concerted Incubation Proposal
Date Tue, 06 Oct 2015 18:47:44 GMT
Hi All,

We would like to propose Concerted, a DIY platform for building highly
scalable, performance-oriented in-memory support engines for big data
platforms, for acceptance into the ASF Incubator.

Please find the proposal at the following link:
https://wiki.apache.org/incubator/ConcertedProposal

The proposal is also included at the end of this email.

We think that this project will give the ASF a project focused on
lightweight in-memory engines that also allows custom engines, and that it
will solve problems for many big data platforms by being available on
demand in the custom format that the user wants.

Please see the link below for the earlier discussion:

http://mail-archives.apache.org/mod_mbox/incubator-general/201510.mbox/%3CCA+ULb+t86CGhsgADbh5RVKKOeG+wPTYF-S-etF-aGs0AsVOvOA@mail.gmail.com%3E

We would be glad for feedback and comments. People willing to help us are
most welcome!

Regards,

Atri



= Abstract =

Concerted is an in-memory, write-less read-more engine that aims to provide
extreme read performance with a very high degree of concurrency and
scalability while minimizing its own resource footprint.

= Proposal =
Concerted is built on the principle that a new type of workload now
dominates the scene and needs to be supported: large-data-set analytical
workloads that run on large clusters or high-powered machines. Such
workloads depend on the ability to query large data sets efficiently and
with high concurrency while maintaining semantics such as immediate
consistency. An in-memory engine designed to support very heavy read query
loads while providing support for aggregation through various features
(such as multidimensional representation of tuples) will accelerate many
use cases around large-scale analytics.

Concerted believes that the application developer understands the user
application best. Massive read scaling should be available on demand and
flexible enough that users can decide which representation of, and access
to, the data suits their current requirements. Hence, Concerted is not
built on a traditional client/server model. Concerted provides users with
an API that can be used to load, read, update and delete data. The user
chooses which data structure to use for the current requirements. All API
access goes through Concerted's internal systems, such as the lock manager,
transaction manager and cache manager, which ensure that reads scale to a
high level in every API call.
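
To make the API description above more concrete, here is a minimal sketch of
an embedded (non client/server) store with load, read, update and delete
calls, where the caller picks the container type and a pthread read-write
lock lets many readers proceed at once. Every name in it (RwGuard,
ReadMostlyStore, demo) is hypothetical and chosen only for illustration. It
is not Concerted's actual API, and Concerted's real lock, transaction and
cache managers are certainly more involved than a single lock.

```cpp
// Hypothetical sketch: none of these names come from the Concerted code base.
#include <cassert>
#include <map>
#include <pthread.h>
#include <string>

// RAII guard that takes a pthread read-write lock in shared or exclusive mode.
class RwGuard {
public:
    RwGuard(pthread_rwlock_t* lock, bool exclusive) : lock_(lock) {
        if (exclusive) {
            pthread_rwlock_wrlock(lock_);
        } else {
            pthread_rwlock_rdlock(lock_);
        }
    }
    ~RwGuard() { pthread_rwlock_unlock(lock_); }
    RwGuard(const RwGuard&) = delete;
    RwGuard& operator=(const RwGuard&) = delete;

private:
    pthread_rwlock_t* lock_;
};

// Read-mostly store; the caller picks the container (std::map,
// std::unordered_map, ...) that suits the workload.
template <typename Container>
class ReadMostlyStore {
public:
    typedef typename Container::key_type Key;
    typedef typename Container::mapped_type Value;

    ReadMostlyStore() { pthread_rwlock_init(&lock_, nullptr); }
    ~ReadMostlyStore() { pthread_rwlock_destroy(&lock_); }
    ReadMostlyStore(const ReadMostlyStore&) = delete;
    ReadMostlyStore& operator=(const ReadMostlyStore&) = delete;

    // Insert or overwrite; writers take the lock exclusively.
    void load(const Key& key, const Value& value) {
        RwGuard guard(&lock_, true);
        data_[key] = value;
    }

    // Copy the value into *out; readers share the lock, so many can run at once.
    bool read(const Key& key, Value* out) const {
        RwGuard guard(&lock_, false);
        typename Container::const_iterator it = data_.find(key);
        if (it == data_.end()) return false;
        *out = it->second;
        return true;
    }

    // Change an existing entry; returns false if the key is absent.
    bool update(const Key& key, const Value& value) {
        RwGuard guard(&lock_, true);
        typename Container::iterator it = data_.find(key);
        if (it == data_.end()) return false;
        it->second = value;
        return true;
    }

    // Delete an entry; returns false if the key is absent.
    bool remove(const Key& key) {
        RwGuard guard(&lock_, true);
        return data_.erase(key) > 0;
    }

private:
    mutable pthread_rwlock_t lock_;
    Container data_;
};

// Usage: here the caller chooses an ordered std::map as the representation.
inline void demo() {
    ReadMostlyStore<std::map<std::string, int> > store;
    store.load("rows", 10);
    int value = 0;
    bool found = store.read("rows", &value);
    assert(found && value == 10);
    (void)found;
}
```

Swapping std::map for std::unordered_map in the template argument changes
the representation without touching the calling code, which is the kind of
per-use-case choice the paragraph above describes.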

Concerted is a Do It Yourself in-memory platform for building in-memory
supporting engines. The use case we have in mind is supporting big data
warehouses like Hive, but there are endless use cases for a custom, highly
scalable in-memory platform.

The goal of this proposal is to leverage an existing code base, available on
GitHub and licensed under the Apache License 2.0, to build a community
around the project. Currently the community consists of existing Concerted
hackers, people who have been following and associated with the project for
a while, and database experts who are excited about building a project like
this. We hope that entering Apache will help us attract more contributors
and connect with existing big data projects like Apache Hive, Apache HAWQ,
Apache Storm, Apache Tajo, Apache Spark and Apache Geode, leveraging their
community base while assisting their use cases with Concerted. We had a
discussion with the founders of Apache Tajo, and they showed interest in
using Concerted for some of their use cases.

= Background =
Relational databases were built with the cost of physical memory in mind.
That cost is no longer very relevant, and physical memory is now available
on demand. Another driving factor behind Concerted is the paradigm shift
that came with big data: disk I/O speeds are more of a bottleneck than ever
before. By combining the read dominance of analytical workloads with the
speed of in-memory structures, Concerted fits the current scene. Supporting
OLAP workloads with in-memory support for faster read constant queries and
joins will also be useful.

= Rationale =
As explained above, large analytical workloads need a lightweight in-memory
engine that supports massive read concurrency, ground-level support for
aggregations and analytics, extreme scalability and high read performance,
while being very light itself. Concerted aims to meet these needs. Concerted
is designed and built with three goals:


Performance
    To provide high-performance access to data from a large number of rows,
Concerted uses efficient representation and in-memory indexing of data,
coupled with high-performance transactions, custom transactions,
lightweight locking and lockless techniques, and an intelligent lock
manager.

Scalability
    Concerted is built with extreme concurrency and scalability in mind.

Efficiency
    Concerted aims to give the expected performance under a wide variety of
workloads while keeping its footprint as low as possible.

= Initial Goals =
The initial goal is to leverage an existing code base and invest in
building a community around the project. We anticipate a lot of initial
restructuring of the existing code so that it becomes easier to include new
contributors and minimize ramp-up time. We plan to approach this
refactoring in a fully transparent, community-driven way, thus starting to
practice the "Apache Way" governance model from the get-go.

Various contributors are developing individual changes in branches of the
GitHub repository, and our initial major goal will be to merge all of those
changes into the master repository.

= Current Status =
Concerted is currently being restructured to suit the needs of an open
source project. The current source is available at
https://github.com/atris/Concerted (please note that the updated code base
is not yet present on GitHub). Concerted is currently licensed under the
Apache License 2.0. Most of the code base is implemented in C and C++ and
has the external dependencies listed later.

== Meritocracy ==

We plan to drive the technical roadmap and implementation in a fully
transparent, community-driven way soliciting feedback from all of the
community members and building a consensus-driven approach to evolving the
code base and the community itself. Users and new contributors will be
treated with respect and welcomed. By participating in the community and
providing quality patches/support that move the project forward,
contributors will earn merit. They also will be encouraged to provide
non-code contributions (documentation, events, community management, etc.)
and will gain merit for doing so. Those with a proven support and quality
track record will be encouraged to become committers.

== Community ==
In-memory processing is at the cutting edge, and a new community around
performance-oriented systems, one that enhances relational database
performance with complete in-memory OLTP engines, will be of great benefit.
We therefore expect interest from data warehousing projects and
communities, as well as from projects and companies looking for
high-performance OLTP. In addition, Ingenium Data Systems is building
products around Concerted and will have salaried developers contribute to
the project as part of their job responsibilities.

== Core Developers ==
The core developers are a diverse group, many of whom are very experienced
in open source and the Apache Hadoop ecosystem. Specifically, Atri is an
Apache Apex committer, and Atri and Pavel are major contributors to the
PostgreSQL project. Atri is also a committer on other open source projects.

 * Amrish <amrishs AT ingeniumsys DOT com>
 * Nupur S <nupurs AT ingeniumsys DOT com>
 * Pavel Stehule <pavel DOT stehule AT gmail DOT com>
 * Atri Sharma <atri AT apache DOT org>
 * Nishith Singhal <nishsinghal AT gmail DOT com>
 * Michael Down <michael AT dowuk DOT com>
 * Vijayakumar Ramdoss <vijayakumar DOT ramdoss AT emc DOT com>
 * Wang Albert <albertwang87 AT gmail DOT com>
 * Hans-Jurgen Schonig <postgres AT cybertec DOT at>
 * Kris Popat <krispopat AT apache DOT org>
 * Ayrton Gomesz <com DOT ayrton AT gmail DOT com>

== Alignment ==
Concerted will be helpful to systems like Tajo, which can benefit from
in-memory structures optimized for heavy reads and joins (dimension tables).
In addition, Concerted will benefit projects looking for an in-memory
relational database as a metadata store, which is the case for most of the
Apache big data projects. We expect Apache HAWQ (incubating), Apache Hive,
Apache Storm and Apache Tajo to use Concerted as a supporting engine. For
example, a data warehouse built on HAWQ, Hive or Tajo can use Concerted as
an in-memory engine for querying and joining dimension tables.

= Known Risks =

== Orphaned Products ==
Most of the code was developed by a small group of core developers, which
may pose a risk of the product being orphaned. However, the code base is
simple compared to other open source projects, and interest in Concerted
has risen exponentially over the years, with many computer professionals
expressing interest in the project and trying it out in some use cases.
Specifically, some projects were done around Concerted at JIIT, Noida (an
engineering school), and Wang is a student at Lehigh University who has
been following Concerted's progress over many years. The core developers
are aligned with this project, and since the code base is simple, future
committers will be able to ramp up quickly, which mitigates the risk.
Besides, Ingenium Data Systems is launching a product based on Concerted
and will have all its salaried developers contribute to Concerted as part
of their job functions.

== Inexperience with Open Source ==
Most of the initial committers have experience working on open source
projects. In particular, Atri is an active member of many open source
projects.

== Homogeneous Developers ==
Although the initial core developers were based in India, the community now
consists of computer professionals from various parts of the world, so
diversity should not be an issue. In addition, we will document the
internals of the project in public-facing documents, which should allow
more contributors to join.

== Reliance on Salaried Developers ==
It is expected that Concerted development will occur on both salaried time
and volunteer time. Nupur and Amrish belong to Ingenium and are committed
to building this project along with their team. Atri, as the originator of
this project, will be actively working on it and is now pushing Concerted
into major data warehousing projects, since he is involved in the
architecture of data platforms. Developers are expected to contribute in
their volunteer time. In addition, we will be working with various open
source projects that will benefit from Concerted and will involve those
communities in Concerted's development as well. For example, Apache Tajo
has shown interest and will be supporting development of the project.

== Relationships with Other Apache Products ==
Concerted has some overlapping functionality with Apache Geode (incubating).
However, Geode is an in-memory key-value store, whereas Concerted is a
write-less read-many engine. Concerted will complement Geode and increase
the number of use cases Geode can support.

A major objective for Concerted is supporting OLAP workloads and data
warehouses with in-memory performance and highly performant reads and
joins. Concerted will collaborate with many open source projects, such as
Apache HAWQ (incubating), Apache Hive and Apache Tajo, to support their
OLAP workloads, enabling them to support a larger set of use cases with
better throughput. For example, a star schema in Hive will benefit from
keeping its dimension tables in Concerted: reads will be highly efficient
and scalable, and joins will be very fast. Tajo has a similar workload.
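
As a rough illustration of the star-schema case, the sketch below keeps a
small dimension table in memory as a hash map and probes it once per fact
row, which is the usual shape of a hash join against a dimension table. The
FactRow type, the sales_by_category function and the sample data are all
hypothetical; nothing here is taken from Concerted, Hive or Tajo.

```cpp
// Hypothetical sketch: an in-memory dimension table probed by fact rows.
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// One row of a fact table: a foreign key into the dimension plus a measure.
struct FactRow {
    int product_id;
    double amount;
};

// Join each fact row to the product dimension and sum amounts per category.
// Fact rows whose key is missing from the dimension are skipped (inner join).
inline std::unordered_map<std::string, double> sales_by_category(
    const std::vector<FactRow>& facts,
    const std::unordered_map<int, std::string>& product_dim) {
    std::unordered_map<std::string, double> totals;
    for (const FactRow& row : facts) {
        auto match = product_dim.find(row.product_id);  // in-memory lookup
        if (match == product_dim.end()) continue;
        totals[match->second] += row.amount;
    }
    return totals;
}

// Usage with made-up data: product 9 has no dimension entry and is dropped.
inline void join_demo() {
    std::unordered_map<int, std::string> product_dim = {{1, "books"},
                                                        {2, "toys"}};
    std::vector<FactRow> facts = {{1, 10.0}, {2, 5.0}, {1, 2.5}, {9, 100.0}};
    std::unordered_map<std::string, double> totals =
        sales_by_category(facts, product_dim);
    assert(totals.size() == 2);
    (void)totals;
}
```

The dimension table is read on every probe and never written during the
join, which is the read-dominated access pattern the proposal targets.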

Concerted will fit many other use cases in the Apache spectrum as well. For
example, Concerted can be used with Apache Geode for in-memory aggregation
indexing. Concerted can also be used with Apache Flink to stream real-time
data into memory, perform in-memory aggregation and then perform batch
processing for efficiency.


== An Excessive Fascination with the Apache Brand ==
We believe that the "Apache Way" governance model will provide additional
help to us in finding contributors and growing the community. The community
and development process will make this project more stable and help
establish ubiquitous APIs. In addition, Concerted is looking to support
multiple Apache projects in their use cases and accelerate their
performance while soliciting their support in development of the project.
We will not use the Apache brand for excessive branding or in any
commercial aspects of Concerted. The Apache brand will primarily be used
for community building.

= Documentation =
Public documents are currently in development and will be published soon.

= Initial Source =
The initial source is written in C++ and is under heavy development. It will
be restructured and released publicly.
We understand that there might be concerns about the GitHub source having
been developed by a single person, with no development happening after
2013. The source on GitHub is only the source initially developed as an
independent project, hence the limitation. However, because the project has
been on GitHub for a while, it has attracted attention, and people have
been using and developing it locally. For example, Ingenium Data Systems
took an interest in the project, developed it locally and used it in a
product they are going to release soon. The project now wants to bring
together all independent development efforts and attract people to grow
the community and project. We are currently in the process of updating the
GitHub repository and creating branches for all local development efforts.

= Source and Intellectual Property Submission Plan =

We intend the entire code base to be licensed under the Apache License,
Version 2.0.

= External Dependencies =
Currently, Concerted depends only on the g++ compiler and pthreads. pthreads
will be replaced by Boost in the next release.

= Cryptography =

N/A

= Required Resources =
== Mailing List ==
 * private@concerted.incubator.apache.org (moderated subscriptions)
 * commits@concerted.incubator.apache.org
 * dev@concerted.incubator.apache.org
 * issues@concerted.incubator.apache.org

== Git Repository ==

https://git-wip-us.apache.org/repos/asf/incubator-concerted.git

== Issue Tracking ==
Jira Concerted (CONCERTED)

== Other Resources ==
 * Continuous Integration
  * Jenkins
 * Wiki
  * cwiki.apache.org/confluence/display/CONCERTED

= Initial Committers =
 * Roman Shaposhnik <rvs AT apache DOT org>
 * Daniel Dai <daijy AT apache DOT org>
 * Jake Farrell <jfarrell AT apache DOT org>
 * Lars Hofhansl <larsh AT apache DOT org>
 * Julian Hyde <jhyde AT apache DOT org>
 * Chris Nauroth <cnauroth AT hortonworks DOT com>
 * Pavel Stehule <pavel DOT stehule AT gmail DOT com>
 * Amrish <amrishs AT ingeniumsys DOT com>
 * Nupur S <nupurs AT ingeniumsys DOT com>
 * Atri Sharma <atri AT apache DOT org>
 * Nishith Singhal <nishsinghal AT gmail DOT com>
 * Michael Down <michael AT dowuk DOT com>
 * Vijayakumar Ramdoss <vijayakumar DOT ramdoss AT emc DOT com>
 * Wang Albert <albertwang87 AT gmail DOT com>
 * Hans-Jurgen Schonig <postgres AT cybertec DOT at>
 * Kris Popat <krispopat AT apache DOT org>
 * Ayrton Gomesz <com DOT ayrton AT gmail DOT com>

= Affiliations =
 * Roman Shaposhnik (Pivotal)
 * Daniel Dai (HortonWorks)
 * Jake Farrell (Acquia)
 * Lars Hofhansl (Salesforce)
 * Julian Hyde (HortonWorks)
 * Chris Nauroth (HortonWorks)
 * Pavel Stehule (GoodData)
 * Amrish (Ingenium Data Systems)
 * Nupur S (Ingenium Data Systems)
 * Atri Sharma (Barclays)
 * Nishith Singhal (Wipro)
 * Michael Down (Barclays)
 * Vijayakumar Ramdoss (EMC)
 * Wang Albert (Lehigh University)
 * Hans-Jurgen Schonig (CyberTec)
 * Kris Popat (CETIS LLP)
 * Ayrton Gomesz (IQLabs)

The nominated mentors are employees of Pivotal, HortonWorks, Acquia, and
Salesforce.

 * Daniel Dai (HortonWorks)
 * Jake Farrell (Acquia)
 * Lars Hofhansl (Salesforce)
 * Julian Hyde (HortonWorks)
 * Chris Nauroth (HortonWorks)

= Sponsors =

== Champion ==

 * Roman Shaposhnik <rvs AT apache DOT org>

== Nominated Mentors ==

 * Daniel Dai <daijy AT apache DOT org>
 * Jake Farrell <jfarrell AT apache DOT org>
 * Lars Hofhansl <larsh AT apache DOT org>
 * Julian Hyde <jhyde AT apache DOT org>
 * Chris Nauroth <cnauroth AT hortonworks DOT com>

== Sponsoring Entity ==
Apache Incubator
