incubator-general mailing list archives

From Priyank Ashok Rastogi <priyank.rast...@huawei.com>
Subject RE: [VOTE] Accept Mnemonic into the Apache Incubator
Date Wed, 02 Mar 2016 05:49:40 GMT
+1 (non-binding)

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: 29 February 2016 23:08
To: general@incubator.apache.org
Subject: [VOTE] Accept Mnemonic into the Apache Incubator

Hi folks,

OK the discussion is now completed. Please VOTE to accept Mnemonic into the Apache Incubator.
I’ll leave the VOTE open for at least the next 72 hours, with hopes to close it Thursday
the 3rd of March, 2016 at 10am PT.
https://wiki.apache.org/incubator/MnemonicProposal

[ ] +1 Accept Mnemonic as an Apache Incubator podling.
[ ] +0 Abstain.
[ ] -1 Don’t accept Mnemonic as an Apache Incubator podling because..

Of course, I am +1 on this. Please note VOTEs from Incubator PMC members are binding but all
are welcome to VOTE!

Regards,

Patrick

--------------------
= Mnemonic Proposal =
=== Abstract ===
Mnemonic is a Java-based non-volatile memory library for in-place structured data processing
and computing. It is a solution for generic object and block persistence on heterogeneous
block and byte-addressable devices, such as DRAM, persistent memory, NVMe, SSD, and cloud
network storage.

=== Proposal ===
Mnemonic is an in-memory, in-place structured data persistence library for Java-based applications
and frameworks. It provides unified interfaces for data manipulation on heterogeneous block/byte-addressable
devices, such as DRAM, persistent memory, NVMe, SSD, and cloud network devices.

The design motivation for this project is to create a non-volatile programming paradigm for
in-memory data object persistence, in-memory data objects caching, and JNI-less IPC.
Mnemonic simplifies the usage of data object caching, persistence, and JNI-less IPC for massive
object oriented structural datasets.

Mnemonic defines Non-Volatile Java objects that store data fields in persistent memory and
storage. At program runtime, only methods and volatile fields are instantiated on the
Java heap; Non-Volatile data fields are accessed directly via GET/SET operations to and from
persistent memory and storage. Mnemonic avoids SerDes and significantly reduces the amount of
garbage on the Java heap.
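
The GET/SET access pattern described above can be sketched in plain Java. This is only an
illustration under stated assumptions, not Mnemonic's API: the class name MappedPoint, its
field layout, and the use of a memory-mapped file as a stand-in for persistent memory are all
hypothetical. The point it shows is that the data fields live outside the Java heap and are
read and written in place, with no serialization step.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; this is NOT a Mnemonic class.
// The two long fields live in a memory-mapped file, not in the Java heap.
final class MappedPoint {
    private static final int X_OFFSET = 0;           // bytes 0..7 hold x
    private static final int Y_OFFSET = Long.BYTES;  // bytes 8..15 hold y
    private static final int SIZE = 2 * Long.BYTES;

    private final MappedByteBuffer region;

    private MappedPoint(MappedByteBuffer region) {
        this.region = region;
    }

    // Maps the backing file, creating it (and growing it to SIZE) if needed.
    static MappedPoint open(Path file) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            return new MappedPoint(channel.map(FileChannel.MapMode.READ_WRITE, 0, SIZE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // GET/SET go straight to the mapped region; nothing is serialized.
    long getX() { return region.getLong(X_OFFSET); }
    void setX(long value) { region.putLong(X_OFFSET, value); }
    long getY() { return region.getLong(Y_OFFSET); }
    void setY(long value) { region.putLong(Y_OFFSET, value); }

    // Forces changes out to the storage device.
    void flush() { region.force(); }
}
```

Opening the same file a second time yields an object whose getters return the values written
earlier, which is the property the proposal relies on when it says SerDes can be avoided.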

Major features of Mnemonic:
* Provides an abstraction for using heterogeneous block/byte-addressable devices as a whole
(e.g., DRAM, persistent memory, NVMe, SSD, HD, cloud network storage).

* Provides seamless support for object-oriented design and programming without the burden of
transforming object data into a different form.

* Avoids the object data serialization/de-serialization for data retrieval, caching and storage.

* Reduces the consumption of on-heap memory, which in turn reduces and stabilizes Java Garbage
Collection (GC) pauses for latency-sensitive applications.

* Overcomes current limitations of Java GC to manage much larger memory resources for massive
dataset processing and computing.

* Supports easy migration of the data usage model from traditional NVMe/SSD/HD to non-volatile
memory.

* Uses a lazy loading mechanism to avoid unnecessary memory consumption when some data is not
needed for computing immediately.

* Bypasses JNI calls for the interaction between a Java runtime application and its native code.

* Provides an allocation-aware auto-reclaim mechanism to prevent external memory resource
leaks.
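
As a rough illustration of the lazy loading feature listed above, the following Java sketch
defers a load until the first read and then reuses the result. LazyField is a hypothetical
name chosen for this example and is not part of Mnemonic; the loader stands in for whatever
would fetch the data from persistent memory or storage.

```java
import java.util.function.Supplier;

// Hypothetical illustration; this is NOT a Mnemonic class.
// The value is fetched on first access only, so untouched data costs no memory.
final class LazyField<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazyField(Supplier<T> loader) {
        this.loader = loader;
    }

    // Runs the loader at most once, then returns the cached value.
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }
}
```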


=== Background ===
Big Data and Cloud applications increasingly require both high throughput and low latency
processing. Java-based applications targeting the Big Data and Cloud space should be tuned
for better throughput, lower latency, and more predictable response time.
Typically, there are some issues that impact BigData applications'
performance and scalability:

1) The Complexity of Data Transformation/Organization: In most cases, during data processing,
applications use their own complicated data caching mechanisms to SerDes data objects, spill
to different storage, and evict large amounts of data. Some data objects contain complex
values and structures that make data organization much more difficult. Loading and then
parsing/decoding datasets from storage consumes substantial system resources and computation
power.

2) Lack of Caching, Burst Temporary Object Creation/Destruction Causes Frequent Long GC Pauses:
Big Data computing/syntax generates a large number of temporary objects during processing, e.g.,
from lambdas, SerDes, copying, etc. This triggers frequent long Java GC pauses to scan references,
update reference lists, and blindly copy live objects from one memory location to another.

3) The Unpredictable GC Pause: Latency-sensitive applications, such as databases, search
engines, web queries, and real-time/streaming computing, need latency and request-response times
kept under control. But current Java GC does not provide predictable GC activity when managing
large on-heap memory.

4) High JNI Invocation Cost: JNI calls are expensive, yet high-performance applications usually
try to leverage native code to improve performance. JNI calls need to convert Java
objects into something that C/C++ can understand. In addition, some comprehensive native code
needs to communicate with the Java-based application, which causes frequent JNI calls along
with stack marshalling.

The Mnemonic project provides a solution to the above issues and performance bottlenecks for
structured data processing and computing.
It also simplifies massive data handling with much reduced GC activity.

=== Rationale ===
There is a strong need for a cohesive, easy-to-use non-volatile programming model for unified
management and allocation of heterogeneous memory resources. The Mnemonic project provides a reusable
and flexible framework that can accommodate other special types of memory/block devices for better
performance without changing client code.

Most of the BigData frameworks (e.g., Apache Spark™, Apache™ Hadoop®, Apache HBase™,
Apache Flink™, Apache Kafka™, etc.) have their own complicated memory management modules
for caching and checkpointing. Many of these approaches increase complexity and make the code
error-prone to maintain.

We have observed heavy overheads from data parsing, SerDes, packing/unpacking, and
encoding/decoding during data loading, storage, checkpointing, caching, marshalling, and transfer.
Mnemonic provides a generic in-memory persistent object model to address those overheads for better
performance. In addition, it manages its in-memory persistent objects and blocks in the way
that GC does, which means their underlying memory resources can be reclaimed without
being explicitly released.
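
The GC-like reclamation described above can be illustrated with the JDK's
java.lang.ref.Cleaner. This is a sketch under stated assumptions rather than Mnemonic's own
mechanism: it requires Java 9 or later (the proposal itself lists JDK 8), TrackedBlock and
OUTSTANDING_BYTES are hypothetical names, and a byte counter stands in for a real external
allocation. It shows a resource that is released either explicitly or, if the owner forgets,
after the owning object becomes unreachable.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; this is NOT a Mnemonic class.
final class TrackedBlock implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // Stand-in for external memory: counts bytes that have not been released yet.
    static final AtomicLong OUTSTANDING_BYTES = new AtomicLong();

    // The cleanup action must not reference the TrackedBlock itself,
    // otherwise the block could never become unreachable.
    private static final class Release implements Runnable {
        private final long size;

        Release(long size) {
            this.size = size;
        }

        @Override
        public void run() {
            OUTSTANDING_BYTES.addAndGet(-size);
        }
    }

    private final Cleaner.Cleanable cleanable;

    TrackedBlock(long size) {
        OUTSTANDING_BYTES.addAndGet(size);
        // Runs Release when this object becomes phantom reachable,
        // or earlier if close() is called.
        this.cleanable = CLEANER.register(this, new Release(size));
    }

    // Explicit release; the Cleaner guarantees the action runs at most once.
    @Override
    public void close() {
        cleanable.clean();
    }
}
```

Callers that do call close() get deterministic release; callers that do not still have the
resource returned eventually, at a time chosen by the garbage collector.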

Some existing Big Data applications suffer from poor Java GC behaviors when they process their
massive unstructured datasets.  Those behaviors either cause very long stop-the-world GC pauses
or take significant system resources during computing which impact throughput and incur significant
perceivable pauses for interactive analytics.

More and more compute-intensive Big Data applications rely on JNI
to offload their computing tasks to native code, which dramatically increases the cost of JNI
invocation and IPC.
Mnemonic provides a mechanism to communicate with native code directly through in-place object
data updates, avoiding complex object data type conversion and stack marshalling. In addition,
this project can be extended to support various locks for coordinating threads between Java code
and native code.

=== Initial Goals ===
Our initial goal is to bring Mnemonic into the ASF and transition the engineering and governance
processes to the "Apache Way." We would like to foster a collaborative development model
that closely aligns with current and future industry memory and storage technologies.

Another important goal is to encourage efforts to integrate a non-volatile programming model
into data-centric processing/analytics frameworks and applications (e.g., Apache Spark™, Apache
HBase™, Apache Flink™, Apache™ Hadoop®, Apache Cassandra™, etc.).

We expect the Mnemonic project to continuously develop new functionality in an open, community-driven
way. We envision accelerating innovation under ASF governance in order to meet the requirements
of a wide variety of use cases for in-memory non-volatile and volatile data caching programming.

=== Current Status ===
The Mnemonic project is available in Intel’s internal repository and is managed by its designers
and developers. It is also temporarily hosted on GitHub for general viewing: https://github.com/NonVolatileComputing/Mnemonic.git

We have integrated this project with Apache Spark™ 1.5.0 and obtained a 2X performance improvement
for the Spark™ MLlib k-means workload. In our experiments we also observed the expected benefits of
removing SerDes and a 40% reduction in total GC pause time.

==== Meritocracy ====
Mnemonic was originally created by Gang (Gary) Wang and Yanping Wang in early 2015. The initial
committers are the current Mnemonic R&D team members from the US, China, and India Big Data
Technologies Group at Intel. This group will form a base for a much broader community to collaborate
on this code base.

We intend to radically expand the initial developer and user community by running the project
in accordance with the "Apache Way." Users and new contributors will be treated with respect
and welcomed. By participating in the community and providing quality patches/support that
move the project forward, they will earn merit. They also will be encouraged to provide non-code
contributions (documentation, events, community management, etc.) and will gain merit for
doing so. Those with a proven support and quality track record will be encouraged to become
committers.

==== Community ====
If Mnemonic is accepted for incubation, the primary initial goal is to transition the core community
towards embracing the Apache Way of project governance. We would solicit major existing contributors
to become committers on the project from the start.

==== Core Developers ====
Mnemonic core developers are all skilled software developers and system performance engineers
at Intel Corp with years of experience in their fields. They have contributed a great deal of code to
Apache projects.
PMC members and experienced committers from Apache Spark™, Apache HBase™, Apache Phoenix™,
and Apache™ Hadoop® have been working with us on this project's open source efforts.

=== Alignment ===
The initial code base targets data-centric processing and analysis in general. Mnemonic
has been building connections and integrations with Apache projects and other projects.

We believe Mnemonic will evolve into a promising project for real-time processing,
in-memory streaming analytics and more, along with current and future server platforms
that use persistent memory as base storage devices.

=== Known Risks ===
==== Orphaned products ====
Intel’s Big Data Technologies Group is actively working with the community on integrating this
project into Big Data frameworks and applications.
We are continuously adding new concepts and code to this project and supporting new use cases
and features for the Apache Big Data ecosystem.

The project contributors are leading contributors of Hadoop-based technologies and have a
long standing in the Hadoop community. As we are addressing major Big Data processing performance
issues, there is minimal risk of this work becoming non-strategic and unsupported.

Our contributors are confident that a larger community will be formed within the project in
a relatively short period of time.

==== Inexperience with Open Source ====
This project has long-standing, experienced mentors and interested contributors from Apache
Spark™, Apache HBase™, Apache Phoenix™, and Apache™ Hadoop® to help us move through the
open source process. We are actively working with experienced Apache community PMC members and
committers to improve our project and test it further.

==== Homogeneous Developers ====
All initial committers and interested contributors are employed at Intel. As this is an infrastructure
memory project, a wide range of Apache projects are interested in an innovative memory
project that fits large persistent memory and storage devices. Various Apache projects
such as Apache Spark™, Apache HBase™, Apache Phoenix™, Apache Flink™, Apache Cassandra™,
etc. can take good advantage of this project to overcome serialization/de-serialization, Java
GC, and caching issues. We expect a wide range of interest to be generated after we open
source this project at Apache.

==== Reliance on Salaried Developers ====
All developers are paid by their employers to contribute to this project. We welcome all others
to contribute to this project after it is open sourced.

==== Relationships with Other Apache Product ====
Relationship with Apache™ Arrow:
Arrow's columnar data layout allows great use of CPU caches & SIMD. It places all data
that is relevant to a column operation in a compact format in memory.

Mnemonic directly puts whole business object graphs on external heterogeneous storage
media, e.g. off-heap, SSD. It does not require developers to normalize the structures of their
data object graphs for caching, checkpointing, or storing. Compared to traditional approaches,
Mnemonic applications can avoid indexing and joining datasets.

Mnemonic can leverage Arrow to transparently re-layout qualified data objects or create special
containers that are able to efficiently hold those data records in columnar form, as one of
its major performance optimization constructs.

Mnemonic can be integrated into various Big Data and Cloud frameworks and applications.
We are currently working on several Apache projects with Mnemonic:
For Apache Spark™ we are integrating Mnemonic to improve:
a) Local checkpoints
b) Memory management for caching
c) Persistent memory datasets input
d) Non-Volatile RDD operations
The best use case for Apache Spark™ computing is one where the input data is stored in the form of
Mnemonic native storage to avoid caching its row data for iterative processing. Moreover,
Spark applications can leverage Mnemonic to perform data transformation in persistent or non-persistent
memory without SerDes.

For Apache™ Hadoop®, we are integrating HDFS Caching with Mnemonic instead of mmap. This
will take advantage of persistent memory related features. We also plan to evaluate integrating
NameNode EditLog and FSImage persistent data into the Mnemonic persistent memory area.

For Apache HBase™, we are using Mnemonic for BucketCache and evaluating performance improvements.

We expect Mnemonic will be further developed and integrated into many Apache BigData projects
and so on, to enhance memory management solutions for much improved performance and reliability.

==== An Excessive Fascination with the Apache Brand ====
While we expect the Apache brand to help attract more contributors, our interest in starting this
project is based on the factors mentioned in the Rationale section.

We would like Mnemonic to become an Apache project to further foster a healthy community of
contributors and consumers in BigData technology R&D areas. Since Mnemonic can directly
benefit many Apache projects and solves major performance problems, we expect the Apache Software
Foundation to increase interaction with the larger community as well.

=== Documentation ===
The documentation is currently available at Intel and will be posted
under: https://mnemonic.incubator.apache.org/docs

=== Initial Source ===
Initial source code is temporarily hosted on GitHub for general viewing:
https://github.com/NonVolatileComputing/Mnemonic.git
It will be moved to Apache http://git.apache.org/ once the podling is established.

The initial source is written in Java (88%), mixed with JNI C code (11%) and shell
script (1%) for the underlying native allocation libraries.

=== Source and Intellectual Property Submission Plan ===
As soon as Mnemonic is approved to join the Incubator, the source code will be transitioned via
the Software Grant Agreement onto ASF infrastructure and in turn made available under the Apache
License, version 2.0.

=== External Dependencies ===
The required external dependencies are all under the Apache License or other compatible licenses.
Note: The runtime dependencies of Mnemonic are all declared as Apache 2.0; the GNU-licensed
components are used for Mnemonic build and deployment. The Mnemonic JNI libraries are built
using the GNU tools.

maven and its plugins (http://maven.apache.org/) [Apache 2.0]
JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]
Nvml (http://pmem.io) [optional] [Open Source]
PMalloc (https://github.com/bigdata-memory/pmalloc) [optional] [Apache 2.0]

Build and test dependencies:
org.testng.testng v6.8.17 (http://testng.org) [Apache 2.0]
org.flowcomputing.commons.commons-resgc v0.8.7 [Apache 2.0]
org.flowcomputing.commons.commons-primitives v.0.6.0 [Apache 2.0]
com.squareup.javapoet v1.3.1-SNAPSHOT [Apache 2.0]
JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]

=== Cryptography ===
Project Mnemonic does not use cryptography itself. However, Hadoop projects use standard APIs
and tools for SSH and SSL communication where necessary.

=== Required Resources ===
We request that the following resources be created for the project to use:

==== Mailing lists ====
private@mnemonic.incubator.apache.org (moderated subscriptions)
commits@mnemonic.incubator.apache.org
dev@mnemonic.incubator.apache.org

==== Git repository ====
https://github.com/apache/incubator-mnemonic

==== Documentation ====
https://mnemonic.incubator.apache.org/docs/

==== JIRA instance ====
https://issues.apache.org/jira/browse/mnemonic

=== Initial Committers ===
* Gang (Gary) Wang (gang1 dot wang at intel dot com)

* Yanping Wang (yanping dot wang at intel dot com)

* Uma Maheswara Rao G (umamahesh at apache dot org)

* Kai Zheng (drankye at apache dot org)

* Rakesh Radhakrishnan Potty  (rakeshr at apache dot org)

* Sean Zhong  (seanzhong at apache dot org)

* Henry Saputra  (hsaputra at apache dot org)

* Hao Cheng (hao dot cheng at intel dot com)

=== Additional Interested Contributors ===
* Debo Dutta (dedutta at cisco dot com)

* Liang Chen (chenliang613 at Huawei dot com)

=== Affiliations ===
* Gang (Gary) Wang, Intel

* Yanping Wang, Intel

* Uma Maheswara Rao G, Intel

* Kai Zheng, Intel

* Rakesh Radhakrishnan Potty, Intel

* Sean Zhong, Intel

* Henry Saputra, Independent

* Hao Cheng, Intel

=== Sponsors ===
==== Champion ====
Patrick Hunt

==== Nominated Mentors ====
* Patrick Hunt <phunt at apache dot org> - Apache IPMC member

* Andrew Purtell <apurtell at apache dot org> - Apache IPMC member

* James Taylor <jamestaylor at apache dot org> - Apache IPMC member

* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member

==== Sponsoring Entity ====
Apache Incubator PMC

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org
