incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "ClimateModelDiagnosticAnalyzerProposal" by LeiPan
Date Thu, 26 Feb 2015 22:57:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "ClimateModelDiagnosticAnalyzerProposal" page has been changed by LeiPan:

- <?xml version="1.0" encoding="utf-8"?><!DOCTYPE article  PUBLIC '-//OASIS//DTD
DocBook XML V4.4//EN'  ''><article><articleinfo><title>AsterixDBProposal</title><revhistory><revision><revnumber>12</revnumber><date>2015-02-24
Ate Douma</revremark></revision><revision><revnumber>11</revnumber><date>2015-01-23
Vassilis to the initial committers</revremark></revision><revision><revnumber>8</revnumber><date>2015-01-20
Ted Dunning as Mentor</revremark></revision><revision><revnumber>7</revnumber><date>2015-01-20
Henry as a nominated mentor</revremark></revision><revision><revnumber>4</revnumber><date>2015-01-16
Inci to the initial committers</revremark></revision><revision><revnumber>3</revnumber><date>2015-01-15
of lists and wiki-names</revremark></revision><revision><revnumber>2</revnumber><date>2015-01-15
AsterixDB Proposal</title><section><title>Abstract</title><para>Apache
AsterixDB is a scalable big data management system (BDMS) that provides storage, management,
and query capabilities for large collections of semi-structured data. </para></section><section><title>Proposal</title><para>AsterixDB
is a big data management system (BDMS) whose features make it well-suited to needs such as web data
warehousing and social data storage and analysis. Feature-wise, AsterixDB has: </para><itemizedlist><listitem><para>A
NoSQL style data model (ADM) based on extending JSON with object database concepts. </para></listitem><listitem><para>An
expressive and declarative query language (AQL) for querying semi-structured data. </para></listitem><listitem><para>A
runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans.
</para></listitem><listitem><para>Partitioned LSM-based data storage
and indexing for efficient ingestion of newly arriving data. </para></listitem><listitem><para>Support
for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB.
</para></listitem><listitem><para>A rich set of primitive data types,
including support for spatial, temporal, and textual data. </para></listitem><listitem><para>Indexing
options that include B+ trees, R trees, and inverted keyword index support. </para></listitem><listitem><para>Basic
transactional (concurrency and recovery) capabilities akin to those of a NoSQL store. </para></listitem></itemizedlist></section><section><title>Background
and Rationale</title><para>In the world of relational databases, the need to tackle
data volumes that exceed the capabilities of a single server led to the development of “shared-nothing”
parallel database systems several decades ago. These systems spread data over a cluster based
on a partitioning strategy, such as hash partitioning, and queries are processed by employing
partitioned-parallel divide-and-conquer techniques. Since these systems are fronted by a high-level,
declarative language (SQL), their users are shielded from the complexities of parallel programming.
Parallel database systems have been an extremely successful application of parallel computing,
and quite a number of commercial products exist today. </para><para>In the distributed
systems world, the Web brought a need to index and query its huge content. SQL and relational
databases were not the answer, though shared-nothing clusters again emerged as the hardware
platform of choice. Google developed the Google File System (GFS) and MapReduce programming
model to allow programmers to store and process Big Data by writing a few user-defined functions.
The MapReduce framework applies these functions in parallel to data instances in distributed
files (map) and to sorted groups of instances sharing a common key (reduce) -- not unlike
the partitioned parallelism in parallel database systems. Apache's Hadoop MapReduce platform
is the most prominent implementation of this paradigm for the rest of the Big Data community.
On top of Hadoop and HDFS sit declarative languages like Pig and Hive that each compile down
to Hadoop MapReduce jobs. </para><para>The big Web companies were also challenged
by extreme user bases (100s of millions of users) and needed fast simple lookups and updates
to very large keyed data sets like user profiles. SQL databases were deemed either too expensive
or not scalable, so the “NoSQL movement” was born. The ASF now has HBase and Cassandra,
two popular key-value stores, in this space. MongoDB and Apache CouchDB are other open source
alternatives (document stores). </para><para>It is evident from the rapidly growing
popularity of &quot;NoSQL&quot; stores, as well as the strong demand for Big Data
analytics engines today, that there is a strong (and growing!) need to store, process, *and*
query large volumes of semi-structured data in many application areas. Until very recently,
developers have had to <emphasis>choose</emphasis> between using big data analytics
engines like Apache Hive or Apache Spark, which can do complex query processing and analysis
over HDFS-resident files, and flexible but low-function data stores like MongoDB or Apache
HBase. (The Apache Phoenix project, <ulink url=""/>, is a
recent SQL-over-HBase effort that aims to bridge between these choices.) </para><para>AsterixDB
is a highly scalable data management system that can store, index, and manage semi-structured
data, e.g., much like MongoDB, but it also supports a full-power query language with the expressiveness
of SQL (and more). Unlike analytics engines like Hive or Spark, it stores and manages data,
so AsterixDB can exploit its knowledge of data partitioning and the availability of indexes
to avoid always scanning data set(s) to process queries. Somewhat surprisingly, there is no
open source parallel database system (relational or otherwise) available to developers today
-- AsterixDB aims to fill this need. Since Apache is where the majority of today's most
important Big Data technologies live, the ASF seems like the obvious home for a system like
AsterixDB. </para></section><section><title>Current Status</title><para>The
current version of AsterixDB was co-developed by a team of faculty, staff, and students at
UC Irvine and UC Riverside. The project was initiated as a large NSF-sponsored project in
2009, the goal of which was to combine the best ideas from the parallel database world, the
then new Hadoop world, and the semi-structured (e.g., XML/JSON) data world in order to create
a next-generation BDMS. A first informal open source release was made four years later, in
June of 2013, under the Apache Software License 2.0. </para></section><section><title>Meritocracy</title><para>The
current developers are familiar with meritocratic open source development at Apache. Apache
was chosen specifically because we want to encourage this style of development for the project.
AsterixDB started as a university project and has developed into a community. A number of the
initial committers started contributing in academia and continue to actively participate and
contribute after graduation. We also seek to further develop the developer and user communities.
One ongoing way to broaden the community is through academic collaborations (currently
with IIT Mumbai in India and TU Berlin in Germany). During incubation we will also explicitly
seek increased industrial participation. </para><para>Some indicators of the effort's
development community and history can be found at: <ulink url=";sort=commits_12_mo"/>,
<ulink url=";sort=commits_12_mo"/>
</para></section><section><title>Core Developers</title><para>The
core developers of the project are diverse, although initially UC Irvine heavy (roughly 50%)
due to the project's origins at UCI. The other 50% are from other academic institutions (UC
Riverside and the Hebrew University in Jerusalem) and companies (Couchbase, IBM, KACST Saudi
Arabia, Oracle, Saudi Aramco, X15 Software). </para></section><section><title>Alignment</title><para>Apache
is, by far, the most natural home for taking the AsterixDB project forward. A large fraction
of today's top Big Data technologies have their homes in Apache, including Hadoop, YARN, Pig,
Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a significant gap -- the
parallel data management system gap -- that exists in the Big Data open source world. It is
well-aligned with a number of the Apache projects, e.g., it has strong support for accessing
and indexing external data in HDFS, and it uses YARN as an answer to basic cluster resource
management. AsterixDB also seeks to achieve an Apache-style development model; it is seeking
a broader community of contributors and users in order to achieve its full potential and value
to the Big Data community. </para><para>There are also a number of related Apache
projects and dependencies that will be mentioned below in the Relationships with Other Apache
products section. </para></section><section><title>Known Risks</title><section><title>Orphaned
products</title><para>Given the current level of intellectual investment in AsterixDB,
the risk of the project being abandoned is very small. The UCI/UCR faculty team leads are
highly incentivized to continue development since the database groups at UC Irvine and UC
Riverside are both reliant on AsterixDB as a platform for long-term graduate research projects.
UC San Diego is also beginning to contribute to the code base, and a collaboration involving
public health applications is forming with UCLA. The work on AsterixDB is managed via a mix
of mailing list discussions supplemented by weekly project status meetings which are summarized
on the mailing list. Typical (local plus Skype-in) attendance at the weekly status meetings
runs at about 20 active contributors. </para></section><section><title>Inexperience
with Open Source</title><para>AsterixDB and Hyracks were completely developed
in Open Source under the ALv2. The source code repositories, issue tracker, and mailing lists
are available on Google Code and discussions and decisions happen on the mailing lists (which
is necessary due to the geographic distribution of the current developers). </para><para>Also,
a few of the initial committers have contributed to Apache projects. Vinayak Borkar is a committer
on the Apache Helix and Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
and an IPMC member. Preston Carman and Steven Jacobs are committers on the Apache VXQuery
project. </para></section><section><title>Relationships with Other
Apache Products</title><para>Apache VXQuery is based on the Hyracks data-parallel
runtime, which is also included in the AsterixDB code base. </para><para>AsterixDB
is closely related to Apache Hadoop. Included in AsterixDB is support for accessing external
data in HDFS (and Hive formats), and resource management and system administration features
are in the process of being migrated to YARN. </para><para>AsterixDB's AQL query
facilities offer comparable query power to Apache's Pig and Hive systems for big data analytics.
AsterixDB differs in storing and indexing data and thus being able to quickly answer small
and medium queries without large HDFS data scans - thereby targeting a different class of
use cases. </para><para>AsterixDB's data storage and indexing facilities are similar
to those of HBase, but AsterixDB differs in being a much more complete and queryable BDMS
(not just a key-value style store). </para><para>AsterixDB's target use cases
are not in-memory processing or iterative algorithm support, making AsterixDB complementary
to the Apache Spark platform. (Spark interoperability is on our longer-term to-do wishlist.)
</para></section><section><title>Homogeneous Developers</title><para>As
mentioned before, the current community is already organizationally and geographically distributed
- and we would like to increase the heterogeneity. </para></section><section><title>Reliance
on Salaried Developers</title><para>Of the initial committers only 3 are full-time
UCI staff. The other committers are a mix of students, alumni who continue to contribute to
the effort, and individuals working with permission part-time (or in spare time) on this project.
</para></section><section><title>An Excessive Fascination with the
Apache Brand</title><para>We believe in the processes, systems, and framework
Apache has put in place. Apache is also known to foster a great community around their projects
and provide exposure. While brand is important, our fascination with it is not excessive.
We believe that the ASF is the right home for AsterixDB and that having AsterixDB inside of
the ASF will lead to a better long-term outcome for the Big Data community. </para></section><section><title>Documentation</title><para>Documentation
and publications related to AsterixDB can be found at <ulink url=""/>.
</para></section><section><title>Initial Source</title><para>Current
source resides in Google Code: <ulink url=""/> (query
language and upper system layers) and <ulink url=""/>
(dataflow runtime system and storage management libraries). </para></section><section><title>External
Dependencies</title><para>AsterixDB depends on a number of Apache projects: </para><itemizedlist><listitem><para>Ant
</para></listitem><listitem><para>Avro </para></listitem><listitem><para>ApacheDB
JDO </para></listitem><listitem><para>Commons </para></listitem><listitem><para>Derby
</para></listitem><listitem><para>Hadoop </para></listitem><listitem><para>Hive
</para></listitem><listitem><para>HTTPComponents </para></listitem><listitem><para>Jakarta
ORO </para></listitem><listitem><para>Maven </para></listitem><listitem><para>Tomcat
</para></listitem><listitem><para>Thrift </para></listitem><listitem><para>Velocity
</para></listitem><listitem><para>Wicket </para></listitem><listitem><para>Xerces
</para></listitem></itemizedlist><para>and other open source projects
(organized by license): </para><itemizedlist><listitem><para>ALv2:
</para><itemizedlist><listitem><para>Jackson </para></listitem><listitem><para>Google
Guava </para></listitem><listitem><para>Google Guice </para></listitem><listitem><para>JSON-simple
</para></listitem><listitem><para>BoneCP </para></listitem><listitem><para>Microsoft
Azure SDK </para></listitem><listitem><para>Netty </para></listitem><listitem><para>Rome
</para></listitem><listitem><para>JetS3t </para></listitem><listitem><para>Groovy
</para></listitem><listitem><para>Jettison </para></listitem><listitem><para>Plexus
</para></listitem><listitem><para>Datanucleus (JDO) </para></listitem><listitem><para>Jetty
</para></listitem><listitem><para>Twitter4J </para></listitem><listitem><para>Snappy-java
</para><itemizedlist><listitem><para>Antlr </para></listitem><listitem><para>ObjectWeb
ASM </para></listitem><listitem><para>Protobuf </para></listitem><listitem><para>JSCH
</para></listitem><listitem><para>JavaCC </para></listitem><listitem><para>Paranamer
</para></listitem><listitem><para>JLine </para></listitem><listitem><para>Stax
</para></listitem><listitem><para>StringTemplate </para></listitem><listitem><para>xmlEnc
</para><itemizedlist><listitem><para>AppAssembler </para></listitem><listitem><para>SimpleLog4J
1.0 </para><itemizedlist><listitem><para>Java Activation Framework
</para></listitem><listitem><para>Java Transactions </para></listitem><listitem><para>Java
Servlet API </para></listitem><listitem><para>Grizzly </para></listitem><listitem><para>gmbal
</para></listitem><listitem><para>Glassfish </para></listitem></itemizedlist></listitem><listitem><para>CDDL
1.1 </para><itemizedlist><listitem><para>Jersey </para></listitem><listitem><para>JAXB
Reference Implementation </para></listitem></itemizedlist></listitem><listitem><para>JSON
License </para><itemizedlist><listitem><para>JSON </para></listitem></itemizedlist></listitem><listitem><para>EPL
1.0 </para><itemizedlist><listitem><para>JUnit </para></listitem></itemizedlist></listitem><listitem><para>JDOM
License </para><itemizedlist><listitem><para>JDOM </para></listitem></itemizedlist></listitem><listitem><para>Public
Domain </para><itemizedlist><listitem><para>xz </para></listitem><listitem><para>AOPAlliance
all dependencies are managed using Apache Maven, none of the external libraries need to be
packaged in a source distribution. </para></section></section><section><title>Required
Resources</title><section><title>Developer and user mailing lists</title><itemizedlist><listitem><para><ulink
(with moderated subscriptions) </para></listitem><listitem><para><ulink
</para></listitem><listitem><para><ulink url=""></ulink>
</para></listitem><listitem><para><ulink url=""></ulink>
</para></listitem></itemizedlist><para>A git repository </para><para><ulink
url=""/> </para><para>A
JIRA issue tracker </para><para><ulink url=""/>
</para></section></section><section><title>Initial Committers</title><para>The
following is a list of the planned initial Apache committers (the active subset of the committers
for the current repository at Google Code). </para><itemizedlist><listitem><para>Abdullah
Alamoudi (<ulink url=""></ulink>) </para></listitem><listitem><para>Cameron
Samak (<ulink url=""></ulink>) </para></listitem><listitem><para>Chen
Li (<ulink url=""></ulink>) </para></listitem><listitem><para>Ian
Maxon (<ulink url=""></ulink>) </para></listitem><listitem><para>Inci
Cetindil (<ulink url=""></ulink>)
</para></listitem><listitem><para>Ildar Absalyamov (<ulink url=""></ulink>)
</para></listitem><listitem><para>Jianfeng Jia (<ulink url=""></ulink>)
</para></listitem><listitem><para>Keren Ouaknine (<ulink url=""></ulink>)
</para></listitem><listitem><para>Markus Dreseler (<ulink url=""></ulink>)
</para></listitem><listitem><para>Mike Carey (<ulink url=""></ulink>)
</para></listitem><listitem><para>Murtadha Hubail (<ulink url=""></ulink>)
</para></listitem><listitem><para>Pouria Pirzadeh (<ulink url=""></ulink>)
</para></listitem><listitem><para>Preston Carman (<ulink url=""></ulink>)
</para></listitem><listitem><para>Raman Grover (<ulink url=""></ulink>)
</para></listitem><listitem><para>Sattam Alsubaiee (<ulink url=""></ulink>)
</para></listitem><listitem><para>Steven Jacobs (<ulink url=""></ulink>)
</para></listitem><listitem><para>Taewoo Kim (<ulink url=""></ulink>)
</para></listitem><listitem><para>Till Westmann (<ulink url=""></ulink>)
</para></listitem><listitem><para>Vassilis Tsotras (<ulink url=""></ulink>)
</para></listitem><listitem><para>Vinayak Borkar (<ulink url=""></ulink>)
</para></listitem><listitem><para>Yingyi Bu (<ulink url=""></ulink>)
</para></listitem><listitem><para>Young-Seok Kim (<ulink url=""></ulink>)
</para></listitem><listitem><para>Zach Heilbron (<ulink url=""></ulink>)
Irvine </para><itemizedlist><listitem><para>Mike Carey </para></listitem><listitem><para>Chen
Li </para></listitem><listitem><para>Ian Maxon </para></listitem><listitem><para>Inci
Cetindil </para></listitem><listitem><para>Yingyi Bu </para></listitem><listitem><para>Raman
Grover </para></listitem><listitem><para>Pouria Pirzadeh </para></listitem><listitem><para>Young-Seok
Kim </para></listitem><listitem><para>Cameron Samak </para></listitem><listitem><para>Taewoo
Kim </para></listitem><listitem><para>Jianfeng Jia </para></listitem><listitem><para>Murtadha
Hubail </para></listitem><listitem><para>Markus Dreseler </para></listitem></itemizedlist><para>UC
Riverside </para><itemizedlist><listitem><para>Ildar Absalyamov </para></listitem><listitem><para>Preston
Carman </para></listitem><listitem><para>Steven Jacobs </para></listitem><listitem><para>Vassilis
Tsotras </para></listitem></itemizedlist><para>Hebrew University </para><itemizedlist><listitem><para>Keren
Ouaknine </para></listitem></itemizedlist><para>Oracle </para><itemizedlist><listitem><para>Till
Westmann </para></listitem></itemizedlist><para>X15 Software </para><itemizedlist><listitem><para>Vinayak
Borkar </para></listitem><listitem><para>Zach Heilbron </para></listitem></itemizedlist><para>KACST
Saudi Arabia </para><itemizedlist><listitem><para>Sattam Alsubaiee
</para></listitem></itemizedlist><para>Saudi Aramco </para><itemizedlist><listitem><para>Abdullah
Alamoudi </para></listitem></itemizedlist><para>Carey, Li, and Maxon
are full-time UCI (UC Irvine) staff, Tsotras is full-time UCR (UC Riverside) staff, with the
remaining UCI and UCR affiliates being students. The non-UC committers are a mix of alumni
who continue to contribute to the effort and individuals working with permission part-time
(or in spare time) on this project. </para></section><section><title>Sponsors</title><section><title>Champion</title><para>Chris
Mattmann (NASA/JPL) </para></section><section><title>Nominated Mentors</title><itemizedlist><listitem><para>Henry
Saputra </para></listitem><listitem><para>Jochen Wiedmann </para></listitem><listitem><para>Ted
Dunning </para></listitem><listitem><para>Ate Douma </para></listitem></itemizedlist></section><section><title>Sponsoring
Entity</title><para>The Apache Incubator </para></section></section></section></article>
+ Describe ClimateModelDiagnosticAnalyzerProposal here.
