incubator-general mailing list archives

From Till Westmann <westm...@gmail.com>
Subject Re: [PROPOSAL] Apache AsterixDB Incubator
Date Thu, 15 Jan 2015 06:09:01 GMT
Hi,

if you read the proposal all the way to the end you will see that - while we do have some
community and code - we don’t have mentors.
So if you like the proposal, please volunteer.

Cheers,
Till

> On Jan 14, 2015, at 6:21 PM, Mattmann, Chris A (3980) <chris.a.mattmann@jpl.nasa.gov> wrote:
> 
> Hi Folks,
> 
> I am pleased to bring forth the Apache AsterixDB proposal to the
> Apache Incubator as Champion, working in collaboration with the
> team. Please find the wiki proposal here:
> 
> https://wiki.apache.org/incubator/AsterixDBProposal
> 
> 
> Full text of the proposal is below. Please discuss and enjoy. I’ll
> leave the discussion open for a week, and then look to call a VOTE
> hopefully end of next week if all is well.
> 
> Cheers!
> Chris Mattmann
> 
> =============================================================
> Apache AsterixDB Proposal
> 
> Abstract
> 
> Apache AsterixDB is a scalable big data management system (BDMS) that
> provides storage, management, and query capabilities for large
> collections of semi-structured data.
> 
> Proposal
> 
> AsterixDB is a big data management system (BDMS) with a feature set
> that makes it well-suited to needs such as web data warehousing and
> social data storage and analysis. Feature-wise, AsterixDB has:
> 
> * A NoSQL style data model (ADM) based on extending JSON with object
>  database concepts.
> * An expressive and declarative query language (AQL) for querying
>  semi-structured data.
> * A runtime query execution engine, Hyracks, for partitioned-parallel
>  execution of query plans.
> * Partitioned LSM-based data storage and indexing for efficient
>  ingestion of newly arriving data.
> * Support for querying and indexing external data (e.g., in HDFS) as
>  well as data stored within AsterixDB.
> * A rich set of primitive data types, including support for spatial,
>  temporal, and textual data.
> * Indexing options that include B+ trees, R trees, and inverted
>  keyword index support.
> * Basic transactional (concurrency and recovery) capabilities akin to
>  those of a NoSQL store.
> 
> 
> Background and Rationale
> 
> In the world of relational databases, the need to tackle data volumes
> that exceed the capabilities of a single server led to the
> development of “shared-nothing” parallel database systems several
> decades ago. These systems spread data over a cluster based on a
> partitioning strategy, such as hash partitioning, and queries are
> processed by employing partitioned-parallel divide-and-conquer
> techniques. Since these systems are fronted by a high-level,
> declarative language (SQL), their users are shielded from the
> complexities of parallel programming. Parallel database systems have
> been an extremely successful application of parallel computing, and
> quite a number of commercial products exist today.
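
To make the partitioning idea above concrete, here is a minimal Python sketch of hash partitioning followed by a partitioned-parallel count. It is an illustration only, not AsterixDB code: the function names, the record layout, and the use of crc32 as the hash are assumptions, and the cluster nodes are simulated by a plain loop.

```python
from collections import Counter
from zlib import crc32


def partition_of(key, num_partitions):
    """Map a key to a partition using a stable hash (crc32 is an arbitrary choice)."""
    return crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    """Spread records over partitions so that equal keys share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        target = partition_of(record[key_field], num_partitions)
        partitions[target].append(record)
    return partitions


def local_count(partition, key_field):
    """Work done independently on one partition (one simulated node)."""
    return Counter(record[key_field] for record in partition)


def parallel_count(records, key_field, num_partitions):
    """Divide and conquer: count per partition, then merge the partial results."""
    merged = Counter()
    for partition in hash_partition(records, key_field, num_partitions):
        merged.update(local_count(partition, key_field))
    return dict(merged)
```

Because every record with a given key lands in the same partition, each partial count is already final for its keys, and the merge step only has to combine disjoint results.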
> 
> In the distributed systems world, the Web brought a need to index and
> query its huge content. SQL and relational databases were not the
> answer, though shared-nothing clusters again emerged as the hardware
> platform of choice. Google developed the Google File System (GFS) and
> MapReduce programming model to allow programmers to store and process
> Big Data by writing a few user-defined functions. The MapReduce
> framework applies these functions in parallel to data instances in
> distributed files (map) and to sorted groups of instances sharing a
> common key (reduce) -- not unlike the partitioned parallelism in
> parallel database systems. Apache's Hadoop MapReduce platform is the
> most prominent implementation of this paradigm for the rest of the
> Big Data community. On top of Hadoop and HDFS sit declarative
> languages like Pig and Hive that each compile down to Hadoop
> MapReduce jobs.
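
The map and reduce steps described above can be sketched in a few lines of single-process Python. This is a toy model under stated assumptions, not Hadoop's API: run_mapreduce, map_fn, and reduce_fn are hypothetical names, and the sort stands in for the shuffle phase that a real framework performs across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """User-defined map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """User-defined reduce: combine all values that share one key."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map to each input, sort the pairs by key, then reduce each key group."""
    pairs = [pair for item in inputs for pair in map_fn(item)]
    pairs.sort(key=itemgetter(0))  # stands in for the distributed shuffle/sort
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]
```

Calling run_mapreduce(lines, map_fn, reduce_fn) on a list of text lines yields one (word, count) pair per distinct word, ordered by word.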
> 
> The big Web companies were also challenged by extreme user bases
> (100s of millions of users) and needed fast simple lookups and
> updates to very large keyed data sets like user profiles. SQL
> databases were deemed either too expensive or not scalable, so the
> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> popular key-value stores, in this space. MongoDB and Couchbase are
> other open source alternatives (document stores).
> 
> It is evident from the rapidly growing popularity of "NoSQL" stores,
> as well as the strong demand for Big Data analytics engines today,
> that there is a strong (and growing!) need to store, process, *and*
> query large volumes of semi-structured data in many application
> areas. Until very recently, developers have had to "choose" between
> using big data analytics engines like Apache Hive or Apache Spark,
> which can do complex query processing and analysis over HDFS-resident
> files, and flexible but low-function data stores like MongoDB or
> Apache HBase. (The Apache Phoenix project,
> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> aims to bridge between these choices.)
> 
> AsterixDB is a highly scalable data management system that can store,
> index, and manage semi-structured data, e.g., much like MongoDB, but
> it also supports a full-power query language with the expressiveness
> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> stores and manages data, so AsterixDB can exploit its knowledge of
> data partitioning and the availability of indexes to avoid always
> scanning data set(s) to process queries. Somewhat surprisingly, there
> is no open source parallel database system (relational or otherwise)
> available to developers today -- AsterixDB aims to fill this need.
> Since Apache is where the majority of today's most important Big
> Data technologies live, the ASF seems like the obvious home for a
> system like AsterixDB.
> 
> Current Status
> 
> The current version of AsterixDB was co-developed by a team of
> faculty, staff, and students at UC Irvine and UC Riverside. The
> project was initiated as a large NSF-sponsored project in 2009, the
> goal of which was to combine the best ideas from the parallel
> database world, the then new Hadoop world, and the semi-structured
> (e.g., XML/JSON) data world in order to create a next-generation
> BDMS. A first informal open source release was made four years later,
> in June of 2013, under the Apache Software License 2.0.
> 
> 
> Meritocracy
> 
> The current developers are familiar with meritocratic open source
> development at Apache. Apache was chosen specifically because we want
> to encourage this style of development for the project.
> 
> 
> Community
> 
> While AsterixDB started as a university project, it has developed into
> a community. A number of the initial committers started contributing
> in academia and continue to actively participate and contribute after
> graduation. We seek to further develop the developer and user
> communities. One ongoing way of broadening the community is through
> academic collaborations (currently with IIT Mumbai in India and TU
> Berlin in Germany). During incubation we will also explicitly seek
> increased industrial participation.
> 
> Some indicators of the effort's development community and history can
> be found at:
> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
> 
> 
> Core Developers
> 
> The core developers of the project are diverse, although initially UC
> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> other 50% are from other academic institutions (UC Riverside and the
> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
> 
> 
> Alignment
> 
> Apache is, by far, the most natural home for taking the AsterixDB
> project forward. A large fraction of today's top Big Data
> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> significant gap -- the parallel data management system gap -- that
> exists in the Big Data open source world. It is well-aligned with a
> number of the Apache projects, e.g., it has strong support for
> accessing and indexing external data in HDFS, and it uses YARN as an
> answer to basic cluster resource management. AsterixDB also seeks to
> achieve an Apache-style development model; it is seeking a broader
> community of contributors and users in order to achieve its full
> potential and value to the Big Data community.
> 
> There are also a number of related Apache projects and dependencies
> that will be mentioned below in the Relationships with Other Apache
> products section.
> 
> 
> Known Risks
> 
> Orphaned products
> 
> Given the current level of intellectual investment in AsterixDB, the
> risk of the project being abandoned is very small. The UCI/UCR
> faculty team leads are highly incentivized to continue development
> since the database groups at UC Irvine and UC Riverside are both
> reliant on AsterixDB as a platform for long-term graduate research
> projects. UC San Diego is also beginning to contribute to the code
> base, and a collaboration involving public health applications is
> forming with UCLA. The work on AsterixDB is managed via a mix of
> mailing list discussions supplemented by weekly project status
> meetings which are summarized on the mailing list. Typical (local
> plus Skype-in) attendance at the weekly status meetings runs at about
> 20 active contributors.
> 
> 
> Inexperience with Open Source
> 
> AsterixDB and Hyracks were completely developed in Open Source under
> the ASL 2.0. The source code repositories, issue tracker, and mailing
> lists are available on Google Code and discussions and decisions
> happen on the mailing lists (which is necessary due to the geographic
> distribution of the current developers).
> 
> Also, a few of the initial committers have contributed to Apache
> projects. Vinayak Borkar is a committer on the Apache Helix and
> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> and an IPMC member. Preston Carman and Steven Jacobs are committers
> on the Apache VXQuery project.
> 
> 
> Relationships with Other Apache Products
> 
> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> is also included in the AsterixDB code base.
> 
> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> is support for accessing external data in HDFS (and Hive formats),
> and resource management and system administration features are in the
> process of being migrated to YARN.
> 
> AsterixDB's AQL query facilities offer comparable query power to
> Apache's Pig and Hive systems for big data analytics. AsterixDB
> differs in storing and indexing data and thus being able to quickly
> answer small and medium queries without large HDFS data scans -
> thereby targeting a different class of use cases.
> 
> AsterixDB's data storage and indexing facilities are similar to those
> of HBase, but AsterixDB differs in being a much more complete and
> queryable BDMS (not just a key-value style store).
> 
> AsterixDB's target use cases are not in-memory processing or
> iterative algorithm support, making AsterixDB complementary to the
> Apache Spark platform. (Spark interoperability is on our longer-term
> to-do wishlist.)
> 
> 
> Homogeneous Developers
> 
> As mentioned before, the current community is already organizationally
> and geographically distributed - and we would like to increase the
> heterogeneity.
> 
> 
> Reliance on Salaried Developers
> 
> Of the initial committers only 3 are full-time UCI staff. The other
> committers are a mix of students, alumni who continue to contribute
> to the effort, and individuals working with permission part-time (or
> in spare time) on this project.
> 
> 
> An Excessive Fascination with the Apache Brand
> 
> We believe in the processes, systems, and framework Apache has put in
> place. Apache is also known to foster a great community around their
> projects and provide exposure. While brand is important, our
> fascination with it is not excessive. We believe that the ASF is the
> right home for AsterixDB and that having AsterixDB inside of the ASF
> will lead to a better long-term outcome for the Big Data community.
> 
> 
> Documentation
> 
> Documentation and publications related to AsterixDB can be found at
> http://asterixdb.ics.uci.edu/.
> 
> 
> Initial Source
> 
> Current source resides in Google Code:
> https://code.google.com/p/asterixdb/ (query language and upper system
> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> system and storage management libraries).
> 
> 
> External Dependencies
> 
> AsterixDB depends on a number of Apache projects:
> 
> - Ant
> - Avro
> - ApacheDB JDO
> - Commons
> - Derby
> - Hadoop
> - Hive
> - HTTPComponents
> - Jakarta ORO
> - Maven
> - Tomcat
> - Thrift
> - Velocity
> - Wicket
> - Xerces
> 
> and other open source projects (organized by license):
> 
> -- ASL 2.0:
> - Jackson
> - Google Guava
> - Google Guice
> - JSON-simple
> - BoneCP
> - Microsoft Azure SDK
> - Netty
> - Rome
> - JetS3t
> - Groovy
> - Jettison
> - Plexus
> - Datanucleus (JDO)
> - Jetty
> - Twitter4J
> - Snappy-java
> 
> -- BSD:
> - Antlr
> - ObjectWeb ASM
> - Protobuf
> - JSCH
> - JavaCC
> - Paranamer
> - JLine
> - Stax
> - StringTemplate
> - xmlEnc
> 
> -- MIT
> - AppAssembler
> - SimpleLog4J
> 
> -- CDDL 1.0
> - Java Activation Framework
> - Java Transactions
> - Java Servlet API
> - Grizzly
> - gmbal
> - Glassfish
> 
> -- CDDL 1.1
> - Jersey
> - JAXB Reference Implementation
> 
> -- JSON License
> - JSON
> 
> -- EPL 1.0
> - JUnit
> 
> -- JDOM License
> - JDOM
> 
> -- Public Domain
> - xz
> - AOPAlliance
> 
> As all dependencies are managed using Apache Maven, none of the
> external libraries need to be packaged in a source distribution.
> 
> 
> Required Resources
> 
> Developer and user mailing lists
> 
> private@asterixdb.incubator.apache.org (with moderated subscriptions)
> commits@asterixdb.incubator.apache.org
> dev@asterixdb.incubator.apache.org
> users@asterixdb.incubator.apache.org
> 
> 
> A git repository
> 
> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
> 
> 
> A JIRA issue tracker
> 
> https://issues.apache.org/jira/browse/ASTERIXDB
> 
> 
> Initial Committers
> 
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository at Google
> Code).
> 
> Abdullah Alamoudi (bamousaa@gmail.com)
> Cameron Samak (eufery@gmail.com)
> Chen Li (chenli@gmail.com)
> Ian Maxon (imaxon@uci.edu)
> Ildar Absalyamov (ildar.absalyamov@gmail.com)
> Jianfeng Jia (jianfeng.jia@gmail.com)
> Keren Ouaknine (kereno@gmail.com)
> Markus Dreseler (apache@dreseler.de)
> Mike Carey (dtabass@apache.org)
> Murtadha Hubail (hubailmor@gmail.com)
> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
> Preston Carman (prestonc@apache.org)
> Raman Grover (RamanGrover29@gmail.com)
> Sattam Alsubaiee (salsubaiee@gmail.com)
> Steven Jacobs (sjaco002@apache.org)
> Taewoo Kim (wangsaeu@gmail.com)
> Till Westmann (tillw@apache.org)
> Vinayak Borkar (vinayakb@apache.org)
> Yingyi Bu (buyingyi@gmail.com)
> Young-Seok Kim (kisskys@gmail.com)
> Zach Heilbron (zheilbron@gmail.com)
> 
> 
> Affiliations
> 
> UC Irvine
> - Mike Carey
> - Chen Li
> - Ian Maxon
> - Yingyi Bu
> - Raman Grover
> - Pouria Pirzadeh
> - Young-Seok Kim
> - Cameron Samak
> - Taewoo Kim
> - Jianfeng Jia
> - Murtadha Hubail
> - Markus Dreseler
> 
> UC Riverside
> - Ildar Absalyamov
> - Preston Carman
> - Steven Jacobs
> 
> Hebrew University
> - Keren Ouaknine
> 
> Oracle
> - Till Westmann
> 
> X15 Software
> - Vinayak Borkar
> - Zach Heilbron
> 
> KACST Saudi Arabia
> - Sattam Alsubaiee
> 
> Saudi Aramco
> - Abdullah Alamoudi
> 
> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> non-UC committers are a mix of alumni who continue to contribute to
> the effort and individuals working with permission part-time (or in
> spare time) on this project.
> 
> 
> Sponsors
> 
> Champion
> 
> Chris Mattmann (NASA/JPL)
> 
> Nominated Mentors
> 
> TBD
> 
> Sponsoring Entity
> 
> The Apache Incubator
> 
> 
> 
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

