incubator-general mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: [PROPOSAL] Apache AsterixDB Incubator
Date Tue, 20 Jan 2015 01:47:38 GMT
Ditto - thanks for the support!
Cheers,
Mike

On 1/19/15 5:39 PM, Till Westmann wrote:
>
>> On Jan 19, 2015, at 11:34 AM, jan i <jani@apache.org> wrote:
>>
>> Looks like a real challenging project, and the proposal looks as if 
>> it has already been through a couple of refinement rounds.
>>
>> Count on my +1, when it comes to voting.
>
> Will do!
>
> Thanks,
> Till
>
>>
>> rgds
>> jan i
>>
>> On 19 January 2015 at 19:26, Henry Saputra <henry.saputra@gmail.com> wrote:
>>
>>     +1 This is GREAT News!
>>
>>     Was watching and trying AsterixDB last year and it looked in awesome
>>     shape.
>>
>>     I have my plate full but would love to help mentor this project to get
>>     it going to ASF if needed!
>>
>>     - Henry
>>
>>     On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>>     <chris.a.mattmann@jpl.nasa.gov> wrote:
>>     > Hi Folks,
>>     >
>>     > I am pleased to bring forth the Apache AsterixDB proposal to the
>>     > Apache Incubator as Champion, working in collaboration with the
>>     > team. Please find the wiki proposal here:
>>     >
>>     > https://wiki.apache.org/incubator/AsterixDBProposal
>>     >
>>     >
>>     > Full text of the proposal is below. Please discuss and enjoy. I’ll
>>     > leave the discussion open for a week, and then look to call a VOTE
>>     > hopefully end of next week if all is well.
>>     >
>>     > Cheers!
>>     > Chris Mattmann
>>     >
>>     > =============================================================
>>     > Apache AsterixDB Proposal
>>     >
>>     > Abstract
>>     >
>>     > Apache AsterixDB is a scalable big data management system (BDMS) that
>>     > provides storage, management, and query capabilities for large
>>     > collections of semi-structured data.
>>     >
>>     > Proposal
>>     >
>>     > AsterixDB is a big data management system (BDMS) whose features make it
>>     > well-suited to needs such as web data warehousing and social data
>>     > storage and analysis. Feature-wise, AsterixDB has:
>>     >
>>     > * A NoSQL style data model (ADM) based on extending JSON with object
>>     >   database concepts.
>>     > * An expressive and declarative query language (AQL) for querying
>>     >   semi-structured data.
>>     > * A runtime query execution engine, Hyracks, for partitioned-parallel
>>     >   execution of query plans.
>>     > * Partitioned LSM-based data storage and indexing for efficient
>>     >   ingestion of newly arriving data.
>>     > * Support for querying and indexing external data (e.g., in HDFS) as
>>     >   well as data stored within AsterixDB.
>>     > * A rich set of primitive data types, including support for spatial,
>>     >   temporal, and textual data.
>>     > * Indexing options that include B+ trees, R trees, and inverted
>>     >   keyword index support.
>>     > * Basic transactional (concurrency and recovery) capabilities akin to
>>     >   those of a NoSQL store.
>>     >
>>     >
>>     > Background and Rationale
>>     >
>>     > In the world of relational databases, the need to tackle data volumes
>>     > that exceed the capabilities of a single server led to the
>>     > development of “shared-nothing” parallel database systems several
>>     > decades ago. These systems spread data over a cluster based on a
>>     > partitioning strategy, such as hash partitioning, and queries are
>>     > processed by employing partitioned-parallel divide-and-conquer
>>     > techniques. Since these systems are fronted by a high-level,
>>     > declarative language (SQL), their users are shielded from the
>>     > complexities of parallel programming. Parallel database systems have
>>     > been an extremely successful application of parallel computing, and
>>     > quite a number of commercial products exist today.
>>     >
>>     > In the distributed systems world, the Web brought a need to index and
>>     > query its huge content. SQL and relational databases were not the
>>     > answer, though shared-nothing clusters again emerged as the hardware
>>     > platform of choice. Google developed the Google File System (GFS) and
>>     > MapReduce programming model to allow programmers to store and process
>>     > Big Data by writing a few user-defined functions. The MapReduce
>>     > framework applies these functions in parallel to data instances in
>>     > distributed files (map) and to sorted groups of instances sharing a
>>     > common key (reduce) -- not unlike the partitioned parallelism in
>>     > parallel database systems. Apache's Hadoop MapReduce platform is the
>>     > most prominent implementation of this paradigm for the rest of the
>>     > Big Data community. On top of Hadoop and HDFS sit declarative
>>     > languages like Pig and Hive that each compile down to Hadoop
>>     > MapReduce jobs.
>>     >
>>     > The big Web companies were also challenged by extreme user bases
>>     > (100s of millions of users) and needed fast simple lookups and
>>     > updates to very large keyed data sets like user profiles. SQL
>>     > databases were deemed either too expensive or not scalable, so the
>>     > “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>>     > popular key-value stores, in this space. MongoDB and Couchbase are
>>     > other open source alternatives (document stores).
>>     >
>>     > It is evident from the rapidly growing popularity of "NoSQL" stores,
>>     > as well as the strong demand for Big Data analytics engines today,
>>     > that there is a strong (and growing!) need to store, process, *and*
>>     > query large volumes of semi-structured data in many application
>>     > areas. Until very recently, developers have had to "choose" between
>>     > using big data analytics engines like Apache Hive or Apache Spark,
>>     > which can do complex query processing and analysis over HDFS-resident
>>     > files, and flexible but low-function data stores like MongoDB or
>>     > Apache HBase. (The Apache Phoenix project,
>>     > http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>>     > aims to bridge between these choices.)
>>     >
>>     > AsterixDB is a highly scalable data management system that can store,
>>     > index, and manage semi-structured data, e.g., much like MongoDB, but
>>     > it also supports a full-power query language with the expressiveness
>>     > of SQL (and more). Unlike analytics engines like Hive or Spark, it
>>     > stores and manages data, so AsterixDB can exploit its knowledge of
>>     > data partitioning and the availability of indexes to avoid always
>>     > scanning data set(s) to process queries. Somewhat surprisingly, there
>>     > is no open source parallel database system (relational or otherwise)
>>     > available to developers today -- AsterixDB aims to fill this need.
>>     > Since Apache is where the majority of today's most important Big
>>     > Data technologies live, the ASF seems like the obvious home for a
>>     > system like AsterixDB.
>>     >
>>     > Current Status
>>     >
>>     > The current version of AsterixDB was co-developed by a team of
>>     > faculty, staff, and students at UC Irvine and UC Riverside. The
>>     > project was initiated as a large NSF-sponsored project in 2009, the
>>     > goal of which was to combine the best ideas from the parallel
>>     > database world, the then new Hadoop world, and the semi-structured
>>     > (e.g., XML/JSON) data world in order to create a next-generation
>>     > BDMS. A first informal open source release was made four years later,
>>     > in June of 2013, under the Apache Software License 2.0.
>>     >
>>     >
>>     > Meritocracy
>>     >
>>     > The current developers are familiar with meritocratic open source
>>     > development at Apache. Apache was chosen specifically because we want
>>     > to encourage this style of development for the project.
>>     >
>>     >
>>     > Community
>>     >
>>     > While AsterixDB started as a university project, it has developed into
>>     > a community. A number of the initial committers started contributing
>>     > in academia and continue to actively participate and contribute after
>>     > graduation. We also seek to further develop the developer and user
>>     > communities. One ongoing way of broadening the community is
>>     > through academic collaborations (currently with IIT Mumbai in India
>>     > and TU Berlin in Germany). During incubation we will also explicitly
>>     > seek increased industrial participation.
>>     >
>>     > Some indicators of the effort's development community and history can
>>     > be found at:
>>     > https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
>>     > https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>     >
>>     >
>>     > Core Developers
>>     >
>>     > The core developers of the project are diverse, although initially UC
>>     > Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>>     > other 50% are from other academic institutions (UC Riverside and the
>>     > Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>>     > IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>     >
>>     >
>>     > Alignment
>>     >
>>     > Apache is, by far, the most natural home for taking the AsterixDB
>>     > project forward. A large fraction of today's top Big Data
>>     > technologies have their homes in Apache, including Hadoop, YARN, Pig,
>>     > Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>>     > significant gap -- the parallel data management system gap -- that
>>     > exists in the Big Data open source world. It is well-aligned with a
>>     > number of the Apache projects, e.g., it has strong support for
>>     > accessing and indexing external data in HDFS, and it uses YARN as an
>>     > answer to basic cluster resource management. AsterixDB also seeks to
>>     > achieve an Apache-style development model; it is seeking a broader
>>     > community of contributors and users in order to achieve its full
>>     > potential and value to the Big Data community.
>>     >
>>     > There are also a number of related Apache projects and dependencies
>>     > that will be mentioned below in the Relationships with Other Apache
>>     > products section.
>>     >
>>     >
>>     > Known Risks
>>     >
>>     > Orphaned products
>>     >
>>     > Given the current level of intellectual investment in AsterixDB, the
>>     > risk of the project being abandoned is very small. The UCI/UCR
>>     > faculty team leads are highly incentivized to continue development
>>     > since the database groups at UC Irvine and UC Riverside are both
>>     > reliant on AsterixDB as a platform for long-term graduate research
>>     > projects. UC San Diego is also beginning to contribute to the code
>>     > base, and a collaboration involving public health applications is
>>     > forming with UCLA. The work on AsterixDB is managed via a mix of
>>     > mailing list discussions supplemented by weekly project status
>>     > meetings which are summarized on the mailing list. Typical (local
>>     > plus Skype-in) attendance at the weekly status meetings runs at about
>>     > 20 active contributors.
>>     >
>>     >
>>     > Inexperience with Open Source
>>     >
>>     > AsterixDB and Hyracks were completely developed in Open Source under
>>     > the ASL 2.0. The source code repositories, issue tracker, and mailing
>>     > lists are available on Google Code and discussions and decisions
>>     > happen on the mailing lists (which is necessary due to the geographic
>>     > distribution of the current developers).
>>     >
>>     > Also a few of the initial committers have contributed to Apache
>>     > projects. Vinayak Borkar is a committer on the Apache Helix and
>>     > Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>>     > and an IPMC member. Preston Carman and Steven Jacobs are committers
>>     > on the Apache VXQuery project.
>>     >
>>     >
>>     > Relationships with Other Apache Products
>>     >
>>     > Apache VXQuery is based on the Hyracks data-parallel runtime, which
>>     > is also included in the AsterixDB code base.
>>     >
>>     > AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>>     > is support for accessing external data in HDFS (and Hive formats),
>>     > and resource management and system administration features are in the
>>     > process of being migrated to YARN.
>>     >
>>     > AsterixDB's AQL query facilities offer comparable query power to
>>     > Apache's Pig and Hive systems for big data analytics. AsterixDB
>>     > differs in storing and indexing data and thus being able to quickly
>>     > answer small and medium queries without large HDFS data scans -
>>     > thereby targeting a different class of use cases.
>>     >
>>     > AsterixDB's data storage and indexing facilities are similar to those
>>     > of HBase, but AsterixDB differs in being a much more complete and
>>     > queryable BDMS (not just a key-value style store).
>>     >
>>     > AsterixDB's target use cases are not in-memory processing or
>>     > iterative algorithm support, making AsterixDB complementary to the
>>     > Apache Spark platform. (Spark interoperability is on our longer-term
>>     > to-do wishlist.)
>>     >
>>     >
>>     > Homogeneous Developers
>>     >
>>     > As mentioned before, the current community is already organizationally
>>     > and geographically distributed - and we would like to increase the
>>     > heterogeneity.
>>     >
>>     >
>>     > Reliance on Salaried Developers
>>     >
>>     > Of the initial committers only 3 are full-time UCI staff. The other
>>     > committers are a mix of students, alumni who continue to contribute
>>     > to the effort, and individuals working with permission part-time (or
>>     > in spare time) on this project.
>>     >
>>     >
>>     > An Excessive Fascination with the Apache Brand
>>     >
>>     > We believe in the processes, systems, and framework Apache has put in
>>     > place. Apache is also known to foster a great community around their
>>     > projects and provide exposure. While brand is important, our
>>     > fascination with it is not excessive. We believe that the ASF is the
>>     > right home for AsterixDB and that having AsterixDB inside of the ASF
>>     > will lead to a better long-term outcome for the Big Data community.
>>     >
>>     >
>>     > Documentation
>>     >
>>     > Documentation and publications related to AsterixDB can be found at
>>     > http://asterixdb.ics.uci.edu/.
>>     >
>>     >
>>     > Initial Source
>>     >
>>     > Current source resides in Google code:
>>     > https://code.google.com/p/asterixdb/ (query language and upper system
>>     > layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>>     > system and storage management libraries).
>>     >
>>     >
>>     > External Dependencies
>>     >
>>     > AsterixDB depends on a number of Apache projects:
>>     >
>>     > - Ant
>>     > - Avro
>>     > - ApacheDB JDO
>>     > - Commons
>>     > - Derby
>>     > - Hadoop
>>     > - Hive
>>     > - HTTPComponents
>>     > - Jakarta ORO
>>     > - Maven
>>     > - Tomcat
>>     > - Thrift
>>     > - Velocity
>>     > - Wicket
>>     > - Xerces
>>     >
>>     > and other open source projects (organized by license):
>>     >
>>     > -- ASL 2.0:
>>     >  - Jackson
>>     >  - Google Guava
>>     >  - Google Guice
>>     >  - JSON-simple
>>     >  - BoneCP
>>     >  - Microsoft Azure SDK
>>     >  - Netty
>>     >  - Rome
>>     >  - JetS3t
>>     >  - Groovy
>>     >  - Jettison
>>     >  - Plexus
>>     >  - Datanucleus (JDO)
>>     >  - Jetty
>>     >  - Twitter4J
>>     >  - Snappy-java
>>     >
>>     > -- BSD:
>>     >  - Antlr
>>     >  - ObjectWeb ASM
>>     >  - Protobuf
>>     >  - JSCH
>>     >  - JavaCC
>>     >  - Paranamer
>>     >  - JLine
>>     >  - Stax
>>     >  - StringTemplate
>>     >  - xmlEnc
>>     >
>>     > -- MIT
>>     >  - AppAssembler
>>     >  - SimpleLog4J
>>     >
>>     > -- CDDL 1.0
>>     >  - Java Activation Framework
>>     >  - Java Transactions
>>     >  - Java Servlet API
>>     >  - Grizzly
>>     >  - gmbal
>>     >  - Glassfish
>>     >
>>     > -- CDDL 1.1
>>     >  - Jersey
>>     >  - JAXB Reference Implementation
>>     >
>>     > -- JSON License
>>     >  - JSON
>>     >
>>     > -- EPL 1.0
>>     >  - JUnit
>>     >
>>     > -- JDOM License
>>     >  - JDOM
>>     >
>>     > -- Public Domain
>>     >  - xz
>>     >  - AOPAlliance
>>     >
>>     > As all dependencies are managed using Apache Maven, none of the
>>     > external libraries need to be packaged in a source distribution.
>>     >
>>     >
>>     > Required Resources
>>     >
>>     > Developer and user mailing lists
>>     >
>>     > private@asterixdb.incubator.apache.org (with moderated subscriptions)
>>     > commits@asterixdb.incubator.apache.org
>>     > dev@asterixdb.incubator.apache.org
>>     > users@asterixdb.incubator.apache.org
>>     >
>>     >
>>     > A git repository
>>     >
>>     > https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>     >
>>     >
>>     > A JIRA issue tracker
>>     >
>>     > https://issues.apache.org/jira/browse/ASTERIXDB
>>     >
>>     >
>>     > Initial Committers
>>     >
>>     > The following is a list of the planned initial Apache committers (the
>>     > active subset of the committers for the current repository at Google
>>     > code).
>>     >
>>     > Abdullah Alamoudi (bamousaa@gmail.com)
>>     > Cameron Samak (eufery@gmail.com)
>>     > Chen Li (chenli@gmail.com)
>>     > Ian Maxon (imaxon@uci.edu)
>>     > Ildar Absalyamov (ildar.absalyamov@gmail.com)
>>     > Jianfeng Jia (jianfeng.jia@gmail.com)
>>     > Keren Ouaknine (kereno@gmail.com)
>>     > Markus Dreseler (apache@dreseler.de)
>>     > Mike Carey (dtabass@apache.org)
>>     > Murtadha Hubail (hubailmor@gmail.com)
>>     > Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>>     > Preston Carman (prestonc@apache.org)
>>     > Raman Grover (RamanGrover29@gmail.com)
>>     > Sattam Alsubaiee (salsubaiee@gmail.com)
>>     > Steven Jacobs (sjaco002@apache.org)
>>     > Taewoo Kim (wangsaeu@gmail.com)
>>     > Till Westmann (tillw@apache.org)
>>     > Vinayak Borkar (vinayakb@apache.org)
>>     > Yingyi Bu (buyingyi@gmail.com)
>>     > Young-Seok Kim (kisskys@gmail.com)
>>     > Zach Heilbron (zheilbron@gmail.com)
>>     >
>>     >
>>     > Affiliations
>>     >
>>     > UC Irvine
>>     > - Mike Carey
>>     > - Chen Li
>>     > - Ian Maxon
>>     > - Yingyi Bu
>>     > - Raman Grover
>>     > - Pouria Pirzadeh
>>     > - Young-Seok Kim
>>     > - Cameron Samak
>>     > - Taewoo Kim
>>     > - Jianfeng Jia
>>     > - Murtadha Hubail
>>     > - Markus Dreseler
>>     >
>>     > UC Riverside
>>     > - Ildar Absalyamov
>>     > - Preston Carman
>>     > - Steven Jacobs
>>     >
>>     > Hebrew University
>>     > - Keren Ouaknine
>>     >
>>     > Oracle
>>     > - Till Westmann
>>     >
>>     > X15 Software
>>     > - Vinayak Borkar
>>     > - Zach Heilbron
>>     >
>>     > KACST Saudi Arabia
>>     > - Sattam Alsubaiee
>>     >
>>     > Saudi Aramco
>>     > - Abdullah Alamoudi
>>     >
>>     > Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>>     > (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>>     > non-UC committers are a mix of alumni who continue to contribute to
>>     > the effort and individuals working with permission part-time (or in
>>     > spare time) on this project.
>>     >
>>     >
>>     > Sponsors
>>     >
>>     > Champion
>>     >
>>     > Chris Mattmann (NASA/JPL)
>>     >
>>     > Nominated Mentors
>>     >
>>     > TBD
>>     >
>>     > Sponsoring Entity
>>     >
>>     > The Apache Incubator
>>     >
>>     >
>>     >
>>     >
>>     >
>>     > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>     > Chris Mattmann, Ph.D.
>>     > Chief Architect
>>     > Instrument Software and Science Data Systems Section (398)
>>     > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>     > Office: 168-519, Mailstop: 168-527
>>     > Email: chris.a.mattmann@nasa.gov
>>     > WWW: http://sunset.usc.edu/~mattmann/
>>     > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>     > Adjunct Associate Professor, Computer Science Department
>>     > University of Southern California, Los Angeles, CA 90089 USA
>>     > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>     >
>>     >
>>     >
>>     >
>>
>>     ---------------------------------------------------------------------
>>     To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>     For additional commands, e-mail: general-help@incubator.apache.org
>>
>>
>

