incubator-general mailing list archives

From "Alan D. Cabrera" <l...@toolazydogs.com>
Subject Re: [PROPOSAL] Apache AsterixDB Incubator
Date Tue, 20 Jan 2015 14:26:06 GMT
Should be fine.


Regards,
Alan

> On Jan 19, 2015, at 8:27 PM, Till Westmann <till@westmann.org> wrote:
> 
> Thank you.
> So far we’ve added 3 slots for mentors on the proposal - I hope that’ll be sufficient
> even for the relatively large number of new committers.
> 
> Till
> 
>> On Jan 19, 2015, at 8:17 PM, Henry Saputra <henry.saputra@gmail.com> wrote:
>> 
>> Thanks Till,
>> 
>> Will try to solicit more mentors to help.
>> Especially since the initial committers have mostly not been exposed to
>> contributing the Apache way.
>> 
>> - Henry
>> 
>> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <till@westmann.org> wrote:
>>> Hi Henry,
>>> 
>>> thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>>> 
>>> Even if your time is very limited we would be very happy to have you on board
>>> as a mentor.
>>> I’ll add you to the proposal.
>>> 
>>> Cheers,
>>> Till
>>> 
>>>> On Jan 19, 2015, at 10:26 AM, Henry Saputra <henry.saputra@gmail.com> wrote:
>>>> 
>>>> +1 This is GREAT News!
>>>> 
>>>> Was watching and trying AsterixDB last year and it looked in awesome shape.
>>>> 
>>>> I have my plate full but would love to help mentor this project to get
>>>> it going to ASF if needed!
>>>> 
>>>> - Henry
>>>> 
>>>> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>> Hi Folks,
>>>>> 
>>>>> I am pleased to bring forth the Apache AsterixDB proposal to the
>>>>> Apache Incubator as Champion, working in collaboration with the
>>>>> team. Please find the wiki proposal here:
>>>>> 
>>>>> https://wiki.apache.org/incubator/AsterixDBProposal
>>>>> 
>>>>> 
>>>>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>>>>> leave the discussion open for a week, and then look to call a VOTE
>>>>> hopefully end of next week if all is well.
>>>>> 
>>>>> Cheers!
>>>>> Chris Mattmann
>>>>> 
>>>>> =============================================================
>>>>> Apache AsterixDB Proposal
>>>>> 
>>>>> Abstract
>>>>> 
>>>>> Apache AsterixDB is a scalable big data management system (BDMS) that
>>>>> provides storage, management, and query capabilities for large
>>>>> collections of semi-structured data.
>>>>> 
>>>>> Proposal
>>>>> 
>>>>> AsterixDB is a big data management system (BDMS) with a feature set
>>>>> that makes it well-suited to needs such as web data warehousing and
>>>>> social data storage and analysis. Feature-wise, AsterixDB has:
>>>>> 
>>>>> * A NoSQL style data model (ADM) based on extending JSON with object
>>>>> database concepts.
>>>>> * An expressive and declarative query language (AQL) for querying
>>>>> semi-structured data.
>>>>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>>>>> execution of query plans.
>>>>> * Partitioned LSM-based data storage and indexing for efficient
>>>>> ingestion of newly arriving data.
>>>>> * Support for querying and indexing external data (e.g., in HDFS) as
>>>>> well as data stored within AsterixDB.
>>>>> * A rich set of primitive data types, including support for spatial,
>>>>> temporal, and textual data.
>>>>> * Indexing options that include B+ trees, R trees, and inverted
>>>>> keyword index support.
>>>>> * Basic transactional (concurrency and recovery) capabilities akin to
>>>>> those of a NoSQL store.
>>>>> 
>>>>> 
>>>>> Background and Rationale
>>>>> 
>>>>> In the world of relational databases, the need to tackle data volumes
>>>>> that exceed the capabilities of a single server led to the
>>>>> development of “shared-nothing” parallel database systems several
>>>>> decades ago. These systems spread data over a cluster based on a
>>>>> partitioning strategy, such as hash partitioning, and queries are
>>>>> processed by employing partitioned-parallel divide-and-conquer
>>>>> techniques. Since these systems are fronted by a high-level,
>>>>> declarative language (SQL), their users are shielded from the
>>>>> complexities of parallel programming. Parallel database systems have
>>>>> been an extremely successful application of parallel computing, and
>>>>> quite a number of commercial products exist today.
>>>>> 
>>>>> In the distributed systems world, the Web brought a need to index and
>>>>> query its huge content. SQL and relational databases were not the
>>>>> answer, though shared-nothing clusters again emerged as the hardware
>>>>> platform of choice. Google developed the Google File System (GFS) and
>>>>> MapReduce programming model to allow programmers to store and process
>>>>> Big Data by writing a few user-defined functions. The MapReduce
>>>>> framework applies these functions in parallel to data instances in
>>>>> distributed files (map) and to sorted groups of instances sharing a
>>>>> common key (reduce) -- not unlike the partitioned parallelism in
>>>>> parallel database systems. Apache's Hadoop MapReduce platform is the
>>>>> most prominent implementation of this paradigm for the rest of the
>>>>> Big Data community. On top of Hadoop and HDFS sit declarative
>>>>> languages like Pig and Hive that each compile down to Hadoop
>>>>> MapReduce jobs.
>>>>> 
>>>>> The big Web companies were also challenged by extreme user bases
>>>>> (100s of millions of users) and needed fast simple lookups and
>>>>> updates to very large keyed data sets like user profiles. SQL
>>>>> databases were deemed either too expensive or not scalable, so the
>>>>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>>>>> popular key-value stores, in this space. MongoDB and Couchbase are
>>>>> other open source alternatives (document stores).
>>>>> 
>>>>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>>>>> as well as the strong demand for Big Data analytics engines today,
>>>>> that there is a strong (and growing!) need to store, process, *and*
>>>>> query large volumes of semi-structured data in many application
>>>>> areas. Until very recently, developers have had to "choose" between
>>>>> using big data analytics engines like Apache Hive or Apache Spark,
>>>>> which can do complex query processing and analysis over HDFS-resident
>>>>> files, and flexible but low-function data stores like MongoDB or
>>>>> Apache HBase. (The Apache Phoenix project,
>>>>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>>>>> aims to bridge between these choices.)
>>>>> 
>>>>> AsterixDB is a highly scalable data management system that can store,
>>>>> index, and manage semi-structured data, e.g., much like MongoDB, but
>>>>> it also supports a full-power query language with the expressiveness
>>>>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>>>>> stores and manages data, so AsterixDB can exploit its knowledge of
>>>>> data partitioning and the availability of indexes to avoid always
>>>>> scanning data set(s) to process queries. Somewhat surprisingly, there
>>>>> is no open source parallel database system (relational or otherwise)
>>>>> available to developers today -- AsterixDB aims to fill this need.
>>>>> Since Apache is where the majority of today's most important Big
>>>>> Data technologies live, the ASF seems like the obvious home for a
>>>>> system like AsterixDB.
>>>>> 
>>>>> Current Status
>>>>> 
>>>>> The current version of AsterixDB was co-developed by a team of
>>>>> faculty, staff, and students at UC Irvine and UC Riverside. The
>>>>> project was initiated as a large NSF-sponsored project in 2009, the
>>>>> goal of which was to combine the best ideas from the parallel
>>>>> database world, the then new Hadoop world, and the semi-structured
>>>>> (e.g., XML/JSON) data world in order to create a next-generation
>>>>> BDMS. A first informal open source release was made four years later,
>>>>> in June of 2013, under the Apache License 2.0.
>>>>> 
>>>>> 
>>>>> Meritocracy
>>>>> 
>>>>> The current developers are familiar with meritocratic open source
>>>>> development at Apache. Apache was chosen specifically because we want
>>>>> to encourage this style of development for the project.
>>>>> 
>>>>> 
>>>>> Community
>>>>> 
>>>>> While AsterixDB started as a university project it has developed into
>>>>> a community. A number of the initial committers started contributing
>>>>> in academia and continue to actively participate and contribute after
>>>>> graduation. We seek to further develop the developer and user
>>>>> communities. One ongoing way to broaden the community is through
>>>>> academic collaborations (currently with IIT Mumbai in India
>>>>> and TU Berlin in Germany). During incubation we will also explicitly
>>>>> seek increased industrial participation.
>>>>> 
>>>>> Some indicators of the effort's development community and history can
>>>>> be found at:
>>>>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
>>>>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>>>> 
>>>>> 
>>>>> Core Developers
>>>>> 
>>>>> The core developers of the project are diverse, although initially UC
>>>>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>>>>> other 50% are from other academic institutions (UC Riverside and the
>>>>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>>>>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>>>> 
>>>>> 
>>>>> Alignment
>>>>> 
>>>>> Apache is, by far, the most natural home for taking the AsterixDB
>>>>> project forward. A large fraction of today's top Big Data
>>>>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>>>>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>>>>> significant gap -- the parallel data management system gap -- that
>>>>> exists in the Big Data open source world. It is well-aligned with a
>>>>> number of the Apache projects, e.g., it has strong support for
>>>>> accessing and indexing external data in HDFS, and it uses YARN as an
>>>>> answer to basic cluster resource management. AsterixDB also seeks to
>>>>> achieve an Apache-style development model; it is seeking a broader
>>>>> community of contributors and users in order to achieve its full
>>>>> potential and value to the Big Data community.
>>>>> 
>>>>> There are also a number of related Apache projects and dependencies
>>>>> that will be mentioned below in the Relationships with Other Apache
>>>>> products section.
>>>>> 
>>>>> 
>>>>> Known Risks
>>>>> 
>>>>> Orphaned products
>>>>> 
>>>>> Given the current level of intellectual investment in AsterixDB, the
>>>>> risk of the project being abandoned is very small. The UCI/UCR
>>>>> faculty team leads are highly incentivized to continue development
>>>>> since the database groups at UC Irvine and UC Riverside are both
>>>>> reliant on AsterixDB as a platform for long-term graduate research
>>>>> projects. UC San Diego is also beginning to contribute to the code
>>>>> base, and a collaboration involving public health applications is
>>>>> forming with UCLA. The work on AsterixDB is managed via a mix of
>>>>> mailing list discussions supplemented by weekly project status
>>>>> meetings which are summarized on the mailing list. Typical (local
>>>>> plus Skype-in) attendance to the weekly status meetings runs at about
>>>>> 20 active contributors.
>>>>> 
>>>>> 
>>>>> Inexperience with Open Source
>>>>> 
>>>>> AsterixDB and Hyracks were completely developed in Open Source under
>>>>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>>>>> lists are available on Google Code and discussions and decisions
>>>>> happen on the mailing lists (which is necessary due to the geographic
>>>>> distribution of the current developers).
>>>>> 
>>>>> Also a few of the initial committers have contributed to Apache
>>>>> projects. Vinayak Borkar is a committer on the Apache Helix and
>>>>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>>>>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>>>>> on the Apache VXQuery project.
>>>>> 
>>>>> 
>>>>> Relationships with Other Apache Products
>>>>> 
>>>>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>>>>> is also included in the AsterixDB code base.
>>>>> 
>>>>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>>>>> is support for accessing external data in HDFS (and Hive formats),
>>>>> and resource management and system administration features are in the
>>>>> process of being migrated to YARN.
>>>>> 
>>>>> AsterixDB's AQL query facilities offer comparable query power to
>>>>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>>>>> differs in storing and indexing data and thus being able to quickly
>>>>> answer small and medium queries without large HDFS data scans -
>>>>> thereby targeting a different class of use cases.
>>>>> 
>>>>> AsterixDB's data storage and indexing facilities are similar to those
>>>>> of HBase, but AsterixDB differs in being a much more complete and
>>>>> queryable BDMS (not just a key-value style store).
>>>>> 
>>>>> AsterixDB's target use cases are not in-memory processing or
>>>>> iterative algorithm support, making AsterixDB complementary to the
>>>>> Apache Spark platform. (Spark interoperability is on our longer-term
>>>>> to-do wishlist.)
>>>>> 
>>>>> 
>>>>> Homogeneous Developers
>>>>> 
>>>>> As mentioned before the current community is already organizationally
>>>>> and geographically distributed - and we would like to increase the
>>>>> heterogeneity.
>>>>> 
>>>>> 
>>>>> Reliance on Salaried Developers
>>>>> 
>>>>> Of the initial committers only 3 are full-time UCI staff. The other
>>>>> committers are a mix of students, alumni who continue to contribute
>>>>> to the effort, and individuals working with permission part-time (or
>>>>> in spare time) on this project.
>>>>> 
>>>>> 
>>>>> An Excessive Fascination with the Apache Brand
>>>>> 
>>>>> We believe in the processes, systems, and framework Apache has put in
>>>>> place. Apache is also known to foster a great community around their
>>>>> projects and provide exposure. While brand is important, our
>>>>> fascination with it is not excessive. We believe that the ASF is the
>>>>> right home for AsterixDB and that having AsterixDB inside of the ASF
>>>>> will lead to a better long-term outcome for the Big Data community.
>>>>> 
>>>>> 
>>>>> Documentation
>>>>> 
>>>>> Documentation and publications related to AsterixDB can be found at
>>>>> http://asterixdb.ics.uci.edu/.
>>>>> 
>>>>> 
>>>>> Initial Source
>>>>> 
>>>>> Current source resides in Google Code:
>>>>> https://code.google.com/p/asterixdb/ (query language and upper system
>>>>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>>>>> system and storage management libraries).
>>>>> 
>>>>> 
>>>>> External Dependencies
>>>>> 
>>>>> AsterixDB depends on a number of Apache projects:
>>>>> 
>>>>> - Ant
>>>>> - Avro
>>>>> - ApacheDB JDO
>>>>> - Commons
>>>>> - Derby
>>>>> - Hadoop
>>>>> - Hive
>>>>> - HTTPComponents
>>>>> - Jakarta ORO
>>>>> - Maven
>>>>> - Tomcat
>>>>> - Thrift
>>>>> - Velocity
>>>>> - Wicket
>>>>> - Xerces
>>>>> 
>>>>> and other open source projects (organized by license):
>>>>> 
>>>>> -- ASL 2.0:
>>>>> - Jackson
>>>>> - Google Guava
>>>>> - Google Guice
>>>>> - JSON-simple
>>>>> - BoneCP
>>>>> - Microsoft Azure SDK
>>>>> - Netty
>>>>> - Rome
>>>>> - JetS3t
>>>>> - Groovy
>>>>> - Jettison
>>>>> - Plexus
>>>>> - Datanucleus (JDO)
>>>>> - Jetty
>>>>> - Twitter4J
>>>>> - Snappy-java
>>>>> 
>>>>> -- BSD:
>>>>> - Antlr
>>>>> - ObjectWeb ASM
>>>>> - Protobuf
>>>>> - JSCH
>>>>> - JavaCC
>>>>> - Paranamer
>>>>> - JLine
>>>>> - Stax
>>>>> - StringTemplate
>>>>> - xmlEnc
>>>>> 
>>>>> -- MIT
>>>>> - AppAssembler
>>>>> - SimpleLog4J
>>>>> 
>>>>> -- CDDL 1.0
>>>>> - Java Activation Framework
>>>>> - Java Transactions
>>>>> - Java Servlet API
>>>>> - Grizzly
>>>>> - gmbal
>>>>> - Glassfish
>>>>> 
>>>>> -- CDDL 1.1
>>>>> - Jersey
>>>>> - JAXB Reference Implementation
>>>>> 
>>>>> -- JSON License
>>>>> - JSON
>>>>> 
>>>>> -- EPL 1.0
>>>>> - JUnit
>>>>> 
>>>>> -- JDOM License
>>>>> - JDOM
>>>>> 
>>>>> -- Public Domain
>>>>> - xz
>>>>> - AOPAlliance
>>>>> 
>>>>> As all dependencies are managed using Apache Maven, none of the
>>>>> external libraries need to be packaged in a source distribution.
>>>>> 
>>>>> 
>>>>> Required Resources
>>>>> 
>>>>> Developer and user mailing lists
>>>>> 
>>>>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>>>>> commits@asterixdb.incubator.apache.org
>>>>> dev@asterixdb.incubator.apache.org
>>>>> users@asterixdb.incubator.apache.org
>>>>> 
>>>>> 
>>>>> A git repository
>>>>> 
>>>>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>>>> 
>>>>> 
>>>>> A JIRA issue tracker
>>>>> 
>>>>> https://issues.apache.org/jira/browse/ASTERIXDB
>>>>> 
>>>>> 
>>>>> Initial Committers
>>>>> 
>>>>> The following is a list of the planned initial Apache committers (the
>>>>> active subset of the committers for the current repository at Google
>>>>> Code).
>>>>> 
>>>>> Abdullah Alamoudi (bamousaa@gmail.com)
>>>>> Cameron Samak (eufery@gmail.com)
>>>>> Chen Li (chenli@gmail.com)
>>>>> Ian Maxon (imaxon@uci.edu)
>>>>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>>>>> Jianfeng Jia (jianfeng.jia@gmail.com)
>>>>> Keren Ouaknine (kereno@gmail.com)
>>>>> Markus Dreseler (apache@dreseler.de)
>>>>> Mike Carey (dtabass@apache.org)
>>>>> Murtadha Hubail (hubailmor@gmail.com)
>>>>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>>>>> Preston Carman (prestonc@apache.org)
>>>>> Raman Grover (RamanGrover29@gmail.com)
>>>>> Sattam Alsubaiee (salsubaiee@gmail.com)
>>>>> Steven Jacobs (sjaco002@apache.org)
>>>>> Taewoo Kim (wangsaeu@gmail.com)
>>>>> Till Westmann (tillw@apache.org)
>>>>> Vinayak Borkar (vinayakb@apache.org)
>>>>> Yingyi Bu (buyingyi@gmail.com)
>>>>> Young-Seok Kim (kisskys@gmail.com)
>>>>> Zach Heilbron (zheilbron@gmail.com)
>>>>> 
>>>>> 
>>>>> Affiliations
>>>>> 
>>>>> UC Irvine
>>>>> - Mike Carey
>>>>> - Chen Li
>>>>> - Ian Maxon
>>>>> - Yingyi Bu
>>>>> - Raman Grover
>>>>> - Pouria Pirzadeh
>>>>> - Young-Seok Kim
>>>>> - Cameron Samak
>>>>> - Taewoo Kim
>>>>> - Jianfeng Jia
>>>>> - Murtadha Hubail
>>>>> - Markus Dreseler
>>>>> 
>>>>> UC Riverside
>>>>> - Ildar Absalyamov
>>>>> - Preston Carman
>>>>> - Steven Jacobs
>>>>> 
>>>>> Hebrew University
>>>>> - Keren Ouaknine
>>>>> 
>>>>> Oracle
>>>>> - Till Westmann
>>>>> 
>>>>> X15 Software
>>>>> - Vinayak Borkar
>>>>> - Zach Heilbron
>>>>> 
>>>>> KACST Saudi Arabia
>>>>> - Sattam Alsubaiee
>>>>> 
>>>>> Saudi Aramco
>>>>> - Abdullah Alamoudi
>>>>> 
>>>>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>>>>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>>>>> non-UC committers are a mix of alumni who continue to contribute to
>>>>> the effort and individuals working with permission part-time (or in
>>>>> spare time) on this project.
>>>>> 
>>>>> 
>>>>> Sponsors
>>>>> 
>>>>> Champion
>>>>> 
>>>>> Chris Mattmann (NASA/JPL)
>>>>> 
>>>>> Nominated Mentors
>>>>> 
>>>>> TBD
>>>>> 
>>>>> Sponsoring Entity
>>>>> 
>>>>> The Apache Incubator
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: chris.a.mattmann@nasa.gov
>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
> 



