incubator-general mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: [PROPOSAL] Apache AsterixDB Incubator
Date Tue, 20 Jan 2015 16:37:15 GMT
Wonderful; thanks, Ted!!
Cheers,
Mike

On 1/19/15 11:29 PM, Ted Dunning wrote:
>
> Chris just asked me under separate cover.
>
> I am happy to help out as mentor.
>
>
>
> On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <henry.saputra@gmail.com> wrote:
>
>     Thanks Till,
>
>     Will try to solicit more mentors to help.
>     Especially since most of the initial committers have not yet been
>     exposed to contributing the Apache way.
>
>     - Henry
>
>     On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <till@westmann.org> wrote:
>     > Hi Henry,
>     >
>     > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>     >
>     > Even if your time is very limited we would be very happy to have you
>     > on board as a mentor.
>     > I’ll add you to the proposal.
>     >
>     > Cheers,
>     > Till
>     >
>     >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <henry.saputra@gmail.com> wrote:
>     >>
>     >> +1 This is GREAT News!
>     >>
>     >> Was watching and trying AsterixDB last year and it looked to be in
>     >> awesome shape.
>     >>
>     >> I have my plate full but would love to help mentor this project to get
>     >> it going to ASF if needed!
>     >>
>     >> - Henry
>     >>
>     >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>     >> <chris.a.mattmann@jpl.nasa.gov> wrote:
>     >>> Hi Folks,
>     >>>
>     >>> I am pleased to bring forth the Apache AsterixDB proposal to the
>     >>> Apache Incubator as Champion, working in collaboration with the
>     >>> team. Please find the wiki proposal here:
>     >>>
>     >>> https://wiki.apache.org/incubator/AsterixDBProposal
>     >>>
>     >>>
>     >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>     >>> leave the discussion open for a week, and then look to call a VOTE
>     >>> hopefully end of next week if all is well.
>     >>>
>     >>> Cheers!
>     >>> Chris Mattmann
>     >>>
>     >>> =============================================================
>     >>> Apache AsterixDB Proposal
>     >>>
>     >>> Abstract
>     >>>
>     >>> Apache AsterixDB is a scalable big data management system (BDMS) that
>     >>> provides storage, management, and query capabilities for large
>     >>> collections of semi-structured data.
>     >>>
>     >>> Proposal
>     >>>
>     >>> AsterixDB is a big data management system (BDMS) with a rich feature
>     >>> set that makes it well-suited to needs such as web data warehousing
>     >>> and social data storage and analysis. AsterixDB has:
>     >>>
>     >>> * A NoSQL style data model (ADM) based on extending JSON with object
>     >>>  database concepts.
>     >>> * An expressive and declarative query language (AQL) for querying
>     >>>  semi-structured data.
>     >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>     >>>  execution of query plans.
>     >>> * Partitioned LSM-based data storage and indexing for efficient
>     >>>  ingestion of newly arriving data.
>     >>> * Support for querying and indexing external data (e.g., in HDFS) as
>     >>>  well as data stored within AsterixDB.
>     >>> * A rich set of primitive data types, including support for spatial,
>     >>>  temporal, and textual data.
>     >>> * Indexing options that include B+ trees, R trees, and inverted
>     >>>  keyword index support.
>     >>> * Basic transactional (concurrency and recovery) capabilities akin to
>     >>>  those of a NoSQL store.
>     >>>
>     >>>
>     >>> Background and Rationale
>     >>>
>     >>> In the world of relational databases, the need to tackle data volumes
>     >>> that exceed the capabilities of a single server led to the
>     >>> development of “shared-nothing” parallel database systems several
>     >>> decades ago. These systems spread data over a cluster based on a
>     >>> partitioning strategy, such as hash partitioning, and queries are
>     >>> processed by employing partitioned-parallel divide-and-conquer
>     >>> techniques. Since these systems are fronted by a high-level,
>     >>> declarative language (SQL), their users are shielded from the
>     >>> complexities of parallel programming. Parallel database systems have
>     >>> been an extremely successful application of parallel computing, and
>     >>> quite a number of commercial products exist today.
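
[Editorial note] The hash partitioning and partitioned-parallel processing described in the paragraph above can be sketched roughly as follows. This is an illustration only, not AsterixDB or Hyracks code: the function names, the choice of CRC32 as the hash, and the in-process "cluster" (a list of lists) are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # CRC32 is used because it is deterministic across runs
    # (Python's built-in hash() for strings is salted per process).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    # Spread records over partitions; each partition stands in for one node.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[partition_of(rec[key_field], num_partitions)].append(rec)
    return partitions


def count_by_key(partitions, key_field):
    # Each partition is aggregated independently (sequentially here; a real
    # system would run these in parallel). All records with the same key
    # land in the same partition, so local results never overlap and can
    # be merged with a plain dictionary update.
    result = {}
    for part in partitions:
        local = defaultdict(int)
        for rec in part:
            local[rec[key_field]] += 1
        result.update(local)
    return result
```

Partitioning on the grouping key is what lets each node answer its share of the query without talking to the others.
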
>     >>>
>     >>> In the distributed systems world, the Web brought a need to index and
>     >>> query its huge content. SQL and relational databases were not the
>     >>> answer, though shared-nothing clusters again emerged as the hardware
>     >>> platform of choice. Google developed the Google File System (GFS) and
>     >>> MapReduce programming model to allow programmers to store and process
>     >>> Big Data by writing a few user-defined functions. The MapReduce
>     >>> framework applies these functions in parallel to data instances in
>     >>> distributed files (map) and to sorted groups of instances sharing a
>     >>> common key (reduce) -- not unlike the partitioned parallelism in
>     >>> parallel database systems. Apache's Hadoop MapReduce platform is the
>     >>> most prominent implementation of this paradigm for the rest of the
>     >>> Big Data community. On top of Hadoop and HDFS sit declarative
>     >>> languages like Pig and Hive that each compile down to Hadoop
>     >>> MapReduce jobs.
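
[Editorial note] The map and reduce steps described in the paragraph above can be illustrated with a single-process word count. This is a minimal sketch of the programming model only; it does not use Hadoop, and `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names chosen for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    # User-defined map: emit (word, 1) for every word in one input record.
    for word in line.split():
        yield (word.lower(), 1)


def reduce_fn(key, values):
    # User-defined reduce: combine all values that share one key.
    return (key, sum(values))


def run_mapreduce(lines, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for line in lines for pair in map_fn(line)]
    # Shuffle/sort phase: bring pairs with the same key together.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply reduce_fn to each sorted group.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]
```

The programmer writes only `map_fn` and `reduce_fn`; a real framework would distribute the three phases across a cluster.
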
>     >>>
>     >>> The big Web companies were also challenged by extreme user bases
>     >>> (100s of millions of users) and needed fast simple lookups and
>     >>> updates to very large keyed data sets like user profiles. SQL
>     >>> databases were deemed either too expensive or not scalable, so the
>     >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>     >>> popular key-value stores, in this space. MongoDB and Couchbase are
>     >>> other open source alternatives (document stores).
>     >>>
>     >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>     >>> as well as the strong demand for Big Data analytics engines today,
>     >>> that there is a strong (and growing!) need to store, process, *and*
>     >>> query large volumes of semi-structured data in many application
>     >>> areas. Until very recently, developers have had to "choose" between
>     >>> using big data analytics engines like Apache Hive or Apache Spark,
>     >>> which can do complex query processing and analysis over HDFS-resident
>     >>> files, and flexible but low-function data stores like MongoDB or
>     >>> Apache HBase. (The Apache Phoenix project,
>     >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>     >>> aims to bridge between these choices.)
>     >>>
>     >>> AsterixDB is a highly scalable data management system that can store,
>     >>> index, and manage semi-structured data, e.g., much like MongoDB, but
>     >>> it also supports a full-power query language with the expressiveness
>     >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>     >>> stores and manages data, so AsterixDB can exploit its knowledge of
>     >>> data partitioning and the availability of indexes to avoid always
>     >>> scanning data set(s) to process queries. Somewhat surprisingly, there
>     >>> is no open source parallel database system (relational or otherwise)
>     >>> available to developers today -- AsterixDB aims to fill this need.
>     >>> Since Apache is where the majority of today's most important Big
>     >>> Data technologies live, the ASF seems like the obvious home for a
>     >>> system like AsterixDB.
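
[Editorial note] The point above about using indexes to avoid scanning whole data sets can be sketched as follows. This is an illustration under stated assumptions, not how AsterixDB implements indexing: a sorted list searched with binary search stands in for a B+ tree, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def scan_lookup(records, field, value):
    # Full scan: every record is inspected, whether or not it matches.
    return [rec for rec in records if rec[field] == value]


def build_index(records, field):
    # Secondary index: sorted (key, position) entries, split into two
    # parallel lists so the keys can be binary-searched.
    entries = sorted((rec[field], pos) for pos, rec in enumerate(records))
    keys = [key for key, _ in entries]
    positions = [pos for _, pos in entries]
    return keys, positions


def index_lookup(records, index, value):
    # Indexed lookup: binary search finds the matching key range, so only
    # the matching records are fetched.
    keys, positions = index
    lo = bisect_left(keys, value)
    hi = bisect_right(keys, value)
    return [records[pos] for pos in positions[lo:hi]]
```

Both lookups return the same records; the indexed one touches only the matching entries, which is the advantage a system that manages its own storage has over an engine that reads raw files.
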
>     >>>
>     >>> Current Status
>     >>>
>     >>> The current version of AsterixDB was co-developed by a team of
>     >>> faculty, staff, and students at UC Irvine and UC Riverside. The
>     >>> project was initiated as a large NSF-sponsored project in 2009, the
>     >>> goal of which was to combine the best ideas from the parallel
>     >>> database world, the then new Hadoop world, and the semi-structured
>     >>> (e.g., XML/JSON) data world in order to create a next-generation
>     >>> BDMS. A first informal open source release was made four years later,
>     >>> in June of 2013, under the Apache Software License 2.0.
>     >>>
>     >>>
>     >>> Meritocracy
>     >>>
>     >>> The current developers are familiar with meritocratic open source
>     >>> development at Apache. Apache was chosen specifically because we want
>     >>> to encourage this style of development for the project.
>     >>>
>     >>>
>     >>> Community
>     >>>
>     >>> While AsterixDB started as a university project, it has developed into
>     >>> a community. A number of the initial committers started contributing
>     >>> in academia and continue to actively participate and contribute after
>     >>> graduation. We also seek to further develop the developer and user
>     >>> communities. One ongoing way of broadening the community is through
>     >>> academic collaborations (currently with IIT Mumbai in India and TU
>     >>> Berlin in Germany). During incubation we will also explicitly seek
>     >>> increased industrial participation.
>     >>>
>     >>> Some indicators of the effort's development community and history can
>     >>> be found at:
>     >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
>     >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>     >>>
>     >>>
>     >>> Core Developers
>     >>>
>     >>> The core developers of the project are diverse, although initially UC
>     >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>     >>> other 50% are from other academic institutions (UC Riverside and the
>     >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>     >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>     >>>
>     >>>
>     >>> Alignment
>     >>>
>     >>> Apache is, by far, the most natural home for taking the AsterixDB
>     >>> project forward. A large fraction of today's top Big Data
>     >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>     >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>     >>> significant gap -- the parallel data management system gap -- that
>     >>> exists in the Big Data open source world. It is well-aligned with a
>     >>> number of the Apache projects, e.g., it has strong support for
>     >>> accessing and indexing external data in HDFS, and it uses YARN as an
>     >>> answer to basic cluster resource management. AsterixDB also seeks to
>     >>> achieve an Apache-style development model; it is seeking a broader
>     >>> community of contributors and users in order to achieve its full
>     >>> potential and value to the Big Data community.
>     >>>
>     >>> There are also a number of related Apache projects and dependencies
>     >>> that will be mentioned below in the Relationships with Other Apache
>     >>> Products section.
>     >>>
>     >>>
>     >>> Known Risks
>     >>>
>     >>> Orphaned products
>     >>>
>     >>> Given the current level of intellectual investment in AsterixDB, the
>     >>> risk of the project being abandoned is very small. The UCI/UCR
>     >>> faculty team leads are highly incentivized to continue development
>     >>> since the database groups at UC Irvine and UC Riverside are both
>     >>> reliant on AsterixDB as a platform for long-term graduate research
>     >>> projects. UC San Diego is also beginning to contribute to the code
>     >>> base, and a collaboration involving public health applications is
>     >>> forming with UCLA. The work on AsterixDB is managed via a mix of
>     >>> mailing list discussions supplemented by weekly project status
>     >>> meetings which are summarized on the mailing list. Typical (local
>     >>> plus Skype-in) attendance at the weekly status meetings runs at about
>     >>> 20 active contributors.
>     >>>
>     >>>
>     >>> Inexperience with Open Source
>     >>>
>     >>> AsterixDB and Hyracks were completely developed in Open Source under
>     >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>     >>> lists are available on Google Code and discussions and decisions
>     >>> happen on the mailing lists (which is necessary due to the geographic
>     >>> distribution of the current developers).
>     >>>
>     >>> Also a few of the initial committers have contributed to Apache
>     >>> projects. Vinayak Borkar is a committer on the Apache Helix and
>     >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>     >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>     >>> on the Apache VXQuery project.
>     >>>
>     >>>
>     >>> Relationships with Other Apache Products
>     >>>
>     >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>     >>> is also included in the AsterixDB code base.
>     >>>
>     >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>     >>> is support for accessing external data in HDFS (and Hive formats),
>     >>> and resource management and system administration features are in the
>     >>> process of being migrated to YARN.
>     >>>
>     >>> AsterixDB's AQL query facilities offer comparable query power to
>     >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>     >>> differs in storing and indexing data and thus being able to quickly
>     >>> answer small and medium queries without large HDFS data scans -
>     >>> thereby targeting a different class of use cases.
>     >>>
>     >>> AsterixDB's data storage and indexing facilities are similar to those
>     >>> of HBase, but AsterixDB differs in being a much more complete and
>     >>> queryable BDMS (not just a key-value style store).
>     >>>
>     >>> AsterixDB's target use cases are not in-memory processing or
>     >>> iterative algorithm support, making AsterixDB complementary to the
>     >>> Apache Spark platform. (Spark interoperability is on our longer-term
>     >>> to-do wishlist.)
>     >>>
>     >>>
>     >>> Homogeneous Developers
>     >>>
>     >>> As mentioned before the current community is already organizationally
>     >>> and geographically distributed - and we would like to increase the
>     >>> heterogeneity.
>     >>>
>     >>>
>     >>> Reliance on Salaried Developers
>     >>>
>     >>> Of the initial committers only 3 are full-time UCI staff. The other
>     >>> committers are a mix of students, alumni who continue to contribute
>     >>> to the effort, and individuals working with permission part-time (or
>     >>> in spare time) on this project.
>     >>>
>     >>>
>     >>> An Excessive Fascination with the Apache Brand
>     >>>
>     >>> We believe in the processes, systems, and framework Apache has put in
>     >>> place. Apache is also known to foster a great community around their
>     >>> projects and provide exposure. While brand is important, our
>     >>> fascination with it is not excessive. We believe that the ASF is the
>     >>> right home for AsterixDB and that having AsterixDB inside of the ASF
>     >>> will lead to a better long-term outcome for the Big Data community.
>     >>>
>     >>>
>     >>> Documentation
>     >>>
>     >>> Documentation and publications related to AsterixDB can be found at
>     >>> http://asterixdb.ics.uci.edu/.
>     >>>
>     >>>
>     >>> Initial Source
>     >>>
>     >>> Current source resides in Google code:
>     >>> https://code.google.com/p/asterixdb/ (query language and upper system
>     >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>     >>> system and storage management libraries).
>     >>>
>     >>>
>     >>> External Dependencies
>     >>>
>     >>> AsterixDB depends on a number of Apache projects:
>     >>>
>     >>> - Ant
>     >>> - Avro
>     >>> - ApacheDB JDO
>     >>> - Commons
>     >>> - Derby
>     >>> - Hadoop
>     >>> - Hive
>     >>> - HTTPComponents
>     >>> - Jakarta ORO
>     >>> - Maven
>     >>> - Tomcat
>     >>> - Thrift
>     >>> - Velocity
>     >>> - Wicket
>     >>> - Xerces
>     >>>
>     >>> and other open source projects (organized by license):
>     >>>
>     >>> -- ASL 2.0:
>     >>> - Jackson
>     >>> - Google Guava
>     >>> - Google Guice
>     >>> - JSON-simple
>     >>> - BoneCP
>     >>> - Microsoft Azure SDK
>     >>> - Netty
>     >>> - Rome
>     >>> - JetS3t
>     >>> - Groovy
>     >>> - Jettison
>     >>> - Plexus
>     >>> - Datanucleus (JDO)
>     >>> - Jetty
>     >>> - Twitter4J
>     >>> - Snappy-java
>     >>>
>     >>> -- BSD:
>     >>> - Antlr
>     >>> - ObjectWeb ASM
>     >>> - Protobuf
>     >>> - JSCH
>     >>> - JavaCC
>     >>> - Paranamer
>     >>> - JLine
>     >>> - Stax
>     >>> - StringTemplate
>     >>> - xmlEnc
>     >>>
>     >>> -- MIT
>     >>> - AppAssembler
>     >>> - SimpleLog4J
>     >>>
>     >>> -- CDDL 1.0
>     >>> - Java Activation Framework
>     >>> - Java Transactions
>     >>> - Java Servlet API
>     >>> - Grizzly
>     >>> - gmbal
>     >>> - Glassfish
>     >>>
>     >>> -- CDDL 1.1
>     >>> - Jersey
>     >>> - JAXB Reference Implementation
>     >>>
>     >>> -- JSON License
>     >>> - JSON
>     >>>
>     >>> -- EPL 1.0
>     >>> - JUnit
>     >>>
>     >>> -- JDOM License
>     >>> - JDOM
>     >>>
>     >>> -- Public Domain
>     >>> - xz
>     >>> - AOPAlliance
>     >>>
>     >>> As all dependencies are managed using Apache Maven, none of the
>     >>> external libraries need to be packaged in a source distribution.
>     >>>
>     >>>
>     >>> Required Resources
>     >>>
>     >>> Developer and user mailing lists
>     >>>
>     >>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>     >>> commits@asterixdb.incubator.apache.org
>     >>> dev@asterixdb.incubator.apache.org
>     >>> users@asterixdb.incubator.apache.org
>     >>>
>     >>>
>     >>> A git repository
>     >>>
>     >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>     >>>
>     >>>
>     >>> A JIRA issue tracker
>     >>>
>     >>> https://issues.apache.org/jira/browse/ASTERIXDB
>     >>>
>     >>>
>     >>> Initial Committers
>     >>>
>     >>> The following is a list of the planned initial Apache committers (the
>     >>> active subset of the committers for the current repository at Google
>     >>> code).
>     >>>
>     >>> Abdullah Alamoudi (bamousaa@gmail.com)
>     >>> Cameron Samak (eufery@gmail.com)
>     >>> Chen Li (chenli@gmail.com)
>     >>> Ian Maxon (imaxon@uci.edu)
>     >>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>     >>> Jianfeng Jia (jianfeng.jia@gmail.com)
>     >>> Keren Ouaknine (kereno@gmail.com)
>     >>> Markus Dreseler (apache@dreseler.de)
>     >>> Mike Carey (dtabass@apache.org)
>     >>> Murtadha Hubail (hubailmor@gmail.com)
>     >>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>     >>> Preston Carman (prestonc@apache.org)
>     >>> Raman Grover (RamanGrover29@gmail.com)
>     >>> Sattam Alsubaiee (salsubaiee@gmail.com)
>     >>> Steven Jacobs (sjaco002@apache.org)
>     >>> Taewoo Kim (wangsaeu@gmail.com)
>     >>> Till Westmann (tillw@apache.org)
>     >>> Vinayak Borkar (vinayakb@apache.org)
>     >>> Yingyi Bu (buyingyi@gmail.com)
>     >>> Young-Seok Kim (kisskys@gmail.com)
>     >>> Zach Heilbron (zheilbron@gmail.com)
>     >>>
>     >>>
>     >>> Affiliations
>     >>>
>     >>> UC Irvine
>     >>> - Mike Carey
>     >>> - Chen Li
>     >>> - Ian Maxon
>     >>> - Yingyi Bu
>     >>> - Raman Grover
>     >>> - Pouria Pirzadeh
>     >>> - Young-Seok Kim
>     >>> - Cameron Samak
>     >>> - Taewoo Kim
>     >>> - Jianfeng Jia
>     >>> - Murtadha Hubail
>     >>> - Markus Dreseler
>     >>>
>     >>> UC Riverside
>     >>> - Ildar Absalyamov
>     >>> - Preston Carman
>     >>> - Steven Jacobs
>     >>>
>     >>> Hebrew University
>     >>> - Keren Ouaknine
>     >>>
>     >>> Oracle
>     >>> - Till Westmann
>     >>>
>     >>> X15 Software
>     >>> - Vinayak Borkar
>     >>> - Zach Heilbron
>     >>>
>     >>> KACST Saudi Arabia
>     >>> - Sattam Alsubaiee
>     >>>
>     >>> Saudi Aramco
>     >>> - Abdullah Alamoudi
>     >>>
>     >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>     >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>     >>> non-UC committers are a mix of alumni who continue to contribute to
>     >>> the effort and individuals working with permission part-time (or in
>     >>> spare time) on this project.
>     >>>
>     >>>
>     >>> Sponsors
>     >>>
>     >>> Champion
>     >>>
>     >>> Chris Mattmann (NASA/JPL)
>     >>>
>     >>> Nominated Mentors
>     >>>
>     >>> TBD
>     >>>
>     >>> Sponsoring Entity
>     >>>
>     >>> The Apache Incubator
>     >>>
>     >>>
>     >>>
>     >>>
>     >>>
>     >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >>> Chris Mattmann, Ph.D.
>     >>> Chief Architect
>     >>> Instrument Software and Science Data Systems Section (398)
>     >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     >>> Office: 168-519, Mailstop: 168-527
>     >>> Email: chris.a.mattmann@nasa.gov
>     >>> WWW: http://sunset.usc.edu/~mattmann/
>     >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >>> Adjunct Associate Professor, Computer Science Department
>     >>> University of Southern California, Los Angeles, CA 90089 USA
>     >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >>>
>     >>>
>     >>>
>     >>>
>     >
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>     For additional commands, e-mail: general-help@incubator.apache.org
>
>

