Message-ID: <54BDB38F.8020801@gmail.com>
Date: Mon, 19 Jan 2015 17:46:55 -0800
From: Mike Carey
MIME-Version: 1.0
To: Till Westmann, Henry Saputra
CC: "general@incubator.apache.org", Ian Maxon
Subject: Re: [PROPOSAL] Apache AsterixDB Incubator
Content-Type: multipart/alternative; boundary="------------000004060605060905020103"

--------------000004060605060905020103
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Indeed - thanks!!

Cheers,
Mike

On 1/19/15 5:28 PM, Till Westmann wrote:
> Hi Henry,
>
> thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>
> Even if your time is very limited we would be very happy to have you on board as a mentor.
> I’ll add you to the proposal.
>
> Cheers,
> Till
>
>> On Jan 19, 2015, at 10:26 AM, Henry Saputra wrote:
>>
>> +1 This is GREAT News!
>>
>> Was watching and trying AsterixDB last year and looked in awesome shape.
>>
>> I have my plate full but would love to help mentor this project to get
>> it going to ASF if needed!
>>
>> - Henry
>>
>> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>> wrote:
>>> Hi Folks,
>>>
>>> I am pleased to bring forth the Apache AsterixDB proposal to the
>>> Apache Incubator as Champion, working in collaboration with the
>>> team.
>>> Please find the wiki proposal here:
>>>
>>> https://wiki.apache.org/incubator/AsterixDBProposal
>>>
>>>
>>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>>> leave the discussion open for a week, and then look to call a VOTE
>>> hopefully end of next week if all is well.
>>>
>>> Cheers!
>>> Chris Mattmann
>>>
>>> =============================================================
>>> Apache AsterixDB Proposal
>>>
>>> Abstract
>>>
>>> Apache AsterixDB is a scalable big data management system (BDMS) that
>>> provides storage, management, and query capabilities for large
>>> collections of semi-structured data.
>>>
>>> Proposal
>>>
>>> AsterixDB is a big data management system (BDMS) whose feature set
>>> makes it well-suited to needs such as web data warehousing and social
>>> data storage and analysis. Feature-wise, AsterixDB has:
>>>
>>> * A NoSQL style data model (ADM) based on extending JSON with object
>>> database concepts.
>>> * An expressive and declarative query language (AQL) for querying
>>> semi-structured data.
>>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>>> execution of query plans.
>>> * Partitioned LSM-based data storage and indexing for efficient
>>> ingestion of newly arriving data.
>>> * Support for querying and indexing external data (e.g., in HDFS) as
>>> well as data stored within AsterixDB.
>>> * A rich set of primitive data types, including support for spatial,
>>> temporal, and textual data.
>>> * Indexing options that include B+ trees, R trees, and inverted
>>> keyword index support.
>>> * Basic transactional (concurrency and recovery) capabilities akin to
>>> those of a NoSQL store.
>>>
>>>
>>> Background and Rationale
>>>
>>> In the world of relational databases, the need to tackle data volumes
>>> that exceed the capabilities of a single server led to the
>>> development of “shared-nothing” parallel database systems several
>>> decades ago.
>>> These systems spread data over a cluster based on a
>>> partitioning strategy, such as hash partitioning, and queries are
>>> processed by employing partitioned-parallel divide-and-conquer
>>> techniques. Since these systems are fronted by a high-level,
>>> declarative language (SQL), their users are shielded from the
>>> complexities of parallel programming. Parallel database systems have
>>> been an extremely successful application of parallel computing, and
>>> quite a number of commercial products exist today.
>>>
>>> In the distributed systems world, the Web brought a need to index and
>>> query its huge content. SQL and relational databases were not the
>>> answer, though shared-nothing clusters again emerged as the hardware
>>> platform of choice. Google developed the Google File System (GFS) and
>>> MapReduce programming model to allow programmers to store and process
>>> Big Data by writing a few user-defined functions. The MapReduce
>>> framework applies these functions in parallel to data instances in
>>> distributed files (map) and to sorted groups of instances sharing a
>>> common key (reduce) -- not unlike the partitioned parallelism in
>>> parallel database systems. Apache's Hadoop MapReduce platform is the
>>> most prominent implementation of this paradigm for the rest of the
>>> Big Data community. On top of Hadoop and HDFS sit declarative
>>> languages like Pig and Hive that each compile down to Hadoop
>>> MapReduce jobs.
>>>
>>> The big Web companies were also challenged by extreme user bases
>>> (100s of millions of users) and needed fast simple lookups and
>>> updates to very large keyed data sets like user profiles. SQL
>>> databases were deemed either too expensive or not scalable, so the
>>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>>> popular key-value stores, in this space. MongoDB and Couchbase are
>>> other open source alternatives (document stores).
>>>
>>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>>> as well as the strong demand for Big Data analytics engines today,
>>> that there is a strong (and growing!) need to store, process, *and*
>>> query large volumes of semi-structured data in many application
>>> areas. Until very recently, developers have had to "choose" between
>>> using big data analytics engines like Apache Hive or Apache Spark,
>>> which can do complex query processing and analysis over HDFS-resident
>>> files, and flexible but low-function data stores like MongoDB or
>>> Apache HBase. (The Apache Phoenix project,
>>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>>> aims to bridge between these choices.)
>>>
>>> AsterixDB is a highly scalable data management system that can store,
>>> index, and manage semi-structured data, e.g., much like MongoDB, but
>>> it also supports a full-power query language with the expressiveness
>>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>>> stores and manages data, so AsterixDB can exploit its knowledge of
>>> data partitioning and the availability of indexes to avoid always
>>> scanning data set(s) to process queries. Somewhat surprisingly, there
>>> is no open source parallel database system (relational or otherwise)
>>> available to developers today -- AsterixDB aims to fill this need.
>>> Since Apache is where the majority of today's most important Big
>>> Data technologies live, the ASF seems like the obvious home for a
>>> system like AsterixDB.
>>>
>>> Current Status
>>>
>>> The current version of AsterixDB was co-developed by a team of
>>> faculty, staff, and students at UC Irvine and UC Riverside.
>>> The project was initiated as a large NSF-sponsored project in 2009, the
>>> goal of which was to combine the best ideas from the parallel
>>> database world, the then new Hadoop world, and the semi-structured
>>> (e.g., XML/JSON) data world in order to create a next-generation
>>> BDMS. A first informal open source release was made four years later,
>>> in June of 2013, under the Apache Software License 2.0.
>>>
>>>
>>> Meritocracy
>>>
>>> The current developers are familiar with meritocratic open source
>>> development at Apache. Apache was chosen specifically because we want
>>> to encourage this style of development for the project.
>>>
>>>
>>> Community
>>>
>>> While AsterixDB started as a university project it has developed into
>>> a community. A number of the initial committers started contributing
>>> in academia and continue to actively participate and contribute after
>>> graduation. We also seek to further develop the developer and user
>>> communities. One ongoing way to broaden the community is
>>> through academic collaborations (currently with IIT Mumbai in India
>>> and TU Berlin in Germany). During incubation we will also explicitly
>>> seek increased industrial participation.
>>>
>>> Some indicators of the effort's development community and history can
>>> be found at:
>>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
>>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>>
>>>
>>> Core Developers
>>>
>>> The core developers of the project are diverse, although initially UC
>>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>>> other 50% are from other academic institutions (UC Riverside and the
>>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>>
>>>
>>> Alignment
>>>
>>> Apache is, by far, the most natural home for taking the AsterixDB
>>> project forward.
>>> A large fraction of today's top Big Data
>>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>>> significant gap -- the parallel data management system gap -- that
>>> exists in the Big Data open source world. It is well-aligned with a
>>> number of the Apache projects, e.g., it has strong support for
>>> accessing and indexing external data in HDFS, and it uses YARN as an
>>> answer to basic cluster resource management. AsterixDB also seeks to
>>> achieve an Apache-style development model; it is seeking a broader
>>> community of contributors and users in order to achieve its full
>>> potential and value to the Big Data community.
>>>
>>> There are also a number of related Apache projects and dependencies
>>> that will be mentioned below in the Relationships with Other Apache
>>> Products section.
>>>
>>>
>>> Known Risks
>>>
>>> Orphaned products
>>>
>>> Given the current level of intellectual investment in AsterixDB, the
>>> risk of the project being abandoned is very small. The UCI/UCR
>>> faculty team leads are highly incentivized to continue development
>>> since the database groups at UC Irvine and UC Riverside are both
>>> reliant on AsterixDB as a platform for long-term graduate research
>>> projects. UC San Diego is also beginning to contribute to the code
>>> base, and a collaboration involving public health applications is
>>> forming with UCLA. The work on AsterixDB is managed via a mix of
>>> mailing list discussions supplemented by weekly project status
>>> meetings which are summarized on the mailing list. Typical (local
>>> plus Skype-in) attendance at the weekly status meetings runs at about
>>> 20 active contributors.
>>>
>>>
>>> Inexperience with Open Source
>>>
>>> AsterixDB and Hyracks were completely developed in Open Source under
>>> the ASL 2.0.
>>> The source code repositories, issue tracker, and mailing
>>> lists are available on Google Code and discussions and decisions
>>> happen on the mailing lists (which is necessary due to the geographic
>>> distribution of the current developers).
>>>
>>> Also, a few of the initial committers have contributed to Apache
>>> projects. Vinayak Borkar is a committer on the Apache Helix and
>>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>>> on the Apache VXQuery project.
>>>
>>>
>>> Relationships with Other Apache Products
>>>
>>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>>> is also included in the AsterixDB code base.
>>>
>>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>>> is support for accessing external data in HDFS (and Hive formats),
>>> and resource management and system administration features are in the
>>> process of being migrated to YARN.
>>>
>>> AsterixDB's AQL query facilities offer comparable query power to
>>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>>> differs in storing and indexing data and thus being able to quickly
>>> answer small and medium queries without large HDFS data scans -
>>> thereby targeting a different class of use cases.
>>>
>>> AsterixDB's data storage and indexing facilities are similar to those
>>> of HBase, but AsterixDB differs in being a much more complete and
>>> queryable BDMS (not just a key-value style store).
>>>
>>> AsterixDB's target use cases are not in-memory processing or
>>> iterative algorithm support, making AsterixDB complementary to the
>>> Apache Spark platform. (Spark interoperability is on our longer-term
>>> to-do wishlist.)
>>>
>>>
>>> Homogeneous Developers
>>>
>>> As mentioned before, the current community is already organizationally
>>> and geographically distributed - and we would like to increase the
>>> heterogeneity.
>>>
>>>
>>> Reliance on Salaried Developers
>>>
>>> Of the initial committers only 3 are full-time UCI staff. The other
>>> committers are a mix of students, alumni who continue to contribute
>>> to the effort, and individuals working with permission part-time (or
>>> in spare time) on this project.
>>>
>>>
>>> An Excessive Fascination with the Apache Brand
>>>
>>> We believe in the processes, systems, and framework Apache has put in
>>> place. Apache is also known to foster a great community around their
>>> projects and provide exposure. While brand is important, our
>>> fascination with it is not excessive. We believe that the ASF is the
>>> right home for AsterixDB and that having AsterixDB inside of the ASF
>>> will lead to a better long-term outcome for the Big Data community.
>>>
>>>
>>> Documentation
>>>
>>> Documentation and publications related to AsterixDB can be found at
>>> http://asterixdb.ics.uci.edu/.
>>>
>>>
>>> Initial Source
>>>
>>> Current source resides in Google Code:
>>> https://code.google.com/p/asterixdb/ (query language and upper system
>>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>>> system and storage management libraries).
>>>
>>>
>>> External Dependencies
>>>
>>> AsterixDB depends on a number of Apache projects:
>>>
>>> - Ant
>>> - Avro
>>> - ApacheDB JDO
>>> - Commons
>>> - Derby
>>> - Hadoop
>>> - Hive
>>> - HTTPComponents
>>> - Jakarta ORO
>>> - Maven
>>> - Tomcat
>>> - Thrift
>>> - Velocity
>>> - Wicket
>>> - Xerces
>>>
>>> and other open source projects (organized by license):
>>>
>>> -- ASL 2.0:
>>> - Jackson
>>> - Google Guava
>>> - Google Guice
>>> - JSON-simple
>>> - BoneCP
>>> - Microsoft Azure SDK
>>> - Netty
>>> - Rome
>>> - JetS3t
>>> - Groovy
>>> - Jettison
>>> - Plexus
>>> - Datanucleus (JDO)
>>> - Jetty
>>> - Twitter4J
>>> - Snappy-java
>>>
>>> -- BSD:
>>> - Antlr
>>> - ObjectWeb ASM
>>> - Protobuf
>>> - JSCH
>>> - JavaCC
>>> - Paranamer
>>> - JLine
>>> - Stax
>>> - StringTemplate
>>> - xmlEnc
>>>
>>> -- MIT
>>> - AppAssembler
>>> - SimpleLog4J
>>>
>>> -- CDDL 1.0
>>> - Java Activation Framework
>>> - Java Transactions
>>> - Java Servlet API
>>> - Grizzly
>>> - gmbal
>>> - Glassfish
>>>
>>> -- CDDL 1.1
>>> - Jersey
>>> - JAXB Reference Implementation
>>>
>>> -- JSON License
>>> - JSON
>>>
>>> -- EPL 1.0
>>> - JUnit
>>>
>>> -- JDOM License
>>> - JDOM
>>>
>>> -- Public Domain
>>> - xz
>>> - AOPAlliance
>>>
>>> As all dependencies are managed using Apache Maven, none of the
>>> external libraries need to be packaged in a source distribution.
>>>
>>>
>>> Required Resources
>>>
>>> Developer and user mailing lists
>>>
>>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>>> commits@asterixdb.incubator.apache.org
>>> dev@asterixdb.incubator.apache.org
>>> users@asterixdb.incubator.apache.org
>>>
>>>
>>> A git repository
>>>
>>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>>
>>>
>>> A JIRA issue tracker
>>>
>>> https://issues.apache.org/jira/browse/ASTERIXDB
>>>
>>>
>>> Initial Committers
>>>
>>> The following is a list of the planned initial Apache committers (the
>>> active subset of the committers for the current repository at Google
>>> Code).
>>>
>>> Abdullah Alamoudi (bamousaa@gmail.com)
>>> Cameron Samak (eufery@gmail.com)
>>> Chen Li (chenli@gmail.com)
>>> Ian Maxon (imaxon@uci.edu)
>>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>>> Jianfeng Jia (jianfeng.jia@gmail.com)
>>> Keren Ouaknine (kereno@gmail.com)
>>> Markus Dreseler (apache@dreseler.de)
>>> Mike Carey (dtabass@apache.org)
>>> Murtadha Hubail (hubailmor@gmail.com)
>>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>>> Preston Carman (prestonc@apache.org)
>>> Raman Grover (RamanGrover29@gmail.com)
>>> Sattam Alsubaiee (salsubaiee@gmail.com)
>>> Steven Jacobs (sjaco002@apache.org)
>>> Taewoo Kim (wangsaeu@gmail.com)
>>> Till Westmann (tillw@apache.org)
>>> Vinayak Borkar (vinayakb@apache.org)
>>> Yingyi Bu (buyingyi@gmail.com)
>>> Young-Seok Kim (kisskys@gmail.com)
>>> Zach Heilbron (zheilbron@gmail.com)
>>>
>>>
>>> Affiliations
>>>
>>> UC Irvine
>>> - Mike Carey
>>> - Chen Li
>>> - Ian Maxon
>>> - Yingyi Bu
>>> - Raman Grover
>>> - Pouria Pirzadeh
>>> - Young-Seok Kim
>>> - Cameron Samak
>>> - Taewoo Kim
>>> - Jianfeng Jia
>>> - Murtadha Hubail
>>> - Markus Dreseler
>>>
>>> UC Riverside
>>> - Ildar Absalyamov
>>> - Preston Carman
>>> - Steven Jacobs
>>>
>>> Hebrew University
>>> - Keren Ouaknine
>>>
>>> Oracle
>>> - Till Westmann
>>>
>>> X15 Software
>>> - Vinayak Borkar
>>> - Zach Heilbron
>>>
>>> KACST Saudi Arabia
>>> - Sattam Alsubaiee
>>>
>>> Saudi Aramco
>>> - Abdullah Alamoudi
>>>
>>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>>> non-UC committers are a mix of alumni who continue to contribute to
>>> the effort and individuals working with permission part-time (or in
>>> spare time) on this project.
>>>
>>>
>>> Sponsors
>>>
>>> Champion
>>>
>>> Chris Mattmann (NASA/JPL)
>>>
>>> Nominated Mentors
>>>
>>> TBD
>>>
>>> Sponsoring Entity
>>>
>>> The Apache Incubator
>>>
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>

--------------000004060605060905020103--