incubator-general mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject [RESULT][VOTE] Accept MRQL into the Incubator
Date Tue, 12 Mar 2013 09:12:30 GMT
The VOTE has passed with 5 binding +1s and no -1s.

I'll start the work to get the podling started.

Thank you.

On Sun, Mar 10, 2013 at 3:48 AM, Mattmann, Chris A (388J)
<chris.a.mattmann@jpl.nasa.gov> wrote:
> +1 from me (binding).
>
> Good luck!
>
> Cheers,
> Chris
>
>
> On 3/6/13 9:04 AM, "Leonidas Fegaras" <fegaras@cse.uta.edu> wrote:
>
>>Dear ASF members,
>>I would like to call for a VOTE for acceptance of MRQL into the
>>Incubator.
>>The vote will close on Monday March 11, 2013.
>>
>>[ ] +1 Accept MRQL into the Apache incubator
>>[ ] +0 Don't care.
>>[ ] -1 Don't accept MRQL into the incubator because...
>>
>>Full proposal is pasted below and the corresponding wiki is
>>
>>http://wiki.apache.org/incubator/MRQLProposal
>>
>>Only VOTEs from Incubator PMC members are binding,
>>but all are welcome to express their thoughts.
>>Sincerely,
>>Leonidas Fegaras
>>
>>
>>= Abstract =
>>
>>MRQL is a query processing and optimization system for large-scale,
>>distributed data analysis, built on top of Apache Hadoop and Hama.
>>
>>= Proposal =
>>
>>MRQL (pronounced ''miracle'') is a query processing and optimization
>>system for large-scale, distributed data analysis. MRQL (the MapReduce
>>Query Language) is an SQL-like query language for large-scale data
>>analysis on a cluster of computers. The MRQL query processing system
>>can evaluate MRQL queries in two modes: in MapReduce mode on top of
>>Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of
>>Apache Hama. The MRQL query language is powerful enough to express
>>most common data analysis tasks over many forms of raw ''in-situ''
>>data, such as XML and JSON documents, binary files, and CSV
>>documents. MRQL is more powerful than other current high-level
>>MapReduce languages, such as Hive and PigLatin, since it can operate
>>on more complex data and supports more powerful query constructs, thus
>>eliminating the need for using explicit MapReduce code. With MRQL,
>>users will be able to express complex data analysis tasks, such as
>>PageRank, k-means clustering, matrix factorization, etc., using
>>SQL-like queries exclusively, while the MRQL query processing system
>>will be able to compile these queries to efficient Java code.
>>
>>= Background =
>>
>>The initial code was developed at the University of Texas at Arlington
>>(UTA) by a research team, led by Leonidas Fegaras. The software was
>>first released in May 2011. The original goal of this project was to
>>build a query processing system that translates SQL-like data analysis
>>queries to efficient workflows of MapReduce jobs. A design goal was to
>>use HDFS as the physical storage layer, without any indexing, data
>>partitioning, or data normalization, and to use Hadoop (without
>>extensions) as the run-time engine. The motivation behind this work
>>was to build a platform to test new ideas on query processing and
>>optimization techniques applicable to the MapReduce framework.
>>
>>A year ago, MRQL was extended to run on Hama. The motivation for this
>>extension was that Hadoop MapReduce jobs were required to read their
>>input and write their output on HDFS. This simplifies reliability and
>>fault tolerance but it imposes a high overhead on complex MapReduce
>>workflows and graph algorithms, such as PageRank, which require
>>repetitive jobs. In addition, Hadoop does not preserve data in memory
>>across consecutive MapReduce jobs. This restriction requires data to
>>be read at every step, even when the data is constant. BSP, on the other
>>hand, does not suffer from this restriction, and, under certain
>>circumstances, allows complex repetitive algorithms to run entirely in
>>the collective memory of a cluster. Thus, the goal was to be able to
>>run the same MRQL queries in both modes, MapReduce and BSP, without
>>modifying the queries: If there are enough resources available, and
>>low latency and speed are more important than resilience, queries may
>>run in BSP mode; otherwise, the same queries may run in MapReduce
>>mode. BSP evaluation was found to be a good choice when fault
>>tolerance is not critical, data (both input and intermediate) can fit
>>in the cluster memory, and data processing requires complex/repetitive
>>steps.
>>
>>The research results of this ongoing work have already been published
>>in conferences (WebDB'11, EDBT'12, and DataCloud'12) and the authors
>>have already received positive feedback from researchers in academia
>>and industry who were attending these conferences.
>>
>>= Rationale =
>>
>>* MRQL will be the first general-purpose, SQL-like query language for
>>data analysis based on BSP.
>>Currently, many programmers prefer to code their MapReduce
>>applications in a higher-level query language, rather than an
>>algorithmic language. For instance, Pig is used for 60% of Yahoo
>>MapReduce jobs, while Hive is used for 90% of Facebook MapReduce
>>jobs. This, we believe, will also be the trend for BSP applications,
>>because, even though, in principle, the BSP model is very simple to
>>understand, it is hard to develop, optimize, and maintain non-trivial
>>BSP applications coded in a general-purpose programming
>>language. Currently, there is no widely accepted declarative BSP
>>query language, although there are a few special-purpose BSP systems
>>for graph analysis, such as Google Pregel and Apache Giraph, for
>>machine learning, such as BSML, and for scientific data analysis.
>>
>>* MRQL can capture many complex data analysis algorithms in
>>declarative form.
>>Existing MapReduce query languages, such as HiveQL and PigLatin,
>>provide a limited syntax for operating on data collections, in the
>>form of relational joins and group-bys. Because of these limitations,
>>these languages enable users to plug custom MapReduce scripts into
>>their queries for those jobs that cannot be declaratively coded in
>>their query language. This nullifies the benefits of using a
>>declarative query language and may result in suboptimal, error-prone,
>>and hard-to-maintain code. More importantly, these languages are
>>inappropriate for complex scientific applications and graph analysis,
>>because they do not directly support iteration or recursion in
>>declarative form and are not able to handle complex, nested scientific
>>data, which are often semi-structured. Furthermore, current MapReduce
>>query processors apply traditional query optimization techniques that
>>may be suboptimal in a MapReduce or BSP environment.
>>
>>* The MRQL design is modular, with pluggable distributed processing
>>back-ends, query languages, and data formats.
>>MRQL aims to be both powerful and adaptable. Although Hadoop is
>>currently the most popular framework for large-scale data analysis,
>>there are a few alternatives that are currently taking shape,
>>including frameworks based on BSP (e.g., Giraph, Pregel, Hama), MPI
>>(e.g., OpenMPI), etc. MRQL was designed so that it will
>>be easy to support other distributed processing frameworks in the
>>future. As evidence of this claim, the MRQL processor required
>>only 2K extra lines of Java code to support BSP evaluation.
>>
>>= Initial Goals =
>>
>>Some current goals include:
>>
>>* apply MRQL to graph analysis problems, such as k-means clustering
>>and PageRank
>>
>>* apply MRQL to large-scale scientific analysis (develop general
>>optimization techniques that can apply to matrix multiplication,
>>matrix factorization, etc)
>>
>>* process additional data formats, such as Avro, and column-based
>>stores, such as HBase
>>
>>* map MRQL to additional distributed processing frameworks, such as
>>Spark and OpenMPI
>>
>>* extend the front-end to process more query languages, such as
>>standard SQL, SPARQL, XQuery, and PigLatin
>>
>>= Current Status =
>>
>>The current MRQL release (version 0.8.10) is a beta release. It is
>>built on top of Hadoop and Hama (no extensions are needed). It
>>currently works on Hadoop up to 1.0.4 (but not on Yarn yet) and Hama
>>0.5.0. It has only been tested on a small cluster of 20 nodes (80
>>cores).
>>
>>== Meritocracy ==
>>
>>The initial MRQL code base was developed by Leonidas Fegaras in May
>>2011, and has been continuously improved over the years. We will
>>reach out to other potential contributors through open forums. We plan
>>to do everything possible to encourage an environment that supports a
>>meritocracy, where contributors will extend their privileges based on
>>their contribution. MRQL's modular design will facilitate the
>>strategic extensions to various modules, such as adding a standard-SQL
>>interface, introducing new optimization techniques, etc.
>>
>>== Community ==
>>
>>The interest in open-source query processing systems for analyzing
>>large datasets has steadily increased in the last few years.
>>Related Apache projects have already attracted a very large community
>>from both academia and industry. We expect that MRQL will also
>>establish an active community. Several researchers from both academia
>>and industry who are interested in using our code have already
>>contacted us.
>>
>>== Core Developers ==
>>
>>The initial core developer was Leonidas Fegaras, who wrote the
>>majority of the code. He is an associate professor at UTA, with
>>interests in cloud computing, databases, web technologies, and
>>functional programming. He has extensive knowledge and working
>>experience in building complex query processing systems for databases,
>>and compilers for functional and algorithmic programming languages.
>>
>>== Alignment ==
>>
>>MRQL is built on top of two Apache projects: Hadoop and Hama. We have
>>plans to incorporate other products from the Hadoop ecosystem, such as
>>Avro and HBase. MRQL can serve as a testbed for fine-tuning and
>>evaluating the performance of the Apache Hama system. Finally, the
>>MRQL query language and processor can be used by Apache Drill as a
>>pluggable query language.
>>
>>= Known Risks =
>>
>>== Orphaned Products ==
>>
>>The initial committer is from academia, which may be a risk, since
>>research in academia is publication-driven, rather than
>>product-driven. In academic research, a project that becomes outdated
>>and no longer produces publishable results is very often abandoned
>>in favor of new cutting-edge projects. We do not believe
>>that this will be the case for MRQL for the years to come, because it
>>can be adapted to support new query languages, new optimization
>>techniques, and new distributed back-ends, thus sustaining enough
>>research interest. Another risk is that, when graduate students who
>>write code graduate, they may leave their work undocumented and
>>unfinished. We will strive to gain enough momentum to recruit
>>additional committers from industry in order to eliminate these risks.
>>
>>== Inexperience with Open Source ==
>>
>>The initial developer has been involved with various projects whose
>>source code has been released under an open source license, but he has
>>no prior experience in contributing to open-source projects. With the
>>guidance from other more experienced committers and participants, we
>>expect that the meritocracy rules will have a positive influence on
>>this project.
>>
>>== Homogeneous Developers ==
>>
>>The initial committer comes from academia. However, given the interest
>>we have seen in the project, we expect the diversity to improve in the
>>near future.
>>
>>== Reliance on Salaried Developers ==
>>
>>So far, the MRQL code has been developed on the committer's volunteer
>>time. In the future, UTA graduate students who will do some of the
>>coding may be supported by UTA and funding agencies, such as NSF.
>>
>>== Relationships with Other Apache Products ==
>>
>>MRQL has some overlapping functionality with Hive and Tajo, which are
>>Data Warehouse systems for Hadoop, and with Drill, which is an
>>interactive data analysis system that can process nested data. MRQL
>>has a more powerful data model, in which any form of nested data, such
>>as XML and JSON, can be defined as a user-defined datatype. More
>>importantly, complex data analysis tasks, such as PageRank, k-means
>>clustering, and matrix multiplication and factorization, can be
>>expressed as short SQL-like queries, while the MRQL system is able to
>>evaluate these queries efficiently. Furthermore, the MRQL system can
>>run these queries in BSP mode, in addition to MapReduce mode, thus
>>achieving low latency and speed, which are also Drill's goals.
>>Nevertheless, we will welcome and encourage any help from these
>>projects and we will be eager to make contributions to these projects
>>too.
>>
>>== An Excessive Fascination with the Apache Brand ==
>>
>>The Apache brand is likely to help us find contributors and reach out
>>to the open-source community. Nevertheless, since MRQL depends on
>>Apache projects (Hadoop and Hama), it makes sense to have our software
>>available as part of this ecosystem.
>>
>>= Documentation =
>>
>>Information about MRQL can be found at http://lambda.uta.edu/mrql/
>>
>>= Initial Source =
>>
>>The initial MRQL code has been released as part of a research project
>>developed at the University of Texas at Arlington under the Apache 2.0
>>license for the past two years. The source code is currently hosted
>>on GitHub at: https://github.com/fegaras/mrql
>>MRQL’s release artifact would consist of a single tarball of packaging
>>and test code.
>>
>>= External Dependencies =
>>
>>The MRQL source code is already licensed under the Apache License,
>>Version 2.0. MRQL uses JLine which is distributed under the BSD
>>license.
>>
>>= Cryptography =
>>
>>Not applicable.
>>
>>= Required Resources =
>>
>>== Mailing Lists ==
>>
>>* mrql-private
>>* mrql-dev
>>* mrql-user
>>
>>== Subversion Directory ==
>>
>>* Git is the preferred source control system:
>>git://git.apache.org/mrql
>>
>>== Issue Tracking ==
>>
>>* A JIRA issue tracker, MRQL
>>
>>== Wiki ==
>>
>>* MoinMoin wiki, http://wiki.apache.org/mrql
>>
>>= Initial Committers =
>>
>>* Leonidas Fegaras <fegaras AT cse DOT uta DOT edu>
>>* Upa Gupta <upa.gupta AT mavs DOT uta DOT edu>
>>* Edward J. Yoon <edwardyoon AT apache DOT org>
>>* Maqsood Alam <maqsoodalam AT hotmail DOT com>
>>* John Hope <john.hope AT oracle DOT com>
>>* Mark Wall <mark.wall AT oracle DOT com>
>>* Kuassi Mensah <kuassi.mensah AT oracle DOT com>
>>* Ambreesh Khanna <ambreesh.khanna AT oracle DOT com>
>>* Karthik Kambatla <kasha AT cloudera DOT com>
>>
>>= Affiliations =
>>
>>* Leonidas Fegaras (University of Texas at Arlington)
>>* Upa Gupta (University of Texas at Arlington)
>>* Edward J. Yoon (Oracle corp)
>>* Maqsood Alam (Oracle corp)
>>* John Hope (Oracle corp)
>>* Mark Wall (Oracle corp)
>>* Kuassi Mensah (Oracle corp)
>>* Ambreesh Khanna (Oracle corp)
>>* Karthik Kambatla (Cloudera)
>>
>>= Sponsors =
>>
>>== Champion ==
>>
>>* Edward J. Yoon <edwardyoon AT apache DOT org>
>>
>>== Nominated Mentors ==
>>
>>* Alex Karasulu <akarasulu AT apache DOT org>
>>* Edward J. Yoon <edwardyoon AT apache DOT org>
>>
>>== Sponsoring Entity ==
>>
>>Incubator PMC
>>
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

