incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: [VOTE] Accept MRQL into the Incubator
Date Thu, 07 Mar 2013 00:13:13 GMT
+1

On Thu, Mar 7, 2013 at 2:11 AM, Tommaso Teofili
<tommaso.teofili@gmail.com> wrote:
> +1
>
> Tommaso
>
>
> 2013/3/6 Alex Karasulu <akarasulu@apache.org>
>
>> +1 (binding)
>>
>>
>> On Wed, Mar 6, 2013 at 7:04 PM, Leonidas Fegaras <fegaras@cse.uta.edu
>> >wrote:
>>
>> > Dear ASF members,
>> > I would like to call for a VOTE for acceptance of MRQL into the
>> Incubator.
>> > The vote will close on Monday March 11, 2013.
>> >
>> > [ ] +1 Accept MRQL into the Apache incubator
>> > [ ] +0 Don't care.
>> > [ ] -1 Don't accept MRQL into the incubator because...
>> >
>> > Full proposal is pasted below and the corresponding wiki is
>> >
>> > http://wiki.apache.org/**incubator/MRQLProposal<
>> http://wiki.apache.org/incubator/MRQLProposal>
>> >
>> > Only VOTEs from Incubator PMC members are binding,
>> > but all are welcome to express their thoughts.
>> > Sincerely,
>> > Leonidas Fegaras
>> >
>> >
>> > = Abstract =
>> >
>> > MRQL is a query processing and optimization system for large-scale,
>> > distributed data analysis, built on top of Apache Hadoop and Hama.
>> >
>> > = Proposal =
>> >
>> > MRQL (pronounced ''miracle'') is a query processing and optimization
>> > system for large-scale, distributed data analysis. MRQL (the MapReduce
>> > Query Language) is an SQL-like query language for large-scale data
>> > analysis on a cluster of computers. The MRQL query processing system
>> > can evaluate MRQL queries in two modes: in MapReduce mode on top of
>> > Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of
>> > Apache Hama. The MRQL query language is powerful enough to express
>> > most common data analysis tasks over many forms of raw ''in-situ''
>> > data, such as XML and JSON documents, binary files, and CSV
>> > documents. MRQL is more powerful than other current high-level
>> > MapReduce languages, such as Hive and PigLatin, since it can operate
>> > on more complex data and supports more powerful query constructs, thus
>> > eliminating the need for using explicit MapReduce code. With MRQL,
>> > users will be able to express complex data analysis tasks, such as
>> > PageRank, k-means clustering, matrix factorization, etc, using
>> > SQL-like queries exclusively, while the MRQL query processing system
>> > will be able to compile these queries to efficient Java code.
>> >
>> > = Background =
>> >
>> > The initial code was developed at the University of Texas of Arlington
>> > (UTA) by a research team, led by Leonidas Fegaras. The software was
>> > first released in May 2011. The original goal of this project was to
>> > build a query processing system that translates SQL-like data analysis
>> > queries to efficient workflows of MapReduce jobs. A design goal was to
>> > use HDFS as the physical storage layer, without any indexing, data
>> > partitioning, or data normalization, and to use Hadoop (without
>> > extensions) as the run-time engine. The motivation behind this work
>> > was to build a platform to test new ideas on query processing and
>> > optimization techniques applicable to the MapReduce framework.
>> >
>> > A year ago, MRQL was extended to run on Hama. The motivation for this
>> > extension was that Hadoop MapReduce jobs were required to read their
>> > input and write their output on HDFS. This simplifies reliability and
>> > fault tolerance but it imposes a high overhead to complex MapReduce
>> > workflows and graph algorithms, such as PageRank, which require
>> > repetitive jobs. In addition, Hadoop does not preserve data in memory
>> > across consecutive MapReduce jobs. This restriction requires to read
>> > data at every step, even when the data is constant. BSP, on the other
>> > hand, does not suffer from this restriction, and, under certain
>> > circumstances, allows complex repetitive algorithms to run entirely in
>> > the collective memory of a cluster. Thus, the goal was to be able to
>> > run the same MRQL queries in both modes, MapReduce and BSP, without
>> > modifying the queries: If there are enough resources available, and
>> > low latency and speed are more important than resilience, queries may
>> > run in BSP mode; otherwise, the same queries may run in MapReduce
>> > mode. BSP evaluation was found to be a good choice when fault
>> > tolerance is not critical, data (both input and intermediate) can fit
>> > in the cluster memory, and data processing requires complex/repetitive
>> > steps.
>> >
>> > The research results of this ongoing work have already been published
>> > in conferences (WebDB'11, EDBT'12, and DataCloud'12) and the authors
>> > have already received positive feedback from researchers in academia
>> > and industry who were attending these conferences.
>> >
>> > = Rationale =
>> >
>> > * MRQL will be the first general-purpose, SQL-like query language for
>> > data analysis based on BSP.
>> > Currently, many programmers prefer to code their MapReduce
>> > applications in a higher-level query language, rather than an
>> > algorithmic language. For instance, Pig is used for 60% of Yahoo
>> > MapReduce jobs, while Hive is used for 90% of Facebook MapReduce
>> > jobs. This, we believe, will also be the trend for BSP applications,
>> > because, even though, in principle, the BSP model is very simple to
>> > understand, it is hard to develop, optimize, and maintain non-trivial
>> > BSP applications coded in a general-purpose programming
>> > language. Currently, there is no widely acceptable declarative BSP
>> > query language, although there are a few special-purpose BSP systems
>> > for graph analysis, such as Google Pregel and Apache Giraph, for
>> > machine learning, such as BSML, and for scientific data analysis.
>> >
>> > * MRQL can capture many complex data analysis algorithms in
>> > declarative form.
>> > Existing MapReduce query languages, such as HiveQL and PigLatin,
>> > provide a limited syntax for operating on data collections, in the
>> > form of relational joins and group-bys. Because of these limitations,
>> > these languages enable users to plug-in custom MapReduce scripts into
>> > their queries for those jobs that cannot be declaratively coded in
>> > their query language. This nullifies the benefits of using a
>> > declarative query language and may result to suboptimal, error-prone,
>> > and hard-to-maintain code. More importantly, these languages are
>> > inappropriate for complex scientific applications and graph analysis,
>> > because they do not directly support iteration or recursion in
>> > declarative form and are not able to handle complex, nested scientific
>> > data, which are often semi-structured. Furthermore, current MapReduce
>> > query processors apply traditional query optimization techniques that
>> > may be suboptimal in a MapReduce or BSP environment.
>> >
>> > * The MRQL design is modular, with pluggable distributed processing
>> > back-ends, query languages, and data formats.
>> > MRQL aims to be both powerful and adaptable. Although Hadoop is
>> > currently the most popular framework for large-scale data analysis,
>> > there are a few alternatives that are currently shaping form,
>> > including frameworks based on BSP (eg, Giraph, Pregel, Hama), MPI
>> > (eg, OpenMPI), etc. MRQL was designed in such a way so that it will
>> > be easy to support other distributed processing frameworks in the
>> > future. As an evidence of this claim, the MRQL processor required
>> > only 2K extra lines of Java code to support BSP evaluation.
>> >
>> > = Initial Goals =
>> >
>> > Some current goals include:
>> >
>> > * apply MRQL to graph analysis problems, such as k-means clustering
>> > and PageRank
>> >
>> > * apply MRQL to large-scale scientific analysis (develop general
>> > optimization techniques that can apply to matrix multiplication,
>> > matrix factorization, etc)
>> >
>> > * process additional data formats, such as Avro, and column-based
>> > stores, such as HBase
>> >
>> > * map MRQL to additional distributed processing frameworks, such as
>> > Spark and OpenMPI
>> >
>> > * extend the front-end to process more query languages, such as
>> > standard SQL, SPARQL, XQuery, and PigLatin
>> >
>> > = Current Status =
>> >
>> > The current MRQL release (version 0.8.10) is a beta release. It is
>> > built on top of Hadoop and Hama (no extensions are needed). It
>> > currently works on Hadoop up to 1.0.4 (but not on Yarn yet) and Hama
>> > 0.5.0. It has only been tested on a small cluster of 20 nodes (80
>> > cores).
>> >
>> > == Meritocracy ==
>> >
>> > The initial MRQL code base was developed by Leonidas Fegaras in May
>> > 2011, and was continuously improved throughout the years. We will
>> > reach out other potential contributors through open forums. We plan
>> > to do everything possible to encourage an environment that supports a
>> > meritocracy, where contributors will extend their privileges based on
>> > their contribution. MRQL's modular design will facilitate the
>> > strategic extensions to various modules, such as adding a standard-SQL
>> > interface, introducing new optimization techniques, etc.
>> >
>> > == Community ==
>> >
>> > The interest in open-source query processing systems for analyzing
>> > large datasets has been steadily increased in the last few years.
>> > Related Apache projects have already attracted a very large community
>> > from both academia and industry. We expect that MRQL will also
>> > establish an active community. Several researchers from both academia
>> > and industry who are interested in using our code have already
>> > contacted us.
>> >
>> > == Core Developers ==
>> >
>> > The initial core developer was Leonidas Fegaras, who wrote the
>> > majority of the code. He is an associate professor at UTA, with
>> > interests in cloud computing, databases, web technologies, and
>> > functional programming. He has an extensive knowledge and working
>> > experience in building complex query processing systems for databases,
>> > and compilers for functional and algorithmic programming languages.
>> >
>> > == Alignment ==
>> >
>> > MRQL is built on top of two Apache projects: Hadoop and Hama. We have
>> > plans to incorporate other products from the Hadoop ecosystem, such as
>> > Avro and HBase. MRQL can serve as a testbed for fine-tuning and
>> > evaluating the performance of the Apache Hama system. Finally, the
>> > MRQL query language and processor can be used by Apache Drill as a
>> > pluggable query language.
>> >
>> > = Known Risks =
>> >
>> > == Orphaned Products ==
>> >
>> > The initial committer is from academia, which may be a risk, since
>> > research in academia is publication-driven, rather than
>> > product-driven. It happens very often in academic research, when a
>> > project becomes outdated and doesn't produce publishable results, to
>> > be abandoned in favor of new cutting-edge projects. We do not believe
>> > that this will be the case for MRQL for the years to come, because it
>> > can be adapted to support new query languages, new optimization
>> > techniques, and new distributed back-ends, thus sustaining enough
>> > research interest. Another risk is that, when graduate students who
>> > write code graduate, they may leave their work undocumented and
>> > unfinished. We will strive to gain enough momentum to recruit
>> > additional committers from industry in order to eliminate these risks.
>> >
>> > == Inexperience with Open Source ==
>> >
>> > The initial developer has been involved with various projects whose
>> > source code has been released under open source license, but he has no
>> > prior experience on contributing to open-source projects. With the
>> > guidance from other more experienced committers and participants, we
>> > expect that the meritocracy rules will have a positive influence on
>> > this project.
>> >
>> > == Homogeneous Developers ==
>> >
>> > The initial committer comes from academia. However, given the interest
>> > we have seen in the project, we expect the diversity to improve in the
>> > near future.
>> >
>> > == Reliance on Salaried Developers ==
>> >
>> > Currently, the MRQL code was developed on the committer's volunteer
>> > time. In the future, UTA graduate students who will do some of the
>> > coding may be supported by UTA and funding agencies, such as NSF.
>> >
>> > == Relationships with Other Apache Products ==
>> >
>> > MRQL has some overlapping functionality with Hive and Tajo, which are
>> > Data Warehouse systems for Hadoop, and with Drill, which is an
>> > interactive data analysis system that can process nested data. MRQL
>> > has a more powerful data model, in which any form of nested data, such
>> > as XML and JSON, can be defined as a user-defined datatype. More
>> > importantly, complex data analysis tasks, such as PageRank, k-means
>> > clustering, and matrix multiplication and factorization, can be
>> > expressed as short SQL-like queries, while the MRQL system is able to
>> > evaluate these queries efficiently. Furthermore, the MRQL system can
>> > run these queries in BSP mode, in addition to MapReduce mode, thus
>> > achieving low latency and speed, which are also Drill's goals.
>> > Nevertheless, we will welcome and encourage any help from these
>> > projects and we will be eager to make contributions to these projects
>> > too.
>> >
>> > == An Excessive Fascination with the Apache Brand ==
>> >
>> > The Apache brand is likely to help us find contributors and reach out
>> > to the open-source community. Nevertheless, since MRQL depends on
>> > Apache projects (Hadoop and Hama), it makes sense to have our software
>> > available as part of this ecosystem.
>> >
>> > = Documentation =
>> >
>> > Information about MRQL can be found at http://lambda.uta.edu/mrql/
>> >
>> > = Initial Source =
>> >
>> > The initial MRQL code has been released as part of a research project
>> > developed at the University of Texas at Arlington under the Apache 2.0
>> > license for the past two years. The source code is currently hosted
>> > on GitHub at: https://github.com/fegaras/**mrql<
>> https://github.com/fegaras/mrql>MRQL’s release artifact
>> > would consist of a single tarball of packaging and test code.
>> >
>> > = External Dependencies =
>> >
>> > The MRQL source code is already licensed under the Apache License,
>> > Version 2.0. MRQL uses JLine which is distributed under the BSD
>> > license.
>> >
>> > = Cryptography =
>> >
>> > Not applicable.
>> >
>> > = Required Resources =
>> >
>> > == Mailing Lists ==
>> >
>> > * mrql-private
>> > * mrql-dev
>> > * mrql-user
>> >
>> > == Subversion Directory ==
>> >
>> > * Git is the preferred source control system:
>> > git://git.apache.org/mrql
>> >
>> > == Issue Tracking ==
>> >
>> > * A JIRA issue tracker, MRQL
>> >
>> > == Wiki ==
>> >
>> >  * Moinmoin wiki, http://wiki.apache.org/mrql
>> >
>> > = Initial Committers =
>> >
>> > * Leonidas Fegaras <fegaras AT cse DOT uta DOT edu>
>> > * Upa Gupta <upa.gupta AT mavs DOT uta DOT edu>
>> > * Edward J. Yoon <edwardyoon AT apache DOT org>
>> > * Maqsood Alam <maqsoodalam AT hotmail DOT com>
>> > * John Hope <john.hope AT oracle DOT com>
>> > * Mark Wall <mark.wall AT oracle DOT com>
>> > * Kuassi Mensah <kuassi.mensah AT oracle DOT com>
>> > * Ambreesh Khanna <ambreesh.khanna AT oracle DOT com>
>> > * Karthik Kambatla <kasha AT cloudera DOT com>
>> >
>> > = Affiliations =
>> >
>> > * Leonidas Fegaras (University of Texas at Arlington)
>> > * Upa Gupta (University of Texas at Arlington)
>> > * Edward J. Yoon (Oracle corp)
>> > * Maqsood Alam (Oracle corp)
>> > * John Hope (Oracle corp)
>> > * Mark Wall (Oracle corp)
>> > * Kuassi Mensah (Oracle corp)
>> > * Ambreesh Khanna (Oracle corp)
>> > * Karthik Kambatla (Cloudera)
>> >
>> > = Sponsors =
>> >
>> > == Champion ==
>> >
>> > * Edward J. Yoon <edwardyoon AT apache DOT org>
>> >
>> > == Nominated Mentors ==
>> >
>> > * Alex Karasulu <akarasulu AT apache DOT org>
>> > * Edward J. Yoon <edwardyoon AT apache DOT org>
>> >
>> > == Sponsoring Entity ==
>> >
>> > Incubator PMC
>> >
>> >
>>
>>
>> --
>> Best Regards,
>> -- Alex
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Mime
View raw message