incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "MRQLProposal" by LeonidasFegaras
Date Thu, 28 Feb 2013 23:27:27 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "MRQLProposal" page has been changed by LeonidasFegaras:
http://wiki.apache.org/incubator/MRQLProposal?action=diff&rev1=7&rev2=8

Comment:
Added Rational, etc

  
  = Proposal =
  
- MRQL (pronounced ''miracle'') is a query processing and optimization system for large-scale,
distributed data analysis. MRQL (the Map-Reduce Query Language) is an SQL-like query language
for large-scale data analysis on a cluster of computers. The MRQL query processing system
can execute MRQL queries in two modes: in Map-Reduce mode on top of [[http://hama.apache.org/|Apache
Hadoop]] or in Bulk Synchronous Parallel (BSP) mode on top of [[http://hama.apache.org/|Apache
Hama]]. The MRQL query language is powerful enough to express most common data analysis tasks
over many forms of raw data, such as XML and JSON documents, binary files, and line-oriented
text documents with comma-separated values. MRQL is more powerful than other current high-level
Map-Reduce languages, such as Hive and Pig Latin, since it can operate on more complex data
and supports more powerful query constructs, thus eliminating the need for using explicit
Map-Reduce code. With MRQL, users will be able to express complex data analysis tasks, such
as pagerank, k-means clustering, matrix factorization, etc, using SQL-like queries exclusively,
while the MRQL query processing system will be able to compile these queries to efficient
Java code.
+ MRQL (pronounced ''miracle'') is a query processing and optimization system for large-scale,
distributed data analysis. MRQL (the !MapReduce Query Language) is an SQL-like query language
for large-scale data analysis on a cluster of computers. The MRQL query processing system
can execute MRQL queries in two modes: in !MapReduce mode on top of [[http://hadoop.apache.org/|Apache
Hadoop]] or in Bulk Synchronous Parallel (BSP) mode on top of [[http://hama.apache.org/|Apache
Hama]]. The MRQL query language is powerful enough to express most common data analysis tasks
over many forms of raw data, such as XML and JSON documents, binary files, and line-oriented
text documents with comma-separated values. MRQL is more powerful than other current high-level
!MapReduce languages, such as Hive and Pig Latin, since it can operate on more complex data
and supports more powerful query constructs, thus eliminating the need for using explicit
!MapReduce code. With MRQL, users will be able to express complex data analysis tasks, such
as !PageRank, k-means clustering, and matrix factorization, using SQL-like queries exclusively,
while the MRQL query processing system will be able to compile these queries to efficient
Java code.
  
  = Background =
  
- The initial code was developed at the University of Texas of Arlington the by a research
team, led by Leonidas Fegaras. The original goal was to build a query processing system that
translates SQL-like data analysis queries to efficient workflows of Map-Reduce jobs. Our goal
was to use HDFS as the physical storage layer, without any indexing, data partitioning, or
data normalization,
- and to use Hadoop (without extensions) as the run-time engine.
+ The initial code was developed at the University of Texas at Arlington by a research team
led by Leonidas Fegaras. The software was first released in May 2011. The original goal of this
project was to build a query processing system that translates SQL-like data analysis queries
to efficient workflows of !MapReduce jobs. A design goal was to use HDFS as the physical storage
layer, without any indexing, data partitioning, or data normalization, and to use Hadoop (without
extensions) as the run-time engine. The motivation behind this work was to build a platform
to test new ideas on query processing and optimization techniques applicable to the !MapReduce
framework.
+ 
+ One year ago, MRQL was extended to run on Hama. The motivation for this extension was that
Hadoop !MapReduce jobs were required to read their input and write their output on HDFS. This
simplifies reliability and fault tolerance, but it imposes a high overhead on complex !MapReduce
workflows and graph algorithms, such as !PageRank, which require repetitive jobs. In addition,
Hadoop does not preserve data in memory between the map and reduce tasks or across consecutive
!MapReduce jobs. This restriction requires data to be reread at every step, even when the data
is constant. BSP, on the other hand, does not suffer from this restriction and, under certain
circumstances, allows complex repetitive algorithms to run entirely in the collective memory
of a cluster. Thus, the goal was to be able to run the same MRQL queries in both modes, !MapReduce
and BSP, without modifying the queries: if there are enough resources available and performance
is preferred over resilience, queries may run in BSP mode; otherwise, the same queries are
run in !MapReduce mode. BSP was found to be a good choice when fault tolerance is less important,
data (both input and intermediate) can fit in the cluster memory, and data processing requires
complex/repetitive steps.
+ 
+ The research results of this ongoing work have been published in conferences (WebDB'11,
EDBT'12, and Data-Cloud'12), and the authors have received positive feedback from
researchers in academia and industry who attended these conferences.
  
  = Rationale =
+ 
+ Currently, many programmers prefer to use a higher-level query language to code
+ their !MapReduce applications, instead of coding
+ them directly in an algorithmic language, such as Java.
+ For instance, Pig is used for 60% of Yahoo! !MapReduce jobs,
+ while Hive is used for 90% of Facebook !MapReduce jobs.
+ This, we believe, will also be the trend for BSP applications,
+ because, even though the BSP model is, in principle, very
+ simple to understand, it is hard to develop, optimize, and
+ maintain non-trivial BSP applications coded in a general-purpose
+ programming language. Currently, there is no widely
+ accepted BSP query language. Existing !MapReduce query languages,
+ such as HiveQL and Pig Latin, provide a limited syntax for
+ operating on data collections, in the form of relational joins
+ and group-bys. Because of these limitations, these languages
+ let users plug custom !MapReduce scripts into their queries
+ for those jobs that cannot be declaratively coded in their query
+ language. This nullifies the benefits of using a declarative
+ query language and may result in suboptimal, error-prone,
+ and hard-to-maintain code. More importantly, these languages
+ are inappropriate for complex scientific applications and graph
+ analysis, because they do not directly support iteration
+ or recursion in declarative form and are not able to handle complex,
+ nested scientific data, which are often semi-structured.
+ Furthermore, current !MapReduce query processors apply traditional
+ query optimization techniques that may be suboptimal in a !MapReduce or BSP environment.
+ 
+ MRQL's goal is to be powerful as well as adaptable.
+ Although Hadoop is currently the most popular framework for large-scale data analysis,
+ there are many alternatives that are currently taking shape, including frameworks
+ based on BSP (e.g., Giraph, Pregel, Hama) and MPI (e.g., OpenMPI).
+ MRQL was designed so that it will be easy in the future to support other distributed
processing frameworks.
+ Evidence of this is its ability to run queries in both !MapReduce and BSP modes.
  
  = Initial Goals =
  
@@ -29, +65 @@

  
  = Current Status =
  
- Currently, MRQL is in a beta release (version 0.8.10). It is built on top of Hadoop and
Hama (no extensions are needed).
+ The current MRQL release (version 0.8.10) is a beta release. It is built on top of Hadoop
and Hama (no extensions are needed).
  It currently works on Hadoop up to 1.0.4 (but not on YARN yet) and Hama 0.5.0.
  It has only been tested on a small cluster of 20 nodes (80 cores).
  
