incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Trivial Update of "MRQLProposal" by LeonidasFegaras
Date Sat, 02 Mar 2013 01:32:20 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "MRQLProposal" page has been changed by LeonidasFegaras:
http://wiki.apache.org/incubator/MRQLProposal?action=diff&rev1=12&rev2=13

Comment:
some cosmetic changes (made it ready for proposal)

  
  = Proposal =
  
- MRQL (pronounced ''miracle'') is a query processing and optimization system for large-scale,
distributed data analysis. MRQL (the !MapReduce Query Language) is an SQL-like query language
for large-scale data analysis on a cluster of computers. The MRQL query processing system
can evaluate MRQL queries in two modes: in !MapReduce mode on top of [[http://hama.apache.org/|Apache
Hadoop]] or in Bulk Synchronous Parallel (BSP) mode on top of [[http://hama.apache.org/|Apache
Hama]]. The MRQL query language is powerful enough to express most common data analysis tasks
over many forms of raw ''in-situ'' data, such as XML and JSON documents, binary files, and
CSV documents. MRQL is more powerful than other current high-level !MapReduce languages, such
as Hive and Pig Latin, since it can operate on more complex data and supports more powerful
query constructs, thus eliminating the need for using explicit !MapReduce code. With MRQL,
users will be able to express complex data analysis tasks, such as !PageRank, k-means clustering,
matrix factorization, etc, using SQL-like queries exclusively, while the MRQL query processing
system will be able to compile these queries to efficient Java code.
+ MRQL (pronounced ''miracle'') is a query processing and optimization system for large-scale,
distributed data analysis. MRQL (the !MapReduce Query Language) is an SQL-like query language
for large-scale data analysis on a cluster of computers. The MRQL query processing system
can evaluate MRQL queries in two modes: in !MapReduce mode on top of [[http://hama.apache.org/|Apache
Hadoop]] or in Bulk Synchronous Parallel (BSP) mode on top of [[http://hama.apache.org/|Apache
Hama]]. The MRQL query language is powerful enough to express most common data analysis tasks
over many forms of raw ''in-situ'' data, such as XML and JSON documents, binary files, and
CSV documents. MRQL is more powerful than other current high-level !MapReduce languages, such
as Hive and !PigLatin, since it can operate on more complex data and supports more powerful
query constructs, thus eliminating the need for using explicit !MapReduce code. With MRQL,
users will be able to express complex data analysis tasks, such as !PageRank, k-means clustering,
matrix factorization, etc, using SQL-like queries exclusively, while the MRQL query processing
system will be able to compile these queries to efficient Java code.
  
  = Background =
  
@@ -19, +19 @@

  = Rationale =
  
   * MRQL will be the first general-purpose, SQL-like query language for data analysis based
on BSP.
- Currently, many programmers prefer to code their !MapReduce applications in a higher-level
query language, rather than an algorithmic language. For instance, Pig is used for 60% of
Yahoo! !MapReduce jobs, while Hive is used for 90% of Facebook !MapReduce jobs. This, we believe,
will also be the trend for BSP applications, because, even though, in principle, the BSP model
is very simple to understand, it is hard to develop, optimize, and maintain non-trivial BSP
applications coded in a general-purpose programming language. Currently, there is no widely
acceptable declarative BSP query language, although there are a few special-purpose BSP systems
for graph analysis, such as Google Pregel and [[http://incubator.apache.org/giraph/|Apache
Giraph]], for machine learning, such as [[http://frederic.loulergue.eu/research/bsml/main.html|BSML]],
and for scientific data analysis.
+  . Currently, many programmers prefer to code their !MapReduce applications in a higher-level
query language, rather than an algorithmic language. For instance, Pig is used for 60% of
Yahoo! !MapReduce jobs, while Hive is used for 90% of Facebook !MapReduce jobs. This, we believe,
will also be the trend for BSP applications, because, even though, in principle, the BSP model
is very simple to understand, it is hard to develop, optimize, and maintain non-trivial BSP
applications coded in a general-purpose programming language. Currently, there is no widely
acceptable declarative BSP query language, although there are a few special-purpose BSP systems
for graph analysis, such as Google Pregel and [[http://incubator.apache.org/giraph/|Apache
Giraph]], for machine learning, such as [[http://frederic.loulergue.eu/research/bsml/main.html|BSML]],
and for scientific data analysis.
  
   * MRQL can capture many complex data analysis algorithms in declarative form.
- Existing !MapReduce query languages, such as HiveQL and !PigLatin, provide a limited syntax
for operating on data collections, in the form of relational joins and group-bys. Because
of these limitations, these languages enable users to plug-in custom !MapReduce scripts into
their queries for those jobs that cannot be declaratively coded in their query language. This
nullifies the benefits of using a declarative query language and may result to suboptimal,
error-prone, and hard-to-maintain code. More importantly, these languages are inappropriate
for complex scientific applications and graph analysis, because they do not directly support
iteration or recursion in declarative form and are not able to handle complex, nested scientific
data, which are often semi-structured. Furthermore, current !MapReduce query processors apply
traditional query optimization techniques that may be suboptimal in a !MapReduce or BSP environment.
+  . Existing !MapReduce query languages, such as HiveQL and !PigLatin, provide a limited
syntax for operating on data collections, in the form of relational joins and group-bys. Because
of these limitations, these languages enable users to plug-in custom !MapReduce scripts into
their queries for those jobs that cannot be declaratively coded in their query language. This
nullifies the benefits of using a declarative query language and may result to suboptimal,
error-prone, and hard-to-maintain code. More importantly, these languages are inappropriate
for complex scientific applications and graph analysis, because they do not directly support
iteration or recursion in declarative form and are not able to handle complex, nested scientific
data, which are often semi-structured. Furthermore, current !MapReduce query processors apply
traditional query optimization techniques that may be suboptimal in a !MapReduce or BSP environment.
  
-  * The MRQL design is modular and adaptable, with pluggable distributed processing back-ends,
query languages, and data formats.
+  * The MRQL design is modular, with pluggable distributed processing back-ends, query languages,
and data formats.
- MRQL's goal is to be powerful as well as adaptable. Although Hadoop is currently the most
popular framework for large-scale data analysis, there are a few alternatives that are currently
shaping form, including frameworks based on BSP (eg, Giraph, Pregel, Hama), MPI (eg, OpenMPI),
etc. MRQL was designed in such a way so that it will be easy to support other distributed
processing frameworks in the future. As an evidence of this claim, the MRQL processor required
only 2K extra lines of Java code to support BSP evaluation.
+  . MRQL aims to be both powerful and adaptable. Although Hadoop is currently the most popular
framework for large-scale data analysis, there are a few alternatives that are currently taking
shape, including frameworks based on BSP (e.g., Giraph, Pregel, Hama), MPI (e.g., OpenMPI), etc.
MRQL was designed so that it will be easy to support other distributed processing
frameworks in the future. As evidence of this claim, the MRQL processor required only 2K
extra lines of Java code to support BSP evaluation.
  
  = Initial Goals =
  
  Some current goals include:
  
-  * apply MRQL to graph analysis problems, such as k-means clustering and pagerank
+  * apply MRQL to graph analysis problems, such as k-means clustering and !PageRank
  
   * apply MRQL to large-scale scientific analysis (develop general optimization techniques
that can apply to matrix multiplication, matrix factorization, etc)
  
   * process additional data formats, such as [[http://avro.apache.org/|Avro]], and column-based
stores, such as [[http://hbase.apache.org/|HBase]]
  
   * map MRQL to additional distributed processing frameworks, such as [[http://spark-project.org/|Spark]]
and [[http://www.open-mpi.org/|OpenMPI]]
+ 
+  * extend the front-end to process more query languages, such as standard SQL, SPARQL, XQuery,
and !PigLatin
  
  = Current Status =
  
@@ -94, +96 @@

  
  == Relationships with Other Apache Products ==
  
- MRQL has some overlapping functionality with [[http://hive.apache.org/|Apache Hive]] and
[[DrillProposal|Apache Drill]].
+ MRQL has some overlapping functionality with [[http://hive.apache.org/|Hive]] and [[TajoProposal|Tajo]],
which are Data Warehouse systems for Hadoop,
+ and with [[DrillProposal|Drill]], which is an interactive data analysis system that can
process nested data. MRQL has a more powerful data model, in which any form of nested data,
such as XML and JSON, can be defined as a user-defined datatype. More importantly, complex
data analysis tasks, such as !PageRank, k-means clustering, and matrix multiplication and
factorization, can be expressed as short SQL-like queries, while the MRQL system is able to
evaluate these queries efficiently. In addition, the MRQL system can run these queries
in BSP mode as well as in !MapReduce mode, thus achieving low latency and high speed, which is
one of Drill's goals. Nevertheless, we will welcome and encourage any help from these
projects, and we will be eager to contribute to them too.
  
  == An Excessive Fascination with the Apache Brand ==
  

