hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Design" by NamitJain
Date Thu, 26 Feb 2009 02:10:07 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by NamitJain:

  description of the HiveQL language see the [wiki:Self:Hive/LanguageManual language manual].
  == Compiler ==
+  * Parser - Transform the query string into a parse tree representation.
+  * Semantic Analyser - Transform the parse tree to an internal query representation, which
is still block-based rather than an operator tree. In this step, column names are verified
and expansions such as * are performed. Type checking and any implicit type conversions
are also performed at this stage. If the table under consideration is partitioned,
which is the common scenario, all the expressions for that table are collected so that they
can later be used to prune the partitions that are not needed. If the query specifies
sampling, that is also collected for later use.
+  * Logical Plan Generator - Convert the internal query representation to a logical plan,
which consists of a tree of operators. Some of the operators are relational algebra operators
such as 'filter' and 'join'. Others are Hive-specific and are used later to convert
this plan into a series of map-reduce jobs. One such operator is the ReduceSink
operator, which occurs at the map-reduce boundary. This step also includes the optimizer, which
transforms the plan to improve performance. Some of those transformations are: converting
a series of joins into a single multi-way join, performing a map-side partial aggregation
for a group-by, and performing a group-by in 2 stages so that a single reducer does not
become a bottleneck when the data is skewed on the grouping key. Each operator comprises
a descriptor, which is a serializable object.
+  * Query Plan Generator - Convert the logical plan to a series of map-reduce tasks. The
operator tree is recursively traversed and broken up into a series of serializable map-reduce
tasks, which can be submitted later to the map-reduce framework for the Hadoop distributed
file system. The ReduceSink operator is the map-reduce boundary, and its descriptor contains
the reduction keys. The reduction keys in the ReduceSink descriptor are used as the reduction
keys at the map-reduce boundary. If the query specifies them, the plan consists only of the
required samples/partitions.
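
The Query Plan Generator step above can be illustrated with a minimal sketch. It is not Hive's implementation: the names Operator, MapReduceTask, chain and split_into_tasks are hypothetical, and the sketch assumes a linear operator chain (no joins or other branching). It only shows the idea of recursively walking the operators and cutting the plan at each ReduceSink, whose reduction keys become the keys of the map-reduce boundary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Operator:
    """Hypothetical stand-in for an operator in the logical plan."""
    name: str
    # Set only on a ReduceSink; plays the role of its descriptor's keys.
    reduction_keys: Optional[List[str]] = None
    child: Optional["Operator"] = None  # assumption: linear chain only


@dataclass
class MapReduceTask:
    """Hypothetical stand-in for one map-reduce task of the query plan."""
    map_ops: List[str] = field(default_factory=list)
    reduce_ops: List[str] = field(default_factory=list)
    reduction_keys: List[str] = field(default_factory=list)


def chain(*ops: Operator) -> Operator:
    """Link operators parent -> child and return the root."""
    for parent, child in zip(ops, ops[1:]):
        parent.child = child
    return ops[0]


def split_into_tasks(root: Operator) -> List[MapReduceTask]:
    """Recursively walk the chain, cutting it at every ReduceSink."""
    tasks = [MapReduceTask()]

    def visit(op: Optional[Operator], in_reduce: bool) -> None:
        if op is None:
            return
        if op.reduction_keys is not None:
            # A ReduceSink met on the reduce side needs another shuffle,
            # so this sketch starts a new task for it.
            if in_reduce:
                tasks.append(MapReduceTask())
            tasks[-1].map_ops.append(op.name)
            tasks[-1].reduction_keys = list(op.reduction_keys)
            visit(op.child, True)
        elif in_reduce:
            tasks[-1].reduce_ops.append(op.name)
            visit(op.child, True)
        else:
            tasks[-1].map_ops.append(op.name)
            visit(op.child, False)

    visit(root, False)
    return tasks


# A group-by with map-side partial aggregation; "dept" is a made-up column.
plan = chain(
    Operator("TableScan"),
    Operator("Filter"),
    Operator("GroupBy(partial)"),
    Operator("ReduceSink", reduction_keys=["dept"]),
    Operator("GroupBy(final)"),
    Operator("FileSink"),
)
tasks = split_into_tasks(plan)
for i, task in enumerate(tasks):
    print(i, task.map_ops, task.reduction_keys, task.reduce_ops)
```

In this sketch a plan with a single ReduceSink yields one task, with everything up to and including the ReduceSink on the map side and the rest on the reduce side. A plan with a second ReduceSink further down the chain yields two tasks, which is how a 2-stage group-by would appear under these assumptions.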
  == Optimizer ==
