hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/HiveQL/Transform" by ZhengShao
Date Wed, 21 Jan 2009 23:47:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/HiveQL/Transform

------------------------------------------------------------------------------
  
  == Transform/Map-Reduce Syntax ==
  
+ Users can also plug their own custom mappers and reducers into the data stream by using
features natively supported in the Hive 2.0 language. For example, to run a custom mapper
script (map_script) and a custom reducer script (reduce_script), the user can issue the
following command, which uses the TRANSFORM clause to embed the mapper and the reducer scripts.
+ 
+ Note that columns are converted to strings and delimited by TAB before being fed to
the user script, and the standard output of the user script is treated as TAB-separated
string columns. User scripts can write debug information to standard error, which is
shown on the task detail page in Hadoop.
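
As an illustration of the input/output contract described above, here is a minimal sketch of a user script in Python. The function name `transform` and the upper-casing of the first column are hypothetical choices made for this example; the only parts taken from the text are TAB-separated rows in, TAB-separated rows out, and debug messages on standard error.

```python
import sys


def transform(lines, out=sys.stdout, err=sys.stderr):
    """Read TAB-separated rows and write TAB-separated rows.

    The transformation itself (upper-casing the first column) is a
    hypothetical placeholder, not something Hive prescribes.
    """
    for line in lines:
        # Every column arrives as a string, separated by TAB.
        cols = line.rstrip("\n").split("\t")
        # Debug information goes to standard error, never to standard output,
        # because standard output is parsed as result rows.
        print(f"debug: read {len(cols)} columns", file=err)
        cols[0] = cols[0].upper()
        # Output rows are TAB-separated string columns as well.
        out.write("\t".join(cols) + "\n")


# A real script would end with: transform(sys.stdin)
```

When saved as an executable script that calls `transform(sys.stdin)`, something like this could be named in the USING clause.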
+ 
+ 
  {{{
- 
  clusterBy: CLUSTER BY colName (, colName)*
  distributeBy: DISTRIBUTE BY colName (, colName)*
  sortBy: SORT BY colName (, colName)*
@@ -14, +18 @@

  query:
    FROM (
      FROM src
-     MAP ( expression (, expression)* )
+     MAP expression (, expression)*
      USING 'my_map_script'
-     ( AS (colName (, colName)* ) )?
+     ( AS colName (, colName)* )?
      ( clusterBy? | distributeBy? sortBy? )
    )
-   REDUCE ( expression (, expression)* )
+   REDUCE expression (, expression)*
      USING 'my_reduce_script'
-     ( AS (colName (, colName)* ) )?
+     ( AS colName (, colName)* )?
  
  }}}
  
+ Both ''MAP'' and ''REDUCE'' can also be written as ''SELECT TRANSFORM''.  There is
no difference among the three forms.
+ It is the ''clusterBy''/''distributeBy''/''sortBy'' clause that causes Hive to run the
reduce script in the reduce task.
+ 
+ ''clusterBy'' is a shortcut for specifying both ''distributeBy'' and ''sortBy'' on the same columns.
+ 
+ Hive uses the columns in ''distributeBy'' to distribute the rows among reducers.  All rows
with the same values in the ''distributeBy'' columns will go to the same reducer.
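
The guarantee that rows with equal key values reach the same reducer can be sketched with a simple hash partitioner. This is only an illustration: the function `distribute` is hypothetical, and using CRC32 modulo the number of reducers is an assumption made for the example, not a description of the hash function Hive actually uses.

```python
import zlib
from collections import defaultdict


def distribute(rows, key_indexes, num_reducers):
    """Assign each row to a reducer based on its key columns.

    Assumption for illustration: reducer = crc32(key) mod num_reducers.
    """
    buckets = defaultdict(list)
    for row in rows:
        # Build the key from the chosen columns only.
        key = "\t".join(str(row[i]) for i in key_indexes)
        # A deterministic hash sends equal keys to the same reducer.
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[reducer].append(row)
    return buckets
```

Different keys may still share a reducer; the only property shown here is that one key never spans two reducers.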
+ 
+ Hive uses the columns in ''sortBy'' to sort the rows before feeding them to each
reducer.  The sort order depends on the column types.  If the column is of numeric
type, the rows are sorted in numeric order.  If the column is of string type,
the rows are sorted in lexicographical order.
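
The difference between the two orderings is easy to see on a small set of values. The sketch below uses plain Python sorting as a stand-in for the two column types; the sample values are made up for illustration.

```python
values = ["10", "9", "100", "2"]

# Treated as strings, the values sort lexicographically,
# character by character.
as_strings = sorted(values)

# Treated as numbers, the same values sort by magnitude.
as_numbers = sorted(values, key=int)

print(as_strings)  # ['10', '100', '2', '9']
print(as_numbers)  # ['2', '9', '10', '100']
```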
+ 
+ 
  == Transform/Map-Reduce Examples ==
  
