hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/HiveQL/Transform" by ZhengShao
Date Wed, 21 Jan 2009 23:59:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/HiveQL/Transform

------------------------------------------------------------------------------
      MAP expression (, expression)*
      USING 'my_map_script'
      ( AS colName (, colName)* )?
-     ( clusterBy? | distributeBy? sortBy? )
+     ( clusterBy? | distributeBy? sortBy? ) src_alias
    )
    REDUCE expression (, expression)*
      USING 'my_reduce_script'
      ( AS colName (, colName)* )?
  }}}
  
+ Example:
+ {{{
+   FROM (
+     FROM pv_users
+     MAP pv_users.userid, pv_users.date
+     USING 'map_script'
+     AS dt, uid
+     CLUSTER BY dt) map_output
+   INSERT OVERWRITE TABLE pv_users_reduced
+     REDUCE map_output.dt, map_output.uid
+     USING 'reduce_script'
+     AS date, count;
+ }}}
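The example above does not show what 'reduce_script' contains. As a minimal sketch only, assuming rows reach the script as tab-separated lines (one row per line) and assuming the script is meant to count rows per dt, its core logic might look like the following. The function name `reduce_counts` and the counting behaviour are hypothetical and not taken from the wiki page.

```python
from itertools import groupby

def reduce_counts(lines):
    """Yield 'dt<TAB>count' for each run of consecutive lines sharing the same dt.

    Assumption: lines are tab-separated with dt in the first field, and rows
    with the same dt arrive next to each other (the effect of CLUSTER BY dt).
    """
    # Skip blank lines and split each remaining line into its fields.
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for dt, group in groupby(rows, key=lambda fields: fields[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"
```

A real script would pass `sys.stdin` as `lines` and print each yielded string. Because `groupby` only merges adjacent rows, the sketch relies on the rows being clustered by dt before they reach the reducer.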
  
- === Cluster By/Distribute By/Sort By ===
- ''clusterBy'' is a short-cut for both ''distributeBy'' and ''sortBy''.
- 
- Hive uses the columns in ''distributeBy'' to distribute the rows among reducers.  All rows
with the same ''distributeBy'' columns will go to the same reducer.
- 
- Hive uses the columns in ''sortBy'' to sort the rows before feeding the rows to a single
reducer.  The sort order will be dependent on the column types.  If the column is of numeric
type, then the sort order is also in numeric order.  If the column is of string type, then
the sort order will be lexicographical order.
- 
- === Schema-less Map-reduce Scripts ===
+ == Schema-less Map-reduce Scripts ==
  If there is no ''AS'' clause after ''USING my_script'', Hive assumes the output of the script
contains two parts: the key, which is everything before the first tab, and the value, which is
everything after the first tab.  Note that this differs from specifying ''AS key, value'',
because in that case the value contains only the portion between the first tab and the second
tab when there are multiple tabs.
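The difference between the two cases can be sketched in Python. This is an illustration of the splitting rule described above, not Hive's implementation; the function names are hypothetical, and returning None when a line has no tab is a placeholder, since the text does not say what happens in that case.

```python
def split_schemaless(line):
    """No AS clause: key is before the first tab, value is everything after it."""
    key, sep, rest = line.partition("\t")
    # Placeholder behaviour (an assumption): no tab means no value.
    return key, (rest if sep else None)

def split_as_key_value(line):
    """AS key, value: value is only the field between the first and second tab."""
    fields = line.split("\t")
    # Placeholder behaviour (an assumption): no tab means no value.
    return fields[0], (fields[1] if len(fields) > 1 else None)
```

For a script output line such as `"a\tb\tc"`, the first function keeps the second tab inside the value, while the second one drops everything after it.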
  
- 
- == Transform/Map-Reduce Examples ==
- 
- Schema-less Map-reduce: Note that we can directly do ''CLUSTER BY key'' without specifying
the output schema of the scripts.
+ Note that we can directly do ''CLUSTER BY key'' without specifying the output schema of
the scripts.
  {{{
    FROM (
      FROM pv_users
@@ -57, +60 @@

      AS date, count;
  }}}
  
+ 
+ == Cluster By/Distribute By/Sort By ==
+ ''clusterBy'' is a short-cut for both ''distributeBy'' and ''sortBy''.
+ 
+ Hive uses the columns in ''distributeBy'' to distribute the rows among reducers.  All rows
with the same ''distributeBy'' columns will go to the same reducer.
+ 
+ Hive uses the columns in ''sortBy'' to sort the rows before feeding them to a single
reducer.  The sort order depends on the column types.  If the column is of numeric
type, the rows are sorted in numeric order.  If the column is of string type, the rows
are sorted in lexicographical order.
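The two steps can be simulated in a few lines of Python. This is a sketch under stated assumptions: the CRC32-based partitioner is a stand-in chosen for the illustration (the page does not say how Hive assigns rows to reducers), and `distribute_and_sort` is a hypothetical helper name.

```python
import zlib
from collections import defaultdict

def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """Group rows by reducer, then sort each reducer's rows independently."""
    reducers = defaultdict(list)
    for row in rows:
        # Stand-in partitioner (an assumption): any deterministic hash works,
        # since equal distribute keys always map to the same reducer index.
        digest = zlib.crc32(repr(distribute_key(row)).encode("utf-8"))
        reducers[digest % num_reducers].append(row)
    # Sorting happens per reducer, not across the whole data set.
    return {idx: sorted(part, key=sort_key) for idx, part in reducers.items()}

rows = [("2009-01-02", 10), ("2009-01-01", 9), ("2009-01-01", 2), ("2009-01-02", 1)]
result = distribute_and_sort(rows, lambda r: r[0], lambda r: (r[0], r[1]), num_reducers=2)

# The same values order differently depending on their type.
numeric_order = sorted([10, 9, 2])        # [2, 9, 10]
string_order = sorted(["10", "9", "2"])   # ['10', '2', '9']
```

Here the distribute key is a prefix of the sort key, so every reducer sees all rows for a given date together and in ascending order of the second column.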
+ 
- ''Distribute By'' and ''Sort By'': Instead of specifying "cluster by", the user can specify
"distribute by" and "sort by", so the partition columns and sort columns can be different.
The usual case is that the partition columns are a prefix of sort columns, but that is not
required.
+ Instead of specifying "cluster by", the user can specify "distribute by" and "sort by",
so the partition columns and sort columns can be different. The usual case is that the partition
columns are a prefix of sort columns, but that is not required.
  
  : FROM (
  ::  FROM pv_users
