hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/LanguageManual/Transform" by ZhengShao
Date Thu, 22 Jan 2009 22:54:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform

------------------------------------------------------------------------------
  In the syntax, both ''MAP'' and ''REDUCE'' can also be written as ''SELECT TRANSFORM''. There is actually no difference between these three.
  Hive runs the reduce script in the reduce task (instead of the map task) because of the
''clusterBy''/''distributeBy''/''sortBy'' clause in the inner query.
  
+ Please also see [wiki:Self:Hive/LanguageManual/SortBy Sort By / Cluster By / Distribute By].
+ 
  {{{
  clusterBy: CLUSTER BY colName (',' colName)*
  distributeBy: DISTRIBUTE BY colName (',' colName)*
- sortBy: SORT BY colName (',' colName)*
+ sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*
  
  query:
    FROM (
@@ -61, +63 @@

      AS date, count;
  }}}
  
- 
- == Cluster By/Distribute By/Sort By ==
- ''clusterBy'' is a short-cut for both ''distributeBy'' and ''sortBy''.
- 
- Hive uses the columns in ''distributeBy'' to distribute the rows among reducers.  All rows with the same values in the ''distributeBy'' columns go to the same reducer.
- 
- Hive uses the columns in ''sortBy'' to sort the rows before feeding them to a reducer.  The sort order depends on the column types: numeric columns sort in numeric order, and string columns sort in lexicographical order.
- 
- Instead of specifying "cluster by", the user can specify "distribute by" and "sort by",
so the partition columns and sort columns can be different. The usual case is that the partition
columns are a prefix of sort columns, but that is not required.
- 
- {{{
-   FROM (
-     FROM pv_users
-     MAP ( pv_users.userid, pv_users.date )
-     USING 'map_script'
-     AS c1, c2, c3
-     DISTRIBUTE BY c2
-     SORT BY c2, c1) map_output
-   INSERT OVERWRITE TABLE pv_users_reduced
-     REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
-     USING 'reduce_script'
-     AS date, count;
- }}}
- 
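
The distribute/sort behaviour described in the removed section can be sketched in Python. This is only an illustration under stated assumptions, not Hive's implementation: the function name distribute_and_sort, the sample rows, the reducer count, and the use of crc32 as the partitioning hash are all hypothetical stand-ins. It shows that rows sharing the same distribute column values land on the same reducer, and that each reducer's rows are then sorted by the sort columns.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, distribute_cols, sort_cols, num_reducers):
    """Partition rows by distribute_cols, then sort each partition by sort_cols.

    Hypothetical helper for illustration only; Hive's real partitioning
    hash and shuffle differ.
    """
    partitions = defaultdict(list)
    for row in rows:
        # Build a key from the distribute columns only.
        key = "\t".join(str(row[c]) for c in distribute_cols)
        # crc32 is an assumed stand-in for Hive's hash function.
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[reducer].append(row)
    # Each reducer sees its own rows in sort_cols order.
    for reducer_rows in partitions.values():
        reducer_rows.sort(key=lambda r: tuple(r[c] for c in sort_cols))
    return dict(partitions)


# Sample map output, mirroring DISTRIBUTE BY c2 SORT BY c2, c1 (made-up data).
rows = [
    {"c1": "u3", "c2": "2009-01-02", "c3": 1},
    {"c1": "u1", "c2": "2009-01-01", "c3": 1},
    {"c1": "u2", "c2": "2009-01-02", "c3": 1},
    {"c1": "u1", "c2": "2009-01-02", "c3": 1},
    {"c1": "u2", "c2": "2009-01-01", "c3": 1},
]

result = distribute_and_sort(rows, ["c2"], ["c2", "c1"], num_reducers=3)
for reducer, reducer_rows in sorted(result.items()):
    print(reducer, [(r["c2"], r["c1"]) for r in reducer_rows])

# The sort order depends on the column type:
print(sorted([10, 9, 2]))        # numeric order: [2, 9, 10]
print(sorted(["10", "9", "2"]))  # lexicographical order: ['10', '2', '9']
```

Because the distribute column (c2) is a prefix of the sort columns (c2, c1) here, each reducer receives all rows for a given c2 value as one contiguous, ordered run, which is the usual case mentioned above.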
