hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/LanguageManual/SortBy" by ZhengShao
Date Thu, 22 Jan 2009 22:52:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by ZhengShao:

  sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
  query: SELECT expression (',' expression)* FROM src sortBy
+ Hive uses the columns in ''SORT BY'' to sort the rows before feeding the rows to a single
reducer.  The sort order will be dependent on the column types.  If the column is of numeric
type, then the sort order is also in numeric order.  If the column is of string type, then
the sort order will be lexicographical order.
  == Difference between Sort By and Order By ==
  Most database systems supports ''ORDER BY'', which Hive does not support directly.
@@ -46, +49 @@

  SELECT key, value FROM src SORT BY key ASC, value DESC;
+ This sometimes will make the reducer a performance bottleneck.  A lot of cases the user
only wants to see the top N rows where N is a small number.  In this case, we can use LIMIT
clause.  We don't have an example here but users are encouraged to provide one.
+ == Syntax of Cluster By and Distribute By ==
+ ''Cluster By'' and ''Distribute By'' are used mainly with the [wiki:Self:Hive/LanguageManual/Transform
Transform/Map-Reduce Scripts]. But, it is sometimes useful in SELECT statements if there is
a need to partition and sort the output of a query for subsequent queries.
+ ''Cluster By'' is a short-cut for both ''Distribute By'' and ''Sort By''.
+ Hive uses the columns in ''Distribute By'' to distribute the rows among reducers.  All rows
with the same ''Distribute By'' columns will go to the same reducer.
+ Instead of specifying ''Cluster By'', the user can specify ''Distribute By'' and ''Sort
By'', so the partition columns and sort columns can be different. The usual case is that the
partition columns are a prefix of sort columns, but that is not required.
+ {{{
+ SELECT col1, col2 FROM t1 CLUSTER BY col1
+ }}}
+ {{{
+ SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
+ SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
+ }}}
+ {{{
+   FROM (
+     FROM pv_users
+     MAP ( pv_users.userid, pv_users.date )
+     USING 'map_script'
+     AS c1, c2, c3
+     SORT BY c2, c1) map_output
+   INSERT OVERWRITE TABLE pv_users_reduced
+     REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
+     USING 'reduce_script'
+     AS date, count;
+ }}}

View raw message