hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/SortBy" by Ning Zhang
Date Tue, 15 Jun 2010 17:31:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/SortBy" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy?action=diff&rev1=5&rev2=6

--------------------------------------------------

  <<TableOfContents>>
+ 
+ == Syntax of Order By ==
+ The ''ORDER BY'' syntax in Hive QL is similar to the syntax of ''ORDER BY'' in SQL language.
+ 
+ {{{
+ colOrder: ( ASC | DESC )
+ orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
+ query: SELECT expression (',' expression)* FROM src orderBy
+ }}}
+ 
+ There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict),
the order by clause has to be followed by a "limit" clause. The limit clause is not necessary
if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order
of all results, there has to be one reducer to sort the final output. If the number of rows
in the output is too large, the single reducer could take a very long time to finish. 
  
  == Syntax of Sort By ==
  The ''SORT BY'' syntax is similar to the syntax of ''ORDER BY'' in SQL language.
@@ -11, +22 @@

  query: SELECT expression (',' expression)* FROM src sortBy
  }}}
  
- Hive uses the columns in ''SORT BY'' to sort the rows before feeding the rows to a single
reducer.  The sort order will be dependent on the column types.  If the column is of numeric
type, then the sort order is also in numeric order.  If the column is of string type, then
the sort order will be lexicographical order.
+ Hive uses the columns in ''SORT BY'' to sort the rows before feeding the rows to a reducer.
 The sort order will be dependent on the column types.  If the column is of numeric type,
then the sort order is also in numeric order.  If the column is of string type, then the sort
order will be lexicographical order.
  
  
  === Difference between Sort By and Order By ===
+ Hive supports ''SORT BY'' which sorts the data per reducer. The difference between "order
by" and "sort by" is that the former guarantees total order in the output while the latter
only guarantees ordering of the rows withint a reducer. If there are more than one reducer,
"sort by" may give partially ordered final results. 
- Most database systems supports ''ORDER BY'', which Hive does not support directly.
- Hive supports ''SORT BY'' which sorts the data per reducer.
  
  Basically, the data in each reducer will be sorted according to the order that the user
specified.  The following example shows 
  
@@ -38, +48 @@

  1   1
  2   5
  }}}
- 
- 
- === Simulating Order By ===
- 
- We can set the number of reducers to 1, to make sure we have the same result as ''ORDER
BY''.
- 
- {{{
- set mapred.reduce.tasks=1;
- SELECT key, value FROM src SORT BY key ASC, value DESC;
- }}}
- 
- This sometimes will make the reducer a performance bottleneck.  A lot of cases the user
only wants to see the top N rows where N is a small number.  In this case, we can use LIMIT
clause.  We don't have an example here but users are encouraged to provide one.
  
  === Setting Types for Sort By ===
  

Mime
View raw message