hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/SortBy" by Ning Zhang
Date Thu, 02 Sep 2010 18:34:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/SortBy" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy?action=diff&rev1=7&rev2=8

--------------------------------------------------

  
  ''Cluster By'' is a short-cut for both ''Distribute By'' and ''Sort By''.
  
- Hive uses the columns in ''Distribute By'' to distribute the rows among reducers.  All rows
with the same ''Distribute By'' columns will go to the same reducer.
+ Hive uses the columns in ''Distribute By'' to distribute the rows among reducers.  All rows
with the same ''Distribute By'' columns will go to the same reducer. However, ''Distribute
By'' does not guarantee clustering or sorting properties on the distributed keys. For example,
suppose we distribute 5 rows to 2 reducers by column x, whose values are x1, x1, x2, x3, and x4.
Reducer 1 might receive x1, x2, x1, and reducer 2 might receive x3 and x4. All rows with the
same key x1 are guaranteed to go to the same reducer (reducer 1 in this case), but they are
not guaranteed to be adjacent to one another in the order that reducer receives them.
  
  Instead of specifying ''Cluster By'', the user can specify ''Distribute By'' and ''Sort
By'', so the partition columns and sort columns can be different. The usual case is that the
partition columns are a prefix of sort columns, but that is not required.
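
The behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Hive's implementation: the function names distribute_by, sort_by and is_clustered are hypothetical, and zlib.crc32 is used as a deterministic stand-in for whatever partitioning function Hive actually applies, so the reducer each key lands on here need not match the example in the text.

```python
import zlib
from collections import defaultdict


def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer picked from a hash of its key.

    Rows keep their arrival order inside each reducer; nothing is sorted.
    """
    reducers = defaultdict(list)
    for row in rows:
        # crc32 is an assumed stand-in for Hive's partitioner (deterministic across runs).
        idx = zlib.crc32(str(key(row)).encode("utf-8")) % num_reducers
        reducers[idx].append(row)
    return dict(reducers)


def sort_by(reducers, key):
    """Sort the rows of each reducer independently, as Sort By does per reducer."""
    return {idx: sorted(rows, key=key) for idx, rows in reducers.items()}


def is_clustered(rows, key):
    """Return True if rows with equal keys are adjacent to one another."""
    seen = set()
    prev = object()  # sentinel that compares unequal to every real key
    for row in rows:
        k = key(row)
        if k != prev and k in seen:
            return False  # this key appeared earlier, separated by other keys
        seen.add(k)
        prev = k
    return True


# Five rows arriving in the order x1, x2, x1, x3, x4, split across two reducers.
rows = ["x1", "x2", "x1", "x3", "x4"]
distributed = distribute_by(rows, key=lambda r: r, num_reducers=2)
clustered = sort_by(distributed, key=lambda r: r)

for idx in sorted(distributed):
    print(idx, distributed[idx], "clustered:", is_clustered(distributed[idx], key=lambda r: r))
```

In this sketch, distribute_by alone only guarantees that equal keys share a reducer. Applying sort_by on the same key afterwards makes equal keys adjacent within each reducer, which corresponds to the combined effect the page attributes to ''Cluster By''.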
  
