hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/DDL/BucketedTables" by PaulYang
Date Thu, 01 Apr 2010 23:15:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/DDL/BucketedTables" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables?action=diff&rev1=8&rev2=9

--------------------------------------------------

  SELECT userid, firstname, lastname WHERE ds='2009-02-25';
  }}}
  
The command {{{set hive.enforce.bucketing = true;}}} allows the correct number of reducers
and the CLUSTER BY column to be selected automatically based on the table definition. Otherwise,
you would need to set the number of reducers to match the number of buckets, e.g. {{{set
mapred.reduce.tasks = 256;}}}, and add a {{{CLUSTER BY ...}}} clause to the SELECT.
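As a sketch of the two approaches (the staging and target table names here are hypothetical, not from this page), the inserts might look like:

{{{
-- With enforcement, Hive picks the reducer count and clustering itself:
set hive.enforce.bucketing = true;
FROM user_staging
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25';

-- Without it, both must be supplied manually
-- (256 matching the table's declared bucket count):
set mapred.reduce.tasks = 256;
FROM user_staging
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25'
CLUSTER BY userid;
}}}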
  
How does Hive distribute the rows across the buckets? In general, the bucket number is determined
by the expression {{{hash_function(bucketing_column) mod num_buckets}}}. (There's a 0x7FFFFFFF
mask in there too, but that's not that important.) The hash_function depends on the type of the
bucketing column. For an int, it's easy: {{{hash_int(i) == i}}}. For example, if user_id were
an int and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket
1, all user_id's that end in 1 to be in bucket 2, etc. For other datatypes, it's a little
tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of
a string or a complex datatype will be some number derived from the value, but not
anything humanly recognizable. For example, if user_id were a STRING, then the user_id's in
bucket 1 would probably not end in 0. In general, distributing rows based on the hash will
give you an even distribution across the buckets.
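To see the expression in action, you can compute it directly in a query using Hive's built-in {{{hash}}} UDF (the table name here is hypothetical, and 10 stands in for the table's bucket count):

{{{
-- (hash & 0x7FFFFFFF) mod num_buckets, as described above:
SELECT userid, (hash(userid) & 2147483647) % 10 AS bucket_number
FROM user_info_bucketed
LIMIT 10;
}}}

For an int userid, {{{hash(userid)}}} is just userid, so bucket_number tracks the last digit; for a STRING userid it will be some opaque but deterministic value.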
  
So, what can go wrong? As long as you {{{set hive.enforce.bucketing = true}}} and use the
syntax above, the tables should be populated properly. Things can go wrong if the bucketing
column has a different type during the insert than on read, or if you manually cluster by a
value that's different from the table definition.
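A sketch of that second failure mode (hypothetical tables, for illustration only): suppose the table was declared {{{CLUSTERED BY (userid)}}}, but a manual insert clusters by a different column. The files are then bucketed by the wrong value, and anything that relies on the declared bucketing (e.g. bucketed sampling) will read the wrong rows:

{{{
set mapred.reduce.tasks = 256;
FROM user_staging
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25'
CLUSTER BY firstname;   -- wrong: the table definition buckets by userid
}}}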
  
