hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/LanguageManual/DDL/BucketedTables" by ZhengShao
Date Tue, 12 Jan 2010 02:31:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/DDL/BucketedTables" page has been changed by ZhengShao.
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables?action=diff&rev1=6&rev2=7

--------------------------------------------------

  set mapred.reduce.tasks = 256;
  FROM (
      FROM user_info u
-     SELECT CAST(userid AS BIGINT) AS userid_bigint, firstname, lastname
+     SELECT CAST(userid AS BIGINT) % 256 AS bucket_id, userid, firstname, lastname
      WHERE u.ds='2009-02-25'
-     CLUSTER BY userid_bigint
+     CLUSTER BY bucket_id
      ) c
  INSERT OVERWRITE TABLE user_info_bucketed
  PARTITION (ds='2009-02-25')
- SELECT userid_bigint, firstname, lastname;
+ SELECT userid, firstname, lastname;
  }}}
  
  Note that I'm clustering by the integer version of userid.  Otherwise the query might cluster
by userid as a STRING (depending on the type of userid in user_info), which uses a completely
different hash.  The hash function must be applied to the correct data type: otherwise
we'll expect userids in bucket 1 to satisfy (big_hash(userid) mod 256 == 0), but instead
we'll get rows that satisfy (string_hash(userid) mod 256 == 0).  It's also good form to have
all of your tables use the same type (e.g., BIGINT instead of STRING), since sampling
from multiple tables will then give you the same userids, so joins can sample and join efficiently.
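
To make the type mismatch concrete, here is a small Python sketch of how the same userid can land in different buckets depending on whether it is hashed as an integer or as a string. The functions big_hash and string_hash are hypothetical stand-ins modeled on Java's Long.hashCode and String.hashCode; that choice is an assumption made for illustration, not a statement of Hive's exact hashing code.

```python
NUM_BUCKETS = 256  # matches mapred.reduce.tasks in the query above


def big_hash(value: int) -> int:
    """Assumed integer hash, modeled on Java's Long.hashCode:
    XOR the high 32 bits into the low 32 bits."""
    v = value & 0xFFFFFFFFFFFFFFFF
    return (v ^ (v >> 32)) & 0xFFFFFFFF


def string_hash(text: str) -> int:
    """Assumed string hash, modeled on Java's String.hashCode:
    h = 31 * h + character code, truncated to 32 bits."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket(hash_value: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a 32-bit hash to a bucket index (sign bit cleared first)."""
    return (hash_value & 0x7FFFFFFF) % num_buckets


userid = 1000
bucket_as_bigint = bucket(big_hash(userid))          # 1000 % 256 = 232
bucket_as_string = bucket(string_hash(str(userid)))  # 1507423 % 256 = 95

print(bucket_as_bigint, bucket_as_string)
```

Under these assumptions, userid 1000 goes to bucket 232 when hashed as an integer and to bucket 95 when hashed as the string "1000", so a sample taken with one type would look for the row in the wrong bucket of a table written with the other.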
