From: Apache Wiki
To: Apache Wiki
Date: Thu, 01 Apr 2010 23:15:58 -0000
Message-ID: <20100401231558.137.74820@eos.apache.org>
Subject: [Hadoop Wiki] Update of "Hive/LanguageManual/DDL/BucketedTables" by PaulYang

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/DDL/BucketedTables" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL/BucketedTables?action=diff&rev1=8&rev2=9

--------------------------------------------------

  SELECT userid, firstname, lastname WHERE ds='2009-02-25';
  }}}

- The command {{{set hive.enforce.bucketing = true; }}} allows the correct number of reducers and the cluster by column to be automatically selected based on the table. Otherwise, you would need to set the number of reducers to be the same as the number of buckets a la {{{set mapred.reduce.tasks = 256;}}} and have {{{CLUSTER BY ...}}} clause in the select.
+ The command {{{set hive.enforce.bucketing = true; }}} allows the correct number of reducers and the cluster by column to be automatically selected based on the table. Otherwise, you would need to set the number of reducers to be the same as the number of buckets a la {{{set mapred.reduce.tasks = 256;}}} and have a {{{CLUSTER BY ...}}} clause in the select.

- How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression {{{hash_function(bucketing_column) mod num_buckets}}}. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, {{{hash_int(i) == i}}}. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, though, distributing rows based on the hash will give you a even distribution in the buckets.
+ How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression {{{hash_function(bucketing_column) mod num_buckets}}}. (There's a 0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy: {{{hash_int(i) == i}}}. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.

- So, what can go wrong? As long as you {{{set hive.enforce.bucketing = true}}}, and use the syntax above, the tables should be populated properly. Things can go wrong if the bucketing column type is different during the insert and on read, or if you manually cluster by a value that's different from the table definition
+ So, what can go wrong? As long as you {{{set hive.enforce.bucketing = true}}} and use the syntax above, the tables should be populated properly. Things can go wrong if the bucketing column type is different during the insert and on read, or if you manually cluster by a value that's different from the table definition.
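
As a concrete illustration of the {{{hash_function(bucketing_column) mod num_buckets}}} rule described above (the numbers here are made up, not taken from the page): with an INT bucketing column and 10 buckets, a row whose user_id is 57 gives hash_int(57) = 57, and 57 mod 10 = 7, so that row lands in the eighth bucket file, consistent with the "user_id's ending in 7" intuition, since the text above counts buckets starting from 1.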
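
For readers following along, here is a minimal sketch of the two ways to populate a bucketed table that the paragraphs above describe. The table and column names ({{{user_info_bucketed}}}, {{{user_info_raw}}}, {{{userid}}}, {{{ds}}}) are illustrative placeholders, not taken from the page itself:

{{{
-- Illustrative bucketed table (names are placeholders).
CREATE TABLE user_info_bucketed (userid INT, firstname STRING, lastname STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 256 BUCKETS;

-- Option 1: let Hive pick the reducer count and the clustering column.
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds = '2009-02-25')
SELECT userid, firstname, lastname FROM user_info_raw WHERE ds = '2009-02-25';

-- Option 2: manual form, with one reducer per bucket and an explicit CLUSTER BY.
set mapred.reduce.tasks = 256;
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds = '2009-02-25')
SELECT userid, firstname, lastname FROM user_info_raw WHERE ds = '2009-02-25'
CLUSTER BY userid;
}}}

With {{{hive.enforce.bucketing = true}}}, Hive infers the 256 reducers and the clustering column from the table definition, so the two inserts should produce equivalently bucketed output.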