hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "BucketedTables" by AlexSmith
Date Mon, 16 Mar 2009 06:03:18 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by AlexSmith:
http://wiki.apache.org/hadoop/BucketedTables

The comment on the change is:
Initial edit.

New page:
This is a brief example of creating and populating bucketed tables.  Bucketed tables are fantastic
in that they allow much more efficient sampling than non-bucketed tables do, and they may
later allow for time-saving operations such as map-side joins.  However, the bucketing specified
at table creation is not enforced when the table is written to, so it is possible for
the table's metadata to advertise properties that the table's actual layout does not uphold.
This should obviously be avoided.  Here's how to do it right.

First there’s table creation:
{{{
CREATE TABLE user_info_bucketed(userid BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(userid) INTO 256 BUCKETS;
}}}

Then we populate this, making sure to use 256 reducers:
{{{
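-- One reducer per bucket, so each reducer writes exactly one bucket file.
set mapred.reduce.tasks = 256;
FROM (
    FROM user_info u
    -- Assumes user_info also has firstname and lastname columns.
    SELECT CAST(u.userid AS BIGINT) AS userid_bigint, u.firstname, u.lastname
    WHERE u.ds='2009-02-25'
    CLUSTER BY userid_bigint
    ) c
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT *;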
}}}
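
As a rough sanity check after the insert, you can list the partition directory from the Hive CLI; with 256 reducers you would expect to see one file per bucket.  This is only a sketch: the path below assumes the default warehouse location, so adjust it to wherever your tables actually live.
{{{
-- Hypothetical path: assumes the default warehouse directory.
dfs -ls /user/hive/warehouse/user_info_bucketed/ds=2009-02-25;
}}}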

Note that I’m clustering by the BIGINT version of userid.  Otherwise the query might cluster
by userid as a STRING (depending on the type of userid in user_info), which uses a totally
different hash.  It's important for the hash to be computed on the correct data type, since
otherwise we'll expect userids in bucket 1 to satisfy (big_hash(userid) mod 256 == 0), but
instead we'll be getting rows that satisfy (string_hash(userid) mod 256 == 0).  It's also good
form to have all of your tables use the same type (e.g., BIGINT instead of STRING), since that
way sampling from multiple tables will give you the same userids, letting you efficiently sample and join.
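
To illustrate the sampling that bucketing is meant to speed up, here is a minimal sketch using TABLESAMPLE on the table created above.  It assumes the table was populated as described, so that bucket 1 really does hold the userids whose hash is 0 mod 256; the alias s is just for illustration.
{{{
-- Read roughly 1/256th of the partition by sampling a single bucket.
SELECT s.userid, s.firstname, s.lastname
FROM user_info_bucketed TABLESAMPLE(BUCKET 1 OUT OF 256 ON userid) s
WHERE s.ds='2009-02-25';
}}}
Because the sample is taken on the same column the table is clustered by, the same BUCKET clause applied to another table bucketed on a BIGINT userid should select the same set of userids.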
