hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/LanguageManual/Sampling" by RaghothamMurthy
Date Wed, 21 Jan 2009 23:26:30 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Sampling

New page:
Syntax:
{{{
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
}}}

The TABLESAMPLE clause allows the users to write queries for samples of the data instead of
the whole table. The TABLESAMPLE clause can be added to any table in the FROM clause. The
buckets are numbered starting from 0. '''colname''' indicates the column on which to sample
each row in the table. colname can be one of the non-partition columns in the table or '''rand()'''
indicating sampling on the entire row instead of an individual column. The rows of the table
are 'bucketed' on the colname randomly into y buckets numbered 0 through y. Rows which belong
to bucket x are returned.  

In the following example the 3rd bucket out of the 32 buckets of the table source. 's' is
the table alias.
{{{
SELECT * 
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; 
}}}

'''Input pruning''': Typically, TABLESAMPLE will scan the entire table and fetch the sample.
But, that is not very efficient. Instead, the table can be  created with a CLUSTERED BY clause
which indicates the set of columns on which the table is hash-partitioned/clustered on. If
the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause,
TABLESAMPLE scans only the required hash-partitions of the table.

Example:

So in the above example, if table 'source' was created with 'CLUSTERED BY id INTO 32 BUCKETS'

{{{
    TABLESAMPLE(BUCKET 3 OUT OF 16) 
}}}
would pick out the 3rd and 19th buckets. 

On the other hand the tablesample clause
{{{
    TABLESAMPLE(BUCKET 3 OUT OF 64 ON id) 
}}}
would pick out half of the 3rd bucket. 

Mime
View raw message