Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by AMammenT:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Sampling
The comment on the change is:
Clear up confusion around cluster vs. bucket and how they interact.
------------------------------------------------------------------------------
So in the above example, if table 'source' was created with 'CLUSTERED BY id INTO 32 BUCKETS'
{{{
- TABLESAMPLE(BUCKET 3 OUT OF 16)
+ TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)
}}}
- would pick out the 3rd and 19th buckets.
+ would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.
On the other hand, the tablesample clause
{{{
TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)
}}}
- would pick out half of the 3rd bucket.
+ would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.
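
The cluster/bucket arithmetic in the two examples above can be sketched in a few lines of Python. This is only an illustration of the counting described on the page, not Hive's implementation: the function name `sampled_clusters` and its parameters are hypothetical, clusters are numbered from 1 as in the text, and the sketch assumes the sampling denominator either divides the cluster count or is a multiple of it.

```python
from fractions import Fraction


def sampled_clusters(x, y, num_clusters=32):
    """Sketch of which clusters TABLESAMPLE(BUCKET x OUT OF y ON id) touches.

    Hypothetical helper, assuming the table was created with
    CLUSTERED BY id INTO num_clusters BUCKETS. Returns a tuple
    (cluster_numbers, fraction_of_each_cluster_read).
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if num_clusters % y == 0:
        # Each sample bucket is made of num_clusters / y whole clusters:
        # x, x + y, x + 2y, ...
        return list(range(x, num_clusters + 1, y)), Fraction(1)
    if y % num_clusters == 0:
        # Each sample bucket is only a slice (num_clusters / y) of one cluster.
        cluster = (x - 1) % num_clusters + 1
        return [cluster], Fraction(num_clusters, y)
    raise ValueError("this sketch only handles y dividing, or a multiple of, num_clusters")


# The two cases from the text, for a table with 32 clusters:
print(sampled_clusters(3, 16))  # clusters 3 and 19, read in full
print(sampled_clusters(3, 64))  # half of cluster 3
```

The first branch corresponds to (32/16)=2 clusters per bucket, the second to (32/64)=1/2 of a cluster per bucket.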