hive-user mailing list archives

From "Ladda, Anand" <lan...@microstrategy.com>
Subject Block Sampling
Date Fri, 15 Jun 2012 18:59:18 GMT
Has the block sampling feature been added to one of the latest releases (Hive 0.8 or Hive 0.9)?
The wiki has the blurb below on block sampling:
Block Sampling
It is a feature that is still on trunk and is not yet in any release version.
block_sample: TABLESAMPLE (n PERCENT)
This will allow Hive to pick up at least n% of the data by size (note that this does not
necessarily mean n% of the rows) as input. Only CombineHiveInputFormat is supported, and some
special compression formats are not handled. If sampling fails, the input of the MapReduce job
will be the whole table/partition. Sampling is done at the HDFS block level, so the sampling
granularity is the block size. For example, if the block size is 256MB, even if n% of the input
size is only 100MB, you get 256MB of data.
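The block-size example above can be written out as a rough formula. This is only a simplified model and not taken from the Hive source: it assumes whole blocks are read until the n% target is reached, with S as the total input size and B as the HDFS block size (both symbols are introduced here for illustration).

```latex
\text{data read} \approx \left\lceil \frac{(n/100)\, S}{B} \right\rceil \cdot B
\qquad\text{e.g.}\qquad
\left\lceil \frac{100\,\text{MB}}{256\,\text{MB}} \right\rceil \cdot 256\,\text{MB}
= 1 \cdot 256\,\text{MB} = 256\,\text{MB}
```

Under the same assumption, a 300MB target with 256MB blocks would read two blocks, that is 512MB.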
In the following example, 0.1% or more of the input size will be used for the query.
SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;
Sometimes you want to sample the same data with different blocks. To do so, change this seed
number:
set hive.sample.seednumber=<INTEGER>;
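Put together, a session might look like the sketch below. It assumes a build that actually includes block sampling (trunk, per the blurb) and reuses the `source` table from the example above. The seed values 7 and 11 are arbitrary placeholders, and setting `hive.input.format` explicitly is a precaution here, since the blurb says only CombineHiveInputFormat is supported.

```sql
-- Block sampling only works with CombineHiveInputFormat, so select it explicitly.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- First sample: at least 0.1% of the input size, chosen using seed 7 (arbitrary).
set hive.sample.seednumber=7;
SELECT count(*) FROM source TABLESAMPLE(0.1 PERCENT) s;

-- Changing the seed should select a different set of blocks from the same table.
set hive.sample.seednumber=11;
SELECT count(*) FROM source TABLESAMPLE(0.1 PERCENT) s;
```

If the two counts are identical and equal to the full table's row count, sampling has probably fallen back to the whole table/partition as described above.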

