hive-user mailing list archives

From Bejoy Ks <bejoy...@yahoo.com>
Subject Re: Problem with query on bucketed table
Date Mon, 10 Oct 2011 04:39:52 GMT
Hi Avrilia,
           In your SELECT query you need to append a TABLESAMPLE(BUCKET x OUT OF y)
clause; it tells Hive which buckets to scan and fetch data from. Only in the case of
partitions does a SELECT (on a partition) without any hint scan just that partition. With
SAMPLING/BUCKETS you need to tell the query which bucket(s) to scan. You can adjust the
values of 'x' and 'y' to scan more than one bucket in a query.
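For example, a sketch against your partRC table (illustrative only; the exact bucket a
given key lands in depends on Hive's hash of that key modulo the bucket count):

```sql
-- Scan a quarter of the buckets (128 of the 512), hashing on the
-- bucketing column P_PARTKEY, instead of scanning the full table.
-- Useful for approximate aggregates over a sample of the data.
SELECT AVG(P_RETAILPRICE)
FROM partRC TABLESAMPLE(BUCKET 1 OUT OF 4 ON P_PARTKEY) s;
```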
       Just a suggestion from my end: from my understanding of the requirement you put
forth, I feel PARTITIONS would suit you better than SAMPLING, since your query is centered
on a particular value of a column. SAMPLING is good when you want to run some operation
on a small sample of the whole data; say you need an approximate average, then you can use
sampling to avoid scanning the whole table. With partitions, use Dynamic Partitions to
load data from the source table into the target table's partitions on the fly.
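A rough sketch of the dynamic-partition approach (the table name partByBrand and the
choice of P_BRAND as the partition column are just illustrative assumptions, not from
your schema requirements):

```sql
-- Enable dynamic partition inserts.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical target table partitioned on P_BRAND.
CREATE TABLE partByBrand (P_PARTKEY BIGINT, P_NAME STRING)
PARTITIONED BY (P_BRAND STRING)
STORED AS RCFILE;

-- Partitions are created on the fly from the column values.
INSERT OVERWRITE TABLE partByBrand PARTITION (P_BRAND)
SELECT P_PARTKEY, P_NAME, P_BRAND FROM part;

-- A query filtering on the partition column scans only that
-- partition's directory, with no sampling hint needed.
SELECT * FROM partByBrand WHERE P_BRAND = 'Brand#13';
```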


Hope it helps!

Regards
Bejoy.K.S



________________________________
From: Avrilia Floratou <floratou@cs.wisc.edu>
To: user@hive.apache.org
Sent: Monday, October 10, 2011 7:13 AM
Subject: Problem with query on bucketed table

Hello,

I have a question regarding the execution of some queries on bucketed tables.

I've created a compressed bucketed table using the following statement:

create external table partRC (P_PARTKEY BIGINT,P_NAME STRING, P_MFGR
STRING, P_BRAND STRING, P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING,
P_RETAILPRICE DOUBLE, P_COMMENT STRING)
CLUSTERED BY (P_PARTKEY) SORTED BY (P_PARTKEY) INTO 512 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerde'
stored as RCFILE
LOCATION '/tpch/partRC';

I inserted all the data by reading from another table (PART) stored as
textfile using the following script:

set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.output=true;


INSERT OVERWRITE TABLE partRC
select * from part
cluster by (P_PARTKEY);

I verified that 512 buckets were created (512 reducers were used) and
then I ran the following selection:

select * from partRC where P_PARTKEY = 3;

I noticed that the whole table is scanned. Is this expected? I was
expecting that only the bucket that contains the value 3 would be scanned
(and actually not even the whole bucket, since it is sorted). Am I doing
something wrong here?

Thanks,
Avrilia