hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Avrilia Floratou" <flora...@cs.wisc.edu>
Subject Problem with query on bucketed table
Date Mon, 10 Oct 2011 01:43:50 GMT
Hello,

I have a question regarding the execution of some queries on bucketed tables.

I've created a compressed bucketed table using the following statement:

create external table partRC (P_PARTKEY BIGINT,P_NAME STRING, P_MFGR
STRING, P_BRAND STRING, P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING,
P_RETAILPRICE DOUBLE, P_COMMENT STRING)
CLUSTERED BY (P_PARTKEY) SORTED BY (P_PARTKEY) INTO 512 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerde'
stored as RCFILE
LOCATION '/tpch/partRC';

I inserted all the data by reading from another table (PART) stored as
textfile using the following script:

set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.output=true;


INSERT OVERWRITE TABLE partRC
select * from part
cluster by (P_PARTKEY);

I verified that 512 buckets where cretaed (512 reducers where used) and
then I ran the following selection:

select * from partRC where P_PARTKEY = 3;

I noticed that the whole table is scanned. Is this expected? I was
expecting that only the bucket that contains the value 3 would be scanned
(and actually not the whole bucket -- since it is sorted). Am I doing sth
wrong here?

Thanks,
Avrilia


Mime
View raw message