hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <mdefoinplatel....@orange.com>
Subject Optimization on bucketized/sorted tables
Date Tue, 20 Mar 2012 14:19:41 GMT
Hi folks,
I have several questions about optimization in Hive, they are mainly related to bucketized/sorted
tables.

Let say I have a table T bucketized on user_id and sorted by user_id, time.

CREATE TABLE T
( user_id BIGINT,
  time INT
)
CLUSTERED BY(user_id) SORTED BY(user_id, time) INTO 64 BUCKETS;


In a general way, I wonder which of the following operations will benefit from the fact that
 T is bucketized and sorted.


1)      Group by
SELECT user_id, count(time) FROM T GROUP BY user_id;


2)      Distribute by
SELECT user_id, time FROM T DISTRIBUTE BY user_id;


3)      Distribute by, Sort by
SELECT user_id, time FROM T DISTRIBUTE BY user_id SORT BY user_id, time;



4)      Insert into a bucketized/sorted table

CREATE TABLE T2
( user_id BIGINT,
  time INT
)
CLUSTERED BY(user_id) SORTED BY(user_id, time) INTO 64 BUCKETS;

set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE T2 SELECT T.user_id, T.time FROM T;


Finally, on a slightly more specific  topic...

Let say I want to perform the 'sessionization' on the table T  and I am planning to call a
python script to do that job.
To get a valid answer I must ensure that the data are sorted by user_id,time and that all
the data for a given user_id  are processed by a single call to my script.

I am planning to run the following query:
FROM (SELECT user_Id, time FROM T  DISTRIBUTE BY user_id SORT BY user_id, time) s SELECT TRANSFORM
(s.user_id, s.time)  USING 'python session.py'  AS user_id, avg_session, nb_session;

So I wonder first if this is the correct  approach and second if the 'DISTRIBUTE BY user_id
SORT BY user_id, time' clauses are required knowing that T is already bucketized and sorted
on the right columns.

Many thanks in advance for your help,
Michael


_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees
et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par
erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant
susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete altere, deforme ou
falsifie. Merci.

This message and its attachments may contain confidential or privileged information that may
be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message
and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages that have been
modified, changed or falsified.
Thank you.


Mime
View raw message