hive-user mailing list archives

From Abhishek <abhishek.dod...@gmail.com>
Subject Re: Performance tuning in hive
Date Fri, 28 Sep 2012 15:14:56 GMT
Hi Bejoy,

How can I use CTAS with CLUSTERED BY?

When I run a CREATE TABLE AS SELECT statement with a CLUSTERED BY clause, I get the following error:

CTAS does not support partitioning in the target table.
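
One commonly used way around this restriction, sketched here under assumptions rather than taken from this thread, is to create the bucketed table first and then fill it with INSERT ... SELECT. The table and column names (`source_table`, `target_bucketed`, `id`, `name`) and the bucket count are hypothetical, and `hive.enforce.bucketing` applies to Hive releases before 2.0.

```sql
-- Hypothetical names throughout; adjust columns and bucket count to your data.
-- Step 1: declare the bucketed target table explicitly instead of using CTAS.
CREATE TABLE target_bucketed (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 16 BUCKETS;

-- Step 2: ask Hive to honor the bucket definition when writing
-- (needed on Hive releases before 2.0).
set hive.enforce.bucketing = true;

-- Step 3: populate the table from the source query.
INSERT OVERWRITE TABLE target_bucketed
SELECT id, name
FROM source_table;
```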

Regards
Abhi

Sent from my iPhone

On Sep 28, 2012, at 5:32 AM, Bejoy KS <bejoy_ks@yahoo.com> wrote:

> Hi Abhishek
> 
> Which optimization to choose depends entirely on your queries, or on the kind of
> queries fired on those tables. Based on that, you need to bucket and index the
> tables to get better performance. From a bird's-eye view, bucketing + indexing +
> map joins would be a good combination if they suit your data set.
>  
> Regards,
> Bejoy KS
> 
> From: Abhishek <abhishek.dodda1@gmail.com>
> To: "user@hive.apache.org" <user@hive.apache.org> 
> Cc: "user@hive.apache.org" <user@hive.apache.org> 
> Sent: Friday, September 28, 2012 5:16 AM
> Subject: Re: Performance tuning in hive
> 
> Hi Bejoy,
> 
> Thanks for the reply. Can I know which of the following combinations would fetch
> better performance:
> 1) Indexing and bucketing
>        Or
> 2) Bucketing with RCFile
>      Or
> 3) SequenceFile with bucketing and indexing
>    Or
> 4) Map join with indexes
>   Or
> 
> any other combination of the options above, or of options not mentioned here?
> 
> Regards
> Abhi
> 
> Sent from my iPhone
> 
> On Sep 27, 2012, at 2:44 PM, Bejoy KS <bejoy_ks@yahoo.com> wrote:
> 
>> Hi Abhishek
>> 
>> You can have a look at join optimizations as well as group by optimizations
>> 
>> Join optimization - Based on your data sets you can go with a map-side join or a
>> bucketed map join.
>> To enable map join -> set hive.auto.convert.join = true;
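>> 
>> A minimal sketch of a map join, assuming two hypothetical tables: a large
>> `orders` table and a `customers` table small enough to fit in memory. The
>> table and column names are illustrative only.
>> 
>> ```sql
>> -- Let Hive convert eligible joins to map joins automatically.
>> set hive.auto.convert.join = true;
>> 
>> -- With the setting above, Hive may load the small table (customers)
>> -- into memory and perform the join on the map side.
>> SELECT o.order_id, c.customer_name
>> FROM orders o
>> LEFT OUTER JOIN customers c ON (o.customer_id = c.customer_id);
>> 
>> -- Alternative: request the map join explicitly with a hint,
>> -- naming the alias of the small table.
>> SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name
>> FROM orders o
>> JOIN customers c ON (o.customer_id = c.customer_id);
>> ```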
>> 
>> To enable bucketed map join -> set hive.optimize.bucketmapjoin = true; (The
>> prerequisite here is that both tables are bucketed on the join column.)
>> If the data in the buckets is sorted, then you can go with a sort merge join as well.
>> For that you need to set the following properties:
>>   set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>   set hive.optimize.bucketmapjoin = true;
>>   set hive.optimize.bucketmapjoin.sortedmerge = true;
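>> 
>> As a rough sketch of how these properties fit together, assume two
>> hypothetical source tables, `orders` and `customers`, joined on
>> `customer_id`. All table names, column names, and the bucket count are
>> assumptions; the `hive.enforce.*` settings apply to Hive releases before 2.0.
>> 
>> ```sql
>> -- Make Hive honor the bucketing and sort order when loading (pre-2.0 Hive).
>> set hive.enforce.bucketing = true;
>> set hive.enforce.sorting = true;
>> 
>> -- Both tables are bucketed and sorted on the join column,
>> -- with the same number of buckets.
>> CREATE TABLE orders_b (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
>> CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS;
>> 
>> CREATE TABLE customers_b (customer_id BIGINT, customer_name STRING)
>> CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS;
>> 
>> INSERT OVERWRITE TABLE orders_b
>> SELECT order_id, customer_id, amount FROM orders;
>> 
>> INSERT OVERWRITE TABLE customers_b
>> SELECT customer_id, customer_name FROM customers;
>> 
>> -- Properties for the sort merge bucket join, as listed above.
>> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>> set hive.optimize.bucketmapjoin = true;
>> set hive.optimize.bucketmapjoin.sortedmerge = true;
>> 
>> SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name
>> FROM orders_b o
>> JOIN customers_b c ON (o.customer_id = c.customer_id);
>> ```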
>> 
>> For details you can refer to the following URL:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>> 
>> Group By optimization - You can go ahead with a few group by optimizations as well.
>> A few pointers are here:
>> http://mail-archives.apache.org/mod_mbox/hive-user/201209.mbox/%3CB55FF166-239E-4E39-BF92-3AE59EB78A27@gmail.com%3E
>> 
>> 
>> Hive Indexes - Joins and group-bys are optimized better with buckets. Based on your
>> queries, you need to determine in advance how your tables should be bucketed.
>> Indexing also gives you a great performance advantage for queries that involve
>> GROUP BY and WHERE. Join optimization using indexes is in progress:
>> https://issues.apache.org/jira/browse/HIVE-2845
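>> 
>> A minimal sketch of creating and using a compact index, assuming a
>> hypothetical `orders` table filtered on `customer_id`. The index name, table,
>> and columns are illustrative; Hive indexes exist only in releases before 3.0.
>> 
>> ```sql
>> -- Define the index; WITH DEFERRED REBUILD creates it empty.
>> CREATE INDEX idx_orders_customer
>> ON TABLE orders (customer_id)
>> AS 'COMPACT'
>> WITH DEFERRED REBUILD;
>> 
>> -- Build (or refresh) the index data.
>> ALTER INDEX idx_orders_customer ON orders REBUILD;
>> 
>> -- Allow the optimizer to use indexes for WHERE filters.
>> set hive.optimize.index.filter = true;
>> 
>> SELECT order_id, amount
>> FROM orders
>> WHERE customer_id = 42;
>> ```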
>> 
>> 
>> RCFile or SequenceFile is a choice to be made based on the query patterns. If you
>> are querying only a few columns, then RCFile gives you a performance edge, but if the
>> queries span pretty much all columns, then use the more general SequenceFile.
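>> 
>> To illustrate the two storage choices, here is a small sketch assuming a
>> hypothetical source table `events_raw`; all table and column names are
>> assumptions.
>> 
>> ```sql
>> -- Columnar layout: suits queries that read only a few columns.
>> CREATE TABLE events_rc (user_id BIGINT, event_type STRING, payload STRING)
>> STORED AS RCFILE;
>> 
>> -- Row-oriented layout: suits queries that read most or all columns.
>> CREATE TABLE events_seq (user_id BIGINT, event_type STRING, payload STRING)
>> STORED AS SEQUENCEFILE;
>> 
>> -- Load through a query so Hive writes the data in the target format.
>> INSERT OVERWRITE TABLE events_rc
>> SELECT user_id, event_type, payload FROM events_raw;
>> 
>> INSERT OVERWRITE TABLE events_seq
>> SELECT user_id, event_type, payload FROM events_raw;
>> 
>> -- A query touching a single column, the case where RCFile tends to help.
>> SELECT event_type, COUNT(*) FROM events_rc GROUP BY event_type;
>> ```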
>> 
>>  
>> Regards,
>> Bejoy KS
>> 
>> From: Abhishek <abhishek.dodda1@gmail.com>
>> To: Hive <user@hive.apache.org> 
>> Sent: Thursday, September 27, 2012 7:03 PM
>> Subject: Performance tuning in hive
>> 
>> Hi all,
>> 
>> I am trying to improve the performance of some queries in Hive. The queries mostly
>> contain LEFT OUTER JOIN, GROUP BY, conditional checks, and UNION ALL. I have
>> overridden some properties in the Hive shell:
>> 
>> set io.sort.mb=512;
>> set io.sort.factor=100;
>> set mapred.child.java.opts=-Xmx2048m;
>> set hive.map.aggr=true;
>> set hive.exec.parallel=true;
>> set mapred.job.reuse.jvm.num.tasks=-1;
>> set mapred.map.tasks.speculative.execution=false;
>> set hive.mapred.reduce.tasks.speculative.execution=false;
>> 
>> I got some performance gain, but I still want to improve the performance of these
>> queries.
>> 
>> Which of the following gives me better performance?
>> 
>> Rcfile
>> Indexing
>> Bucketing
>> Sequence file 
>> Combination of above
>> 
>> Or
>> 
>> some configuration parameter tuning?
>> 
>> Which of the above yields the best performance?
>> 
>> Thanks in advance.
>> 
>> Regards
>> Abhi
> 
> 
