hive-user mailing list archives

From Mark Grover <mgro...@oanda.com>
Subject Re: When/how to use partitions and buckets usefully?
Date Tue, 24 Apr 2012 01:16:39 GMT
Hi Ruben,
Like Bejoy pointed out, members_map is small enough to fit in memory, so your joins with visit_stats would be much faster with a map-side join.
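Something along these lines would do it (just a sketch: I'm assuming a date column I'll call dt on visit_stats plus the member_id/gender columns you described, and the GROUP BY is only there to make it a realistic query):

  -- let Hive convert the join to a map join when the small table fits in memory
  SET hive.auto.convert.join=true;

  -- or ask for it explicitly, keeping the small table (alias m) in memory
  SELECT /*+ MAPJOIN(m) */ v.dt, m.gender, COUNT(*)
  FROM visit_stats v
  JOIN members_map m ON (v.member_id = m.member_id)
  GROUP BY v.dt, m.gender;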

However, there is still some virtue in bucketing visit_stats. Bucketing can optimize joins, group bys and potentially other queries in certain circumstances.
You probably want to keep consistent bucketing columns across all your tables so they can be leveraged in multi-table queries. Most people use some power of 2 as their number of buckets.
To make the best use of the buckets, each bucket should be able to load entirely into memory on the node.
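For example (a sketch only; I'm guessing at the column names and types from your description, and 32 buckets is a placeholder, not a recommendation):

  CREATE TABLE members_map (
    member_id BIGINT,
    gender    STRING,
    age       INT
  )
  CLUSTERED BY (member_id) INTO 32 BUCKETS;

  CREATE TABLE visit_stats (
    member_id BIGINT,
    visits    MAP<STRING, INT>
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (member_id) INTO 32 BUCKETS;

Both tables share member_id as the bucketing column, so a join on member_id can take advantage of the buckets.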

I use something close to the formula below to calculate the number of buckets:

#buckets = (x * Average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode

I call x (>1) the "factor of conservatism". A higher x means you are being more conservative by having a larger number of buckets (and bearing the increased overhead); a lower x means the reverse. What x to use would depend on your use case. This matters because the number of buckets in a table is fixed: if you have a large partition, its data gets distributed into bulkier buckets, and you want to make sure those bulkier buckets can still fit in memory. Moreover, buckets are assigned by a hashing function, so if your data has a strong bias towards a particular value of the bucketing column, some buckets might be bulkier than others. In that case too, you'd want to make sure that those bulkier buckets can still fit in memory.

To summarize, it depends on:
* How the actual partition sizes vary from the average partition size (i.e. the standard deviation of your partition sizes). A larger standard deviation means you should be more conservative in your calculation, and vice versa.
* The distribution of the data in the bucketing columns. A more skewed ("wider") distribution means you should be more conservative, and vice versa.

Long story short, I would say an x of 2 to 4 should suffice in most cases, but feel free to verify that in your case :-) I would love to hear what factors others have been using when calculating their number of buckets, BTW!
Whatever answer you get for #buckets from the above formula, use the closest power of 2 as the number of buckets in your table (I am not sure if this is a must, though).
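Just to make that concrete with made-up numbers (purely illustrative, not taken from your data): with an average partition size of 4 GB, 1 GB of JVM heap available to each task, and x = 2, the formula gives #buckets = (2 * 4 GB) / 1 GB = 8, which is already a power of 2. If it had come out to, say, 11, I'd round to the closest power of 2, i.e. 8 (or 16 to stay conservative).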

Good luck!

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgrover@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "Bejoy KS" <bejoy_ks@yahoo.com>
To: "Ruben de Vries" <ruben.devries@hyves.nl>, user@hive.apache.org
Sent: Monday, April 23, 2012 12:39:17 PM
Subject: Re: When/how to use partitions and buckets usefully?

If the data is in HDFS, then you can bucket it only after loading it into a temp/staging table and then into the final bucketed table. Bucketing needs a MapReduce job.
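A rough sketch of that two-step load (the staging table, its columns, the path and the partition value are all placeholders; hive.enforce.bucketing makes Hive use one reducer per bucket so the output files land in the right buckets):

  CREATE TABLE visit_stats_staging (
    member_id BIGINT,
    visits    MAP<STRING, INT>
  );

  LOAD DATA INPATH '/staging/visit_stats/2012-04-23' INTO TABLE visit_stats_staging;

  SET hive.enforce.bucketing=true;
  INSERT OVERWRITE TABLE visit_stats PARTITION (dt='2012-04-23')
  SELECT member_id, visits FROM visit_stats_staging;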


Regards 
Bejoy KS 

Sent from handheld, please excuse typos. 

From: Ruben de Vries <ruben.devries@hyves.nl> 
Date: Mon, 23 Apr 2012 18:13:20 +0200 
To: user@hive.apache.org; bejoy_ks@yahoo.com

Subject: RE: When/how to use partitions and buckets usefully?

Thanks for the help so far guys,

I bucketed the members_map, it's 330 MB in size (11 million records).

Can you manually bucket stuff?

Since my initial MapReduce job is still outside of Hive I'm doing a LOAD DATA to import stuff into the visit_stats tables; replacing that with INSERT OVERWRITE SELECT slows it down a lot.

From: Bejoy KS [mailto:bejoy_ks@yahoo.com] 
Sent: Monday, April 23, 2012 6:06 PM 
To: user@hive.apache.org 
Subject: Re: When/how to use partitions and buckets usefully? 



For bucketed map join, both tables should be bucketed and the number of buckets of one should be a multiple of the other.
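If both tables are bucketed on member_id with compatible bucket counts, the join can then run as a bucketed map join (a sketch; the column names are assumed, and the setting has to be switched on since it is off by default):

  SET hive.optimize.bucketmapjoin=true;

  SELECT /*+ MAPJOIN(m) */ v.dt, m.age, COUNT(*)
  FROM visit_stats v
  JOIN members_map m ON (v.member_id = m.member_id)
  GROUP BY v.dt, m.age;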


Regards 
Bejoy KS 

Sent from handheld, please excuse typos. 




From: "Bejoy KS" <bejoy_ks@yahoo.com>
Date: Mon, 23 Apr 2012 16:03:34 +0000
To: user@hive.apache.org
ReplyTo: bejoy_ks@yahoo.com
Subject: Re: When/how to use partitions and buckets usefully?

Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years?

If the size is small enough to be put in the Distributed Cache, then map-side joins would offer you a big performance improvement.


Regards 
Bejoy KS 

Sent from handheld, please excuse typos. 




From: Ruben de Vries <ruben.devries@hyves.nl>
Date: Mon, 23 Apr 2012 17:38:20 +0200
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much the GROUP BY clauses then ;)

The member_map contains 11,636,619 records atm, I think bucketing those would be good?

What's a good number to bucket them by then?

And is there any point in bucketing the visit_stats?

From: Tucker, Matt [mailto:Matt.Tucker@disney.com] 
Sent: Monday, April 23, 2012 5:30 PM 
To: user@hive.apache.org 
Subject: RE: When/how to use partitions and buckets usefully? 



If you’re only interested in a certain window of dates for analysis, a date-based partition
scheme will be helpful, as it will trim partitions that aren’t needed by the query before
execution. 
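For instance (a sketch, assuming the date partition column is a string named dt), a query like this only reads the April partitions and skips the rest of the table:

  SELECT dt, COUNT(*)
  FROM visit_stats
  WHERE dt >= '2012-04-01' AND dt <= '2012-04-30'
  GROUP BY dt;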



If the member_map table is small, you might consider testing the feasibility of map-side joins, as they reduce the number of processing stages. If member_map is large, bucketing on member_id means fewer visit_stats rows have to be compared against each member_id during the join.




Matt Tucker 





From: Ruben de Vries [mailto:ruben.devries@hyves.nl] 
Sent: Monday, April 23, 2012 11:19 AM 
To: user@hive.apache.org 
Subject: When/how to use partitions and buckets usefully? 



It seems there's enough information to be found on how to set up and use partitions and buckets.

But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing on to increase performance?!

In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age).

Usually I group by date and then one of the other cols, so I assume that partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them?

Maybe some ppl have written good blogs on this subject but I can't really seem to find them!?

Any help would be appreciated, thanks in advance :)
