hive-user mailing list archives

From Mark Grover <mgro...@oanda.com>
Subject Re: When/how to use partitions and buckets usefully?
Date Thu, 26 Apr 2012 00:59:24 GMT
I am hoping that other people who have used Map Join can pitch in here...

When the smaller table gets loaded into the mapper's memory, the data is loaded in its
uncompressed form, right?

If so, is it possible in this case that the compressed size of the smaller table is less
than the available memory but the uncompressed size isn't?
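To illustrate why that distinction matters, here is a small sketch (hypothetical data; gzip used only as a stand-in for whatever codec the table uses): a file of repetitive rows, as table data typically is, can be far smaller on disk than in memory, so a size check against the compressed file can badly underestimate the real footprint.

```python
import gzip

# Hypothetical example: a highly repetitive "table file", like typical
# row data, shrinks dramatically under compression. A threshold checked
# against the on-disk (compressed) size can therefore pass even when the
# uncompressed data will not fit in the mapper's memory.
raw = b"12345678\t1\t1980-01-01\n" * 100_000
compressed = gzip.compress(raw)

ratio = len(raw) / len(compressed)
print(len(raw), len(compressed), round(ratio, 1))
```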

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgrover@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "Ruben de Vries" <ruben.devries@hyves.nl>
To: user@hive.apache.org
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400 MB, but it didn't work. That setting is only
used when Hive is trying to automatically figure out whether it should do a mapjoin, while
I'm forcing it to do a mapjoin with a hint.

From: gemini alex [mailto:gemini5201314@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This should be documented in the wiki, on LanguageManual+Joins.
On 25 April 2012, 3:36 PM, gemini alex <gemini5201314@gmail.com> wrote:
It seems you are using the default Hive configuration; by default, a map join allows only 25 MB
for the small table. Copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000
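For reference, the suggested change could look like this as a hive-site.xml fragment (a sketch; the property name and value are as given above):

```xml
<!-- Raise the size threshold (in bytes) below which Hive considers a
     table "small" enough for an automatic map join; 300000000 B is
     roughly 286 MB, comfortably above the ~247 MB members_map file. -->
<property>
  <name>hive.mapjoin.smalltable.filesize</name>
  <value>300000000</value>
</property>
```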
On 25 April 2012, 12:09 AM, Ruben de Vries <ruben.devries@hyves.nl> wrote:

I got the (rather big) log here in a github gist: https://gist.github.com/2480893
And I also attached the plan.xml it was using to the gist.

When loading the members_map (11 mil records, 320 MB, ~30 bytes per record), it seems to take about
198 bytes per record in the hashtable, resulting in a crash around 7 mil records with 1.4 GB loaded.
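A back-of-the-envelope check using only the figures reported in this thread confirms that an OOM around 7 million rows is exactly what one would expect:

```python
# All numbers below are taken from this thread.
records = 11_000_000           # rows in members_map
on_disk_bytes = 320 * 10**6    # ~320 MB text file
per_record_in_memory = 198     # observed hashtable cost per row, in bytes
heap_limit = 1_864_171_520     # "maximum memory" reported by the local task

print(round(on_disk_bytes / records))   # ~29 bytes/record on disk
needed = records * per_record_in_memory
print(needed)                           # ~2.18 GB needed in memory
print(needed > heap_limit)              # True: the full table cannot fit
```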

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where
 - birthday is a string containing YYYY-MM-DD
 - gender is a tinyint: 1, 2 or 3
 - member_id is int with the highest member_id being 14343249 (14mil)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}"

I also tried doing the same thing with an empty visit_stats table, with the same effect.
Some of the blogs I read talk about a 25 MB small table, not 300 MB like mine ...

Can anyone make anything out of this?
I'd rather go with this if at all possible;
otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can
match the members_map on that?

-----Original Message-----
From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
      The operation you are seeing in your log is the preparation of a hash table of the smaller
table. This hash table file is compressed, loaded into the Distributed Cache, and used from there
for map-side joins. From your console log, the hash table size has grown to nearly 1.5 GB;
that data is too large to be loaded into the memory of the Hive client.

2012-04-24 10:31:02     Processing rows: 7000000   Hashtable size: 6999999   Memory usage: 1,468,378,760   rate: 0.788


Can you enable debug logging and post the console output, to get a better picture of why it
consumes this much memory?
Start your hive shell as
hive -hiveconf hive.root.logger=ALL,console;


Regards
Bejoy KS



________________________________
From: Ruben de Vries <ruben.devries@hyves.nl>
To: "user@hive.apache.org" <user@hive.apache.org>
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
           1            1          247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
         442          441         1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory?

It seems like it's loading the visit_stats table instead ?!
I tried doing MAPJOIN(visit_stats) but it fails with a non-existing class error (my JSONSerde).


From: Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the Hive client itself, before starting a map task, so
splitting across mappers is out of the question.

Can you do a dfs count for the members_map table's HDFS location and tell us the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <ruben.devries@hyves.nl> wrote:
Hmm I must be doing something wrong,  the members_map table is 300ish MB.
When I execute the following query:

SELECT
   /*+ MAPJOIN(members_map) */
   date_int,
   members_map.gender AS gender,
   'generic',
   COUNT( memberId ) AS unique,
   SUM( `generic`['count'] ) AS count,
   SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON(members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:
2012-04-24 10:25:59     Starting to launch local task to process map join;   maximum memory = 1864171520
2012-04-24 10:26:00     Processing rows: 200000    Hashtable size: 199999    Memory usage: 43501848     rate: 0.023
2012-04-24 10:30:54     Processing rows: 6900000   Hashtable size: 6899999   Memory usage: 1449867552   rate: 0.778
2012-04-24 10:31:02     Processing rows: 7000000   Hashtable size: 6999999   Memory usage: 1468378760   rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
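As a side note, the "rate" column in the log above appears to be memory usage divided by the maximum heap reported at launch, which is why the task dies shortly after the rate passes ~0.79:

```python
# Reproduce the "rate" figure from the last log line above.
max_heap = 1_864_171_520        # "maximum memory" at task launch
usage_at_7m_rows = 1_468_378_760

print(round(usage_at_7m_rows / max_heap, 3))  # 0.788, matching the log
```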


I'm running it on my local, single-node dev environment; could that be a problem, since it
won't split over multiple mappers in this case?


-----Original Message-----
From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
The map join hint is provided to Hive using the "MAPJOIN" keyword, as in:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key

To use map-side joins, some Hive configuration properties need to be enabled.

For plain map-side joins:
hive> SET hive.auto.convert.join=true;
Recent versions of Hive do a map join on the smaller table even if no map join hint is provided.

For bucketed map joins:
hive> SET hive.optimize.bucketmapjoin=true;

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
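Putting the settings and the hint together, a session might look like this (a sketch using the table and column names from this thread):

```sql
-- Allow Hive to convert the join to a map join automatically
SET hive.auto.convert.join=true;
-- Enable bucketed map joins (requires both tables bucketed on the join key)
SET hive.optimize.bucketmapjoin=true;

-- Or force it explicitly: load members_map into memory on each mapper
SELECT /*+ MAPJOIN(members_map) */
       visit_stats.date_int,
       members_map.gender
FROM visit_stats
JOIN members_map ON (members_map.member_id = visit_stats.memberId);
```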


Regards
Bejoy


________________________________
From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, April 24, 2012 12:46 PM
Subject: Re: When/how to use partitions and buckets usefully?

If you are doing a map-side join, make sure the table members_map is small enough to hold in
memory

On 4/24/12, Ruben de Vries <ruben.devries@hyves.nl> wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
>
> Cheers,
>
> Ruben de Vries
>
> -----Original Message-----
> From: Mark Grover [mailto:mgrover@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory,
> so your joins with visit_stats would be much faster with map-side join.
>
> However, there is still some virtue in bucketing visit_stats.
> Bucketing can optimize joins, group by's and potentially other queries
> in certain circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can be leveraged in multi-table queries. Most people use
> some power of 2 as their number of buckets. To make the best use of
> the buckets, each of your buckets should be able to entirely load into
> memory on the node.
>
> I use something close to the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) /
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". A higher x means you are
> being more conservative by having a larger number of buckets (and
> bearing the increased overhead); a lower x means the reverse. What x to
> use depends on your use case. Note that the number of buckets in a table is fixed:
> if you have a large partition, its data is distributed into bulkier
> buckets, and you want to make sure these bulkier buckets can still fit
> in memory. Moreover, buckets are generated using a hashing function;
> if you have a strong bias towards a particular value of the bucketing
> column in your data, some buckets might be bulkier than others. In that
> case, too, you want to make sure that those bulkier buckets can still
> fit in memory.
>
> To summarize, it depends on:
> * How the actual partition sizes vary from the average partition size (i.e.
> the standard deviation of your partition size). A larger standard
> deviation means you should be more conservative in your calculation, and vice-versa.
> * The distribution of the data in the bucketing columns. A "wider"
> distribution means you should be more conservative, and vice-versa.
>
> Long story short, I would say an x of 2 to 4 should suffice in most
> cases, but feel free to verify that in your case :-) I would love to
> hear what factors others have been using when calculating their number of buckets, BTW!
> Whatever answer you get for #buckets from the above formula, use the
> closest power of 2 as the number of buckets in your table (I am not
> sure if this is a must, though).
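The sizing formula quoted above can be sketched in code. This is a hypothetical illustration: the function name and the choice to round *up* to a power of 2 (one conservative reading of "closest") are mine, not from the thread.

```python
import math

def num_buckets(avg_partition_size, task_jvm_memory, x=2):
    """Sketch of: #buckets = (x * avg_partition_size) / task_jvm_memory,
    rounded up to a power of 2. x > 1 is the 'factor of conservatism':
    a higher x means more, smaller buckets."""
    raw = x * avg_partition_size / task_jvm_memory
    if raw <= 1:
        return 1
    return 2 ** math.ceil(math.log2(raw))

# e.g. a 4 GB average partition with 512 MB of task JVM memory and x = 2:
print(num_buckets(4 * 2**30, 512 * 2**20))  # 16
```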
>
> Good luck!
>
> Mark
>
> Mark Grover, Business Intelligence Analyst OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgrover@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Bejoy KS" <bejoy_ks@yahoo.com>
> To: "Ruben de Vries" <ruben.devries@hyves.nl>, user@hive.apache.org
> Sent: Monday, April 23, 2012 12:39:17 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If the data is in HDFS, then you can bucket it only after loading it into a
> temp/staging table and then into the final bucketed table. Bucketing
> needs a MapReduce job.
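The staging-table route described above might look like this in HiveQL (a sketch; the table names and the bucket count are illustrative, not from the thread):

```sql
-- Final table, bucketed on the join key
CREATE TABLE members_map_bucketed (member_id INT, gender INT, birthday STRING)
CLUSTERED BY (member_id) INTO 16 BUCKETS;

-- Populating it runs a MapReduce job; enforce bucketing during the insert
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE members_map_bucketed
SELECT member_id, gender, birthday FROM members_map_staging;
```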
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> From: Ruben de Vries <ruben.devries@hyves.nl>
> Date: Mon, 23 Apr 2012 18:13:20 +0200
> To: user@hive.apache.org<user@hive.apache.org>;
> bejoy_ks@yahoo.com<bejoy_ks@yahoo.com>
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
>
> Thanks for the help so far guys,
>
>
>
> I bucketed the members_map, it's 330mb in size (11 mil records).
>
>
>
> Can you manually bucket stuff?
>
> Since my initial MapReduce job is still outside of Hive, I'm doing a
> LOAD DATA to import stuff into the visit_stats tables; replacing that
> with INSERT OVERWRITE SELECT slows it down a lot.
>
>
>
>
>
> From: Bejoy KS [mailto:bejoy_ks@yahoo.com]
> Sent: Monday, April 23, 2012 6:06 PM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
>
>
> For a bucketed map join, both tables should be bucketed, and the number
> of buckets of one should be a multiple of the other's.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
>
>
>
> From: "Bejoy KS" < bejoy_ks@yahoo.com >
>
>
> Date: Mon, 23 Apr 2012 16:03:34 +0000
>
>
> To: < user@hive.apache.org >
>
>
> ReplyTo: bejoy_ks@yahoo.com
>
>
> Subject: Re: When/how to use partitions and buckets usefully?
>
>
>
>
> A bucketed map join would be good, I guess. What is the total size of the
> smaller table, and what is its expected size in the next few years?
>
> If the size is small enough to fit in the Distributed Cache, then
> map-side joins would offer you a big performance improvement.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
>
>
>
> From: Ruben de Vries < ruben.devries@hyves.nl >
>
>
> Date: Mon, 23 Apr 2012 17:38:20 +0200
>
>
> To: user@hive.apache.org<user@hive.apache.org >
>
>
> ReplyTo: user@hive.apache.org
>
>
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
>
> Ok, very clear on the partitions: try to make them match the WHERE
> clauses, not so much the GROUP BY clauses then ;)
>
>
>
> The member_map contains 11,636,619 records atm; I think bucketing
> those would be good?
>
> What's a good number to bucket them by then?
>
>
>
> And is there any point in bucketing the visit_stats?
>
>
>
>
>
> From: Tucker, Matt [mailto:Matt.Tucker@disney.com]
> Sent: Monday, April 23, 2012 5:30 PM
> To: user@hive.apache.org
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
> If you're only interested in a certain window of dates for analysis, a
> date-based partition scheme will be helpful, as it will trim
> partitions that aren't needed by the query before execution.
>
>
>
> If the member_map table is small, you might consider testing the
> feasibility of map-side joins, as they will reduce the number of
> processing stages. If member_map is large, bucketing both tables on member_id
> will reduce the number of visit_stats rows compared against each member_id during joins.
>
>
>
>
> Matt Tucker
>
>
>
>
>
> From: Ruben de Vries [mailto:ruben.devries@hyves.nl]
> Sent: Monday, April 23, 2012 11:19 AM
> To: user@hive.apache.org
> Subject: When/how to use partitions and buckets usefully?
>
>
>
> It seems there's enough information to be found on how to setup and
> use partitions and buckets.
>
> But I'm more interested in how to figure out when and what columns you
> should be partitioning and bucketing to increase performance?!
>
>
>
> In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP
> cols which give me info about the visits) and 1 member_map (member_id,
> gender, age).
>
>
>
> Usually I group by date and then by one of the other columns, so I assume that
> partitioning on date is a good start?!
>
>
>
> It seems the join of the member_map onto the visit_stats makes the
> queries a lot slower; can that be fixed by bucketing both tables? Or just one of them?
>
>
>
>
> Maybe some people have written good blogs on this subject, but I can't
> really seem to find them!?
>
>
>
> Any help would be appreciated, thanks in advance :)
>


--
Nitin Pawar




--
Nitin Pawar


