I am experiencing similar behavior in my queries. All the conditions for
bucketed map join are met, and the only difference in execution when i set
the hive.optimize.bucketmapjoin flag to true, is that instead of a single
hash table, multiple hash tables are created. All the Hash Tables are still
created on the client side and loaded into tmp files, which are then
distributed to the mappers using distributed cache.
Can i find any example anywhere, which shows behavior of bucketed map join,
where in it does not create the has tables on the client itself? If so, is
there a flag for it?
Thanks,
Amit
On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:
> Hi
> On a first look, it seems like map join is happening in your case
> other than bucketed map join. The following conditions need to hold for
> bucketed map join to work
> 1) Both the tables are bucketed on the join columns
> 2) The number of buckets in each table should be multiples of each other
> 3) Ensure that the table has enough number of buckets
>
> Note: If the data is large say 1TB(per table) and if you have just a few
> buckets say 100 buckets, each mapper may have to load 10GB>. This would
> definitely blow your jvm . Bottom line is ensure your mappers are not
> heavily loaded with the bucketed data distribution.
>
> Regards
> Bejoy.K.S
> 
> *From:* binhnt22 <Binhnt22@viettel.com.vn>
> *To:* user@hive.apache.org
> *Sent:* Saturday, March 31, 2012 6:46 AM
> *Subject:* Why BucketJoinMap consume too much memory
>
> I have 2 table, each has 6 million records and clustered into 10
> buckets
>
>
>
> These tables are very simple with 1 key column and 1 value column, all I
> want is getting the key that exists in both table but different value.
>
>
>
> The normal did the trick, took only 141 secs.
>
>
>
> select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
> on (a.calling = b.calling) where a.total_volume <> b.total_volume;
>
>
>
> I tried to use bucket join map by setting: *set
> hive.optimize.bucketmapjoin = true*
>
>
>
> select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join
> ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where
> a.total_volume <> b.total_volume;
>
>
>
> 20120330 11:35:09 Starting to launch local task to process map
> join; maximum memory = 1398145024
>
> 20120330 11:35:12 Processing rows: 200000 Hashtable size:
> 199999 Memory usage: 86646704 rate: 0.062
>
> 20120330 11:35:15 Processing rows: 300000 Hashtable size:
> 299999 Memory usage: 128247464 rate: 0.092
>
> 20120330 11:35:18 Processing rows: 400000 Hashtable size:
> 399999 Memory usage: 174041744 rate: 0.124
>
> 20120330 11:35:21 Processing rows: 500000 Hashtable size:
> 499999 Memory usage: 214140840 rate: 0.153
>
> 20120330 11:35:25 Processing rows: 600000 Hashtable size:
> 599999 Memory usage: 255181504 rate: 0.183
>
> 20120330 11:35:29 Processing rows: 700000 Hashtable size:
> 699999 Memory usage: 296744320 rate: 0.212
>
> 20120330 11:35:35 Processing rows: 800000 Hashtable size:
> 799999 Memory usage: 342538616 rate: 0.245
>
> 20120330 11:35:38 Processing rows: 900000 Hashtable size:
> 899999 Memory usage: 384138552 rate: 0.275
>
> 20120330 11:35:45 Processing rows: 1000000 Hashtable size:
> 999999 Memory usage: 425719576 rate: 0.304
>
> 20120330 11:35:50 Processing rows: 1100000 Hashtable size:
> 1099999 Memory usage: 467319576 rate: 0.334
>
> 20120330 11:35:56 Processing rows: 1200000 Hashtable size:
> 1199999 Memory usage: 508940504 rate: 0.364
>
> 20120330 11:36:04 Processing rows: 1300000 Hashtable size:
> 1299999 Memory usage: 550521128 rate: 0.394
>
> 20120330 11:36:09 Processing rows: 1400000 Hashtable size:
> 1399999 Memory usage: 592121128 rate: 0.424
>
> 20120330 11:36:15 Processing rows: 1500000 Hashtable size:
> 1499999 Memory usage: 633720336 rate: 0.453
>
> 20120330 11:36:22 Processing rows: 1600000 Hashtable size:
> 1599999 Memory usage: 692097568 rate: 0.495
>
> 20120330 11:36:33 Processing rows: 1700000 Hashtable size:
> 1699999 Memory usage: 725308944 rate: 0.519
>
> 20120330 11:36:40 Processing rows: 1800000 Hashtable size:
> 1799999 Memory usage: 766946424 rate: 0.549
>
> 20120330 11:36:48 Processing rows: 1900000 Hashtable size:
> 1899999 Memory usage: 808527928 rate: 0.578
>
> 20120330 11:36:55 Processing rows: 2000000 Hashtable size:
> 1999999 Memory usage: 850127928 rate: 0.608
>
> 20120330 11:37:08 Processing rows: 2100000 Hashtable size:
> 2099999 Memory usage: 891708856 rate: 0.638
>
> 20120330 11:37:16 Processing rows: 2200000 Hashtable size:
> 2199999 Memory usage: 933308856 rate: 0.668
>
> 20120330 11:37:25 Processing rows: 2300000 Hashtable size:
> 2299999 Memory usage: 974908856 rate: 0.697
>
> 20120330 11:37:34 Processing rows: 2400000 Hashtable size:
> 2399999 Memory usage: 1016529448 rate: 0.727
>
> 20120330 11:37:43 Processing rows: 2500000 Hashtable size:
> 2499999 Memory usage: 1058129496 rate: 0.757
>
> 20120330 11:37:58 Processing rows: 2600000 Hashtable size:
> 2599999 Memory usage: 1099708832 rate: 0.787
>
> Exception in thread "Thread1" java.lang.OutOfMemoryError: Java heap space
>
>
>
> My system has 4 PC, each has CPU E2180, 2GB ram, 80GB HDD, one of them
> containts NameNode, JobTracker, Hive Server and all of them contain
> DataNode, TaskTracker
>
>
>
> In all node, I set: export HADOOP_HEAPSIZE=1500 in hadoopenv.sh (~ 1.3GB
> heap)
>
>
>
> I want to ask you experts, why bucket join map consume too much memory? Am
> I wrong or my configuration is bad?
>
>
>
> *Best regards,*
>
>
>
>
>
