hive-user mailing list archives

From gemini alex <gemini5201...@gmail.com>
Subject Re: Why BucketJoinMap consume too much memory
Date Fri, 06 Apr 2012 07:06:46 GMT
I guess the problem is that you can't use a <> predicate in a bucket map join; try:

select c.* from (

select /*+ MAPJOIN(b) */ a.calling calling, a.total_volume atotal_volume,
b.total_volume btotal_volume from ra_md_syn a join ra_ocs_syn b

     on (a.calling = b.calling) ) c where c.atotal_volume <> c.btotal_volume;
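
A full session combining this rewrite with the bucket map join settings used
later in the thread might look like the sketch below (untested; table and
column names are the ones from the thread, and whether the planner still
performs a bucketed map join on the rewritten query depends on the Hive
version):

hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing = true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> select c.*
    > from (
    >   select /*+ MAPJOIN(b) */ a.calling calling,
    >          a.total_volume atotal_volume,
    >          b.total_volume btotal_volume
    >   from ra_md_syn a join ra_ocs_syn b
    >     on (a.calling = b.calling)
    > ) c
    > where c.atotal_volume <> c.btotal_volume;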




On 6 April 2012 at 9:19 AM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

>  Hi Bejoy,
>
> Sorry for the late response. I will go over the setup again to clarify
> some information.
>
> I have 2 tables that are nearly identical. Both have the same table
> structure, 65m records, and 2GB of data (same size).
>
> hive> describe ra_md_syn;
>
> OK
>
> calling string
>
> total_duration  bigint
>
> total_volume    bigint
>
> total_charge    bigint
>
> Both of them are bucketed into 256 buckets on the 'calling' column (last
> time there were only 10 buckets; I increased the count as you suggested).
> I want to find all 'calling' values that exist in both tables but with a
> different 'total_volume'.
>
> The script, as you know:
>
> hive> set hive.optimize.bucketmapjoin = true;
>
> hive> set hive.enforce.bucketing=true;
>
> hive> set
> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>
> hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
>
>     > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
>
> And the result was exactly as in my last email: a Java heap space error.
> With a total size of only 2GB and 256 buckets, I don't think bucket size
> can be the issue here.
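
A DDL and load sketch that would produce tables bucketed as described above;
this is an illustration assembled from the column names in the thread, not
part of the original messages, and ra_md_syn_staging is a hypothetical
unbucketed source table:

-- Bucketed table matching the describe output above (column types from the
-- thread, everything else assumed):
create table ra_md_syn (
  calling         string,
  total_duration  bigint,
  total_volume    bigint,
  total_charge    bigint
)
clustered by (calling) into 256 buckets;

-- Populate it with bucketing enforced so the 256 buckets are really written:
set hive.enforce.bucketing = true;
insert overwrite table ra_md_syn
select calling, total_duration, total_volume, total_charge
from ra_md_syn_staging;  -- hypothetical unbucketed source table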
>
> Please give me some advice; I'd really appreciate it.
>
> *Best regards*
>
> Nguyen Thanh Binh (Mr)
>
> Cell phone: (+84)98.226.0622
>
> *From:* Bejoy Ks [mailto:bejoy_ks@yahoo.com]
> *Sent:* Thursday, April 05, 2012 7:23 PM
>
> *To:* user@hive.apache.org
> *Subject:* Re: Why BucketJoinMap consume too much memory
>
> Hi Binh
>
>     I was just checking your local map join log, and I noticed two things:
>
> - the memory usage of one hash table has gone beyond 1G.
>
> - the number of rows processed is just 2M.
>
> It is possible that each bucket itself is too large to be loaded
> in memory.
>
> As a workaround, or to confirm whether bucket size is the issue here, can
> you try increasing the number of buckets to 100 and doing a bucketed map
> join again?
>
> Also, you mentioned the data size is 2GB; is that the compressed size?
>
> 2012-04-05 10:41:07     Processing rows:        2,900,000 Hashtable size: 2899999 Memory usage:   1,062,065,576      rate:   0.76
>
> Regards
>
> Bejoy KS
>
>    ------------------------------
>
> *From:* Nitin Pawar <nitinpawar432@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Thursday, April 5, 2012 5:03 PM
> *Subject:* Re: Why BucketJoinMap consume too much memory
>
> Can you tell me the size of table b?
>
> If you are doing bucketing and table b is still huge, then it will hit
> this problem.
>
> On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:
>
> Thanks Nitin,
>
> I tried, but no luck. Here is the Hive log; please spend a little time
> looking at it.
>
> hive> set hive.optimize.bucketmapjoin = true;
>
> hive> set hive.enforce.bucketing=true;
>
> hive> set
> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>
> hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
>
>     > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
>
> Total MapReduce jobs = 1
>
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
> use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
> files.
>
> Execution log at:
> /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log
>
> 2012-04-05 10:37:45     Starting to launch local task to process map
> join;      maximum memory = 1398145024
>
> 2012-04-05 10:37:48     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75403880        rate:   0.054
> 2012-04-05 10:37:50     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111404664       rate:   0.08
> 2012-04-05 10:37:54     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151598960       rate:   0.108
> 2012-04-05 10:38:04     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185483368       rate:   0.133
> 2012-04-05 10:38:09     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221483392       rate:   0.158
> 2012-04-05 10:38:13     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257482640       rate:   0.184
> 2012-04-05 10:38:19     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297676944       rate:   0.213
> 2012-04-05 10:38:22     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333676968       rate:   0.239
> 2012-04-05 10:38:27     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369676944       rate:   0.264
> 2012-04-05 10:38:31     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405676968       rate:   0.29
> 2012-04-05 10:38:36     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441676944       rate:   0.316
> 2012-04-05 10:38:42     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477676944       rate:   0.342
> 2012-04-05 10:38:47     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513676968       rate:   0.367
> 2012-04-05 10:38:52     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549676944       rate:   0.393
> 2012-04-05 10:39:00     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602454200       rate:   0.431
> 2012-04-05 10:39:08     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630065552       rate:   0.451
> 2012-04-05 10:39:14     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666065552       rate:   0.476
> 2012-04-05 10:39:20     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702065552       rate:   0.502
> 2012-04-05 10:39:26     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738065576       rate:   0.528
> 2012-04-05 10:39:36     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774065552       rate:   0.554
> 2012-04-05 10:39:43     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810065552       rate:   0.579
> 2012-04-05 10:39:51     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846065576       rate:   0.605
> 2012-04-05 10:40:16     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882085136       rate:   0.631
> 2012-04-05 10:40:24     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918085208       rate:   0.657
> 2012-04-05 10:40:39     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954065544       rate:   0.682
> 2012-04-05 10:40:48     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990065568       rate:   0.708
> 2012-04-05 10:40:56     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1026065552      rate:   0.734
> 2012-04-05 10:41:07     Processing rows:        2900000 Hashtable size: 2899999 Memory usage:   1062065576      rate:   0.76
>
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>
> *Best regards*
>
> Nguyen Thanh Binh (Mr)
>
> Cell phone: (+84)98.226.0622
>
> *From:* Nitin Pawar [mailto:nitinpawar432@gmail.com]
> *Sent:* Thursday, April 05, 2012 5:36 PM
>
> *To:* user@hive.apache.org
> *Subject:* Re: Why BucketJoinMap consume too much memory
>
>
> Can you try adding these settings?
>
> set hive.enforce.bucketing=true;
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>
> I have tried bucketing with 1000 buckets on tables of more than 1TB of
> data, and those jobs do go through fine.
>
> On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:
>
> Hi Bejoy,
>
> Both of my tables have 65m records (~1.8-1.9GB on Hadoop) and are bucketed
> on the 'calling' column into 10 buckets.
>
> As you said, Hive will load only 1 bucket, ~180-190MB, into memory. That
> should hardly blow the heap (1.3GB).
>
> According to the wiki, I set:
>
>   set
> hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>
>   set hive.optimize.bucketmapjoin = true;
>
>   set hive.optimize.bucketmapjoin.sortedmerge = true;
>
> And ran the following SQL:
>
> select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join
> ra_ocs_cdr_ggsn_synthetic b
>
> on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
>
> But it still created many hash tables and then threw a Java heap space
> error.
>
> *Best regards*
>
> Nguyen Thanh Binh (Mr)
>
> Cell phone: (+84)98.226.0622
>
> *From:* Bejoy Ks [mailto:bejoy_ks@yahoo.com]
> *Sent:* Thursday, April 05, 2012 3:07 PM
> *To:* user@hive.apache.org
>
> *Subject:* Re: Why BucketJoinMap consume too much memory
>
>
> Hi Amit
>
>       Sorry for the delayed response, I had a terrible schedule. AFAIK,
> there are no flags that would let you move the hash table creation,
> compression and loading into tmp files away from the client node.
>
>       From my understanding, if you use a map-side join, the small table as
> a whole is converted into a hash table and compressed into a tmp file. Say
> your child JVM size is 1GB and this small table is 5GB: it'd blow up your
> job when a map task tries to pull such a huge file into memory. A bucketed
> map join can help here; if the table is bucketed, say into 100 buckets,
> then each bucket may hold around 50MB of data, i.e. one tmp file would be
> just under 50MB. Each mapper then needs to load only the required buckets
> in memory and thus hardly runs into memory issues.
>
>     Also, on the client, the records are processed bucket by bucket and
> loaded into tmp files. So if your bucket size is larger than the heap
> size specified for your client, it'd throw an out-of-memory error.
>
> Regards
>
> Bejoy KS
>
>    ------------------------------
>
> *From:* Amit Sharma <amitsharma1708@gmail.com>
> *To:* user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com>
> *Sent:* Tuesday, April 3, 2012 11:06 PM
> *Subject:* Re: Why BucketJoinMap consume too much memory
>
>
> I am experiencing similar behavior in my queries. All the conditions for
> bucketed map join are met, and the only difference in execution when I set
> the hive.optimize.bucketmapjoin flag to true is that, instead of a single
> hash table, multiple hash tables are created. All the hash tables are still
> created on the client side and loaded into tmp files, which are then
> distributed to the mappers using the distributed cache.
>
> Can I find an example anywhere that shows the behavior of bucketed map
> join where it does not create the hash tables on the client itself? If
> so, is there a flag for it?
>
> Thanks,
> Amit
>
> On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:
>
> Hi
>     At first look, it seems a plain map join is happening in your case
> rather than a bucketed map join. The following conditions need to hold for
> a bucketed map join to work:
> 1) Both tables are bucketed on the join columns
> 2) The number of buckets in each table should be a multiple of the other's
> 3) Ensure that the tables have enough buckets
>
> Note: If the data is large, say 1TB per table, and you have just a few
> buckets, say 100, each mapper may have to load 10GB or more. This would
> definitely blow your JVM. The bottom line is to ensure your mappers are not
> heavily loaded by the bucketed data distribution.
>
> Regards
> Bejoy.K.S
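
A minimal DDL sketch satisfying conditions 1 and 2 above (illustrative only:
table and column names follow the thread, and the bucket counts 256 and 128
are example values chosen to show the "multiples of each other" rule, not the
counts actually used in the thread):

-- Condition 1: both tables bucketed on the join column.
create table ra_md_cdr_ggsn_synthetic (
  calling       string,
  total_volume  bigint
)
clustered by (calling) into 256 buckets;

-- Condition 2: bucket counts must be multiples of each other.
-- 128 divides 256, so 256/128 is fine; 256/100 would not be.
create table ra_ocs_cdr_ggsn_synthetic (
  calling       string,
  total_volume  bigint
)
clustered by (calling) into 128 buckets;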
>    ------------------------------
>
> *From:* binhnt22 <Binhnt22@viettel.com.vn>
> *To:* user@hive.apache.org
> *Sent:* Saturday, March 31, 2012 6:46 AM
> *Subject:* Why BucketJoinMap consume too much memory
>
>
> I have 2 tables, each with 6 million records, clustered into 10 buckets.
>
> These tables are very simple, with 1 key column and 1 value column; all I
> want is to get the keys that exist in both tables but with different
> values.
>
> The normal join did the trick, taking only 141 secs:
>
> select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
> on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
>
> I tried to use bucket map join by setting:   *set
> hive.optimize.bucketmapjoin = true*
>
> select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join
> ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where
> a.total_volume <> b.total_volume;
>
> 2012-03-30 11:35:09     Starting to launch local task to process map
> join;      maximum memory = 1398145024
>
> 2012-03-30 11:35:12     Processing rows:        200000  Hashtable size: 199999  Memory usage:   86646704        rate:   0.062
> 2012-03-30 11:35:15     Processing rows:        300000  Hashtable size: 299999  Memory usage:   128247464       rate:   0.092
> 2012-03-30 11:35:18     Processing rows:        400000  Hashtable size: 399999  Memory usage:   174041744       rate:   0.124
> 2012-03-30 11:35:21     Processing rows:        500000  Hashtable size: 499999  Memory usage:   214140840       rate:   0.153
> 2012-03-30 11:35:25     Processing rows:        600000  Hashtable size: 599999  Memory usage:   255181504       rate:   0.183
> 2012-03-30 11:35:29     Processing rows:        700000  Hashtable size: 699999  Memory usage:   296744320       rate:   0.212
> 2012-03-30 11:35:35     Processing rows:        800000  Hashtable size: 799999  Memory usage:   342538616       rate:   0.245
> 2012-03-30 11:35:38     Processing rows:        900000  Hashtable size: 899999  Memory usage:   384138552       rate:   0.275
> 2012-03-30 11:35:45     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   425719576       rate:   0.304
> 2012-03-30 11:35:50     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   467319576       rate:   0.334
> 2012-03-30 11:35:56     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   508940504       rate:   0.364
> 2012-03-30 11:36:04     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   550521128       rate:   0.394
> 2012-03-30 11:36:09     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   592121128       rate:   0.424
> 2012-03-30 11:36:15     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   633720336       rate:   0.453
> 2012-03-30 11:36:22     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   692097568       rate:   0.495
> 2012-03-30 11:36:33     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   725308944       rate:   0.519
> 2012-03-30 11:36:40     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   766946424       rate:   0.549
> 2012-03-30 11:36:48     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   808527928       rate:   0.578
> 2012-03-30 11:36:55     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   850127928       rate:   0.608
> 2012-03-30 11:37:08     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   891708856       rate:   0.638
> 2012-03-30 11:37:16     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   933308856       rate:   0.668
> 2012-03-30 11:37:25     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   974908856       rate:   0.697
> 2012-03-30 11:37:34     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   1016529448      rate:   0.727
> 2012-03-30 11:37:43     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   1058129496      rate:   0.757
> 2012-03-30 11:37:58     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   1099708832      rate:   0.787
>
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>
> My system has 4 PCs, each with a CPU E2180, 2GB RAM and an 80GB HDD. One of
> them hosts the NameNode, JobTracker and Hive Server, and all of them run a
> DataNode and TaskTracker.
>
> On every node I set: export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB
> heap).
>
> I want to ask you experts: why does bucket map join consume so much memory?
> Am I doing something wrong, or is my configuration bad?
>
> *Best regards,*
>
>
>
>
>
> --
> Nitin Pawar
>
>
> --
> Nitin Pawar
>
>
