hive-user mailing list archives

From Bejoy Ks <bejoy...@yahoo.com>
Subject Re: Why BucketJoinMap consume too much memory
Date Fri, 06 Apr 2012 16:33:05 GMT
Hi Binh,
        From the information you provided, a bucketed map join should be possible. I'm clueless now, but I can make one more try if you could provide me the output of the following:

1) Describe Formatted ra_md_syn;
2) Describe Formatted ra_ocs_syn;
3) hadoop fs -du <hdfs location of ra_ocs_syn>
4) hadoop fs -du <hdfs location of ra_md_syn>
5) Perform the join and paste the full console log along with the query (with all the properties set at the CLI).
6) Your hive-site.xml

@Alex
       You can use non-equality conditions in the WHERE clause; only the ON conditions need to be equality ones.
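
For example, a minimal sketch (reusing the table and column names already in this thread) that keeps the equality in ON and the non-equality in WHERE:

select /*+ MAPJOIN(b) */ a.calling, a.total_volume, b.total_volume
from ra_md_syn a join ra_ocs_syn b
on (a.calling = b.calling)
where a.total_volume <> b.total_volume;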


Regards
Bejoy KS



________________________________
 From: gemini alex <gemini5201314@gmail.com>
To: user@hive.apache.org 
Cc: Bejoy Ks <bejoy_ks@yahoo.com> 
Sent: Friday, April 6, 2012 12:36 PM
Subject: Re: Why BucketJoinMap consume too much memory
 

I guess the problem is that you can't use a <> predicate in the bucket join itself; try:

select c.* from (
  select /*+ MAPJOIN(b) */ a.calling calling, a.total_volume atotal_volume, b.total_volume btotal_volume
  from ra_md_syn a join ra_ocs_syn b
  on (a.calling = b.calling)
) c where c.atotal_volume <> c.btotal_volume;



On April 6, 2012 at 9:19 AM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Hi Bejoy,
>
>Sorry for the late response. I will start the demonstration over again to clarify some information.
>
>I have 2 tables, nearly the same. Both have the same table structure: 65m records, 2GB size (same size).
>hive> describe ra_md_syn;
>OK
>calling          string
>total_duration   bigint
>total_volume     bigint
>total_charge     bigint
> 
>Both of them were bucketized into 256 buckets on the ‘calling’ column (last time it was only 10 buckets; I tried to increase it as you suggested). And I want to find all ‘calling’ values that exist in both tables but with different ‘total_volume’.
>The script, as you know:
> 
>hive> set hive.optimize.bucketmapjoin = true;
>hive> set hive.enforce.bucketing=true;
>hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
>    > on (a.calling = b.calling) where a.total_volume <> b.total_volume;
> 
>And the result was exactly as in my last email: a Java heap space error. With a total size of only 2GB and 256 buckets, I think the bucket size cannot be the issue here.
>
>Please give me some advice, I would really appreciate it.
>Best regards
>Nguyen Thanh Binh (Mr)
>Cell phone: (+84)98.226.0622
> 
>From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
>Sent: Thursday, April 05, 2012 7:23 PM
>To: user@hive.apache.org
>Subject: Re: Why BucketJoinMap consume too much memory
> 
>Hi Binh
>
>    I was just checking your local map join log, and I noticed two things:
>- the memory usage by one hash table has gone beyond 1G
>- the number of rows processed is just 2M
>
>It is possible that each bucket itself is too large to be loaded into memory.
>
>As a workaround, or to nail down whether the bucket size is the issue here, can you try increasing the number of buckets to 100 and doing a bucketed map join again?
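>
>A rough sketch of re-bucketing into 100 buckets (using a hypothetical staging table name ra_md_syn_100b; adjust names and columns to your schema) would be:
>
>hive> set hive.enforce.bucketing=true;
>hive> create table ra_md_syn_100b (calling string, total_duration bigint, total_volume bigint, total_charge bigint) clustered by (calling) into 100 buckets;
>hive> insert overwrite table ra_md_syn_100b select calling, total_duration, total_volume, total_charge from ra_md_syn;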
> 
>Also, you mentioned the data size is 2GB; is that the compressed data size?
>
>2012-04-05 10:41:07     Processing rows: 2,900,000  Hashtable size: 2899999  Memory usage: 1,062,065,576  rate: 0.76
> 
>Regards
>Bejoy KS
> 
> 
> 
>
>________________________________
> 
>From: Nitin Pawar <nitinpawar432@gmail.com>
>To: user@hive.apache.org 
>Sent: Thursday, April 5, 2012 5:03 PM
>Subject: Re: Why BucketJoinMap consume too much memory
>
>
>
>Can you tell me the size of table b?
>
>If you are doing bucketing and table b is still huge, then it will run into this problem.
>On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:
>Thanks Nitin,
>
>I tried but no luck. Here's the hive log, please spend a little time to view it.
>
>hive> set hive.optimize.bucketmapjoin = true;
>hive> set hive.enforce.bucketing=true;
>hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
>    > on (a.calling = b.calling) where a.total_volume <> b.total_volume;
>Total MapReduce jobs = 1
>WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
>Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log
>2012-04-05 10:37:45     Starting to launch local task to process map join;  maximum memory = 1398145024
>2012-04-05 10:37:48     Processing rows: 200000   Hashtable size: 199999   Memory usage: 75403880    rate: 0.054
>2012-04-05 10:37:50     Processing rows: 300000   Hashtable size: 299999   Memory usage: 111404664   rate: 0.08
>2012-04-05 10:37:54     Processing rows: 400000   Hashtable size: 399999   Memory usage: 151598960   rate: 0.108
>2012-04-05 10:38:04     Processing rows: 500000   Hashtable size: 499999   Memory usage: 185483368   rate: 0.133
>2012-04-05 10:38:09     Processing rows: 600000   Hashtable size: 599999   Memory usage: 221483392   rate: 0.158
>2012-04-05 10:38:13     Processing rows: 700000   Hashtable size: 699999   Memory usage: 257482640   rate: 0.184
>2012-04-05 10:38:19     Processing rows: 800000   Hashtable size: 799999   Memory usage: 297676944   rate: 0.213
>2012-04-05 10:38:22     Processing rows: 900000   Hashtable size: 899999   Memory usage: 333676968   rate: 0.239
>2012-04-05 10:38:27     Processing rows: 1000000  Hashtable size: 999999   Memory usage: 369676944   rate: 0.264
>2012-04-05 10:38:31     Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 405676968   rate: 0.29
>2012-04-05 10:38:36     Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 441676944   rate: 0.316
>2012-04-05 10:38:42     Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 477676944   rate: 0.342
>2012-04-05 10:38:47     Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 513676968   rate: 0.367
>2012-04-05 10:38:52     Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 549676944   rate: 0.393
>2012-04-05 10:39:00     Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 602454200   rate: 0.431
>2012-04-05 10:39:08     Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 630065552   rate: 0.451
>2012-04-05 10:39:14     Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 666065552   rate: 0.476
>2012-04-05 10:39:20     Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 702065552   rate: 0.502
>2012-04-05 10:39:26     Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 738065576   rate: 0.528
>2012-04-05 10:39:36     Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 774065552   rate: 0.554
>2012-04-05 10:39:43     Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 810065552   rate: 0.579
>2012-04-05 10:39:51     Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 846065576   rate: 0.605
>2012-04-05 10:40:16     Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 882085136   rate: 0.631
>2012-04-05 10:40:24     Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 918085208   rate: 0.657
>2012-04-05 10:40:39     Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 954065544   rate: 0.682
>2012-04-05 10:40:48     Processing rows: 2700000  Hashtable size: 2699999  Memory usage: 990065568   rate: 0.708
>2012-04-05 10:40:56     Processing rows: 2800000  Hashtable size: 2799999  Memory usage: 1026065552  rate: 0.734
>2012-04-05 10:41:07     Processing rows: 2900000  Hashtable size: 2899999  Memory usage: 1062065576  rate: 0.76
>Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
> 
>Best regards
>Nguyen Thanh Binh (Mr)
>Cell phone: (+84)98.226.0622
> 
>From: Nitin Pawar [mailto:nitinpawar432@gmail.com]
>Sent: Thursday, April 05, 2012 5:36 PM
>To: user@hive.apache.org
>Subject: Re: Why BucketJoinMap consume too much memory
> 
>Can you try adding these settings:
>set hive.enforce.bucketing=true;
>set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>
>I have tried bucketing with 1000 buckets and with tables of more than 1TB of data .. they do go through fine.
> 
> 
>On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:
>Hi Bejoy,
>
>Both my tables have 65m records (~1.8-1.9GB on hadoop) and are bucketized on the ‘calling’ column into 10 buckets.
>
>As you said, hive will load only 1 bucket, ~180-190MB, into memory. That is hardly enough to blow the heap (1.3GB).
>
>According to the wiki, I set:
>
>  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>  set hive.optimize.bucketmapjoin = true;
>  set hive.optimize.bucketmapjoin.sortedmerge = true;
>
>And ran the following SQL:
>
>select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
>on (a.calling = b.calling) where a.total_volume <> b.total_volume;
>
>But it still created many hash tables and then threw a Java heap space error.
>
>Best regards
>Nguyen Thanh Binh (Mr)
>Cell phone: (+84)98.226.0622
> 
>From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
>Sent: Thursday, April 05, 2012 3:07 PM
>To: user@hive.apache.org
>Subject: Re: Why BucketJoinMap consume too much memory
> 
>Hi Amit
>
>      Sorry for the delayed response, I had a terrible schedule. AFAIK, there are no flags that would let you take the hash table creation, compression and loading into tmp files away from the client node.
>      From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1GB and this small table is 5GB: it would blow up your job if the map tasks try to load such a huge file into memory. Bucketed map join can help here; if the table is bucketed, say into 100 buckets, then each bucket may have around 50MB of data, i.e. one tmp file would be just less than 50MB. Here a mapper needs to load only the required buckets in memory and thus hardly runs into memory issues.
>    Also, on the client, the records are processed bucket by bucket and loaded into tmp files. So if your bucket size is larger than the heap size specified for your client, it would throw an out of memory error.
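>
>As a quick sanity check (just a sketch; the path below is a placeholder, use the actual HDFS location of your bucketed table), you can list the per-bucket file sizes and compare them against your client and child JVM heap:
>
>hadoop fs -du /user/hive/warehouse/your_bucketed_table
>
>Each bucket file should be comfortably smaller than the configured heap; otherwise the local hash table build will run out of memory.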
> 
>Regards
>Bejoy KS
> 
>
>________________________________
> 
>From: Amit Sharma <amitsharma1708@gmail.com>
>To: user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com>
>Sent: Tuesday, April 3, 2012 11:06 PM
>Subject: Re: Why BucketJoinMap consume too much memory
> 
>I am experiencing similar behavior in my queries. All the conditions for bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that instead of a single hash table, multiple hash tables are created. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.
>
>Can I find an example anywhere which shows the behavior of bucketed map join where it does not create the hash tables on the client itself? If so, is there a flag for it?
>
>Thanks,
>Amit
>On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:
>Hi
>    On a first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work (see the sketch after the note below):
>1) Both tables are bucketed on the join columns
>2) The number of buckets in each table should be multiples of each other
>3) Ensure that each table has enough buckets
>
>Note: If the data is large, say 1TB per table, and you have just a few buckets, say 100 buckets, each mapper may have to load 10GB. This would definitely blow up your JVM. Bottom line: ensure your mappers are not heavily loaded by the bucketed data distribution.
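>
>For illustration only (a hypothetical DDL sketch; your actual table layouts may differ), a pair of tables satisfying these conditions could be declared like this, and then loaded with hive.enforce.bucketing=true so data is actually written into the declared buckets. Pick a bucket count so each bucket comfortably fits in a mapper's heap:
>
>create table ra_md_cdr_ggsn_synthetic (calling string, total_volume bigint)
>clustered by (calling) into 256 buckets;
>
>create table ra_ocs_cdr_ggsn_synthetic (calling string, total_volume bigint)
>clustered by (calling) into 256 buckets;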
>
>Regards
>Bejoy.K.S
>
>________________________________
> 
>From: binhnt22 <Binhnt22@viettel.com.vn>
>To: user@hive.apache.org 
>Sent: Saturday, March 31, 2012 6:46 AM
>Subject: Why BucketJoinMap consume too much memory
> 
>I have 2 tables, each with 6 million records, clustered into 10 buckets.
>
>These tables are very simple with 1 key column and 1 value column; all I want is to get the keys that exist in both tables but with different values.
>
>The normal join did the trick, taking only 141 secs:
>
>select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume;
>
>I tried to use bucket map join by setting:   set hive.optimize.bucketmapjoin = true
>
>select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume;
> 
>2012-03-30 11:35:09     Starting to launch local task to process map join;  maximum memory = 1398145024
>2012-03-30 11:35:12     Processing rows: 200000   Hashtable size: 199999   Memory usage: 86646704    rate: 0.062
>2012-03-30 11:35:15     Processing rows: 300000   Hashtable size: 299999   Memory usage: 128247464   rate: 0.092
>2012-03-30 11:35:18     Processing rows: 400000   Hashtable size: 399999   Memory usage: 174041744   rate: 0.124
>2012-03-30 11:35:21     Processing rows: 500000   Hashtable size: 499999   Memory usage: 214140840   rate: 0.153
>2012-03-30 11:35:25     Processing rows: 600000   Hashtable size: 599999   Memory usage: 255181504   rate: 0.183
>2012-03-30 11:35:29     Processing rows: 700000   Hashtable size: 699999   Memory usage: 296744320   rate: 0.212
>2012-03-30 11:35:35     Processing rows: 800000   Hashtable size: 799999   Memory usage: 342538616   rate: 0.245
>2012-03-30 11:35:38     Processing rows: 900000   Hashtable size: 899999   Memory usage: 384138552   rate: 0.275
>2012-03-30 11:35:45     Processing rows: 1000000  Hashtable size: 999999   Memory usage: 425719576   rate: 0.304
>2012-03-30 11:35:50     Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 467319576   rate: 0.334
>2012-03-30 11:35:56     Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 508940504   rate: 0.364
>2012-03-30 11:36:04     Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 550521128   rate: 0.394
>2012-03-30 11:36:09     Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 592121128   rate: 0.424
>2012-03-30 11:36:15     Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 633720336   rate: 0.453
>2012-03-30 11:36:22     Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 692097568   rate: 0.495
>2012-03-30 11:36:33     Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 725308944   rate: 0.519
>2012-03-30 11:36:40     Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 766946424   rate: 0.549
>2012-03-30 11:36:48     Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 808527928   rate: 0.578
>2012-03-30 11:36:55     Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 850127928   rate: 0.608
>2012-03-30 11:37:08     Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 891708856   rate: 0.638
>2012-03-30 11:37:16     Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 933308856   rate: 0.668
>2012-03-30 11:37:25     Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 974908856   rate: 0.697
>2012-03-30 11:37:34     Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 1016529448  rate: 0.727
>2012-03-30 11:37:43     Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 1058129496  rate: 0.757
>2012-03-30 11:37:58     Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 1099708832  rate: 0.787
>Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
> 
>My system has 4 PCs, each with a CPU E2180, 2GB RAM and an 80GB HDD; one of them contains the NameNode, JobTracker and Hive Server, and all of them contain a DataNode and TaskTracker.
>
>In all nodes, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).
>
>I want to ask you experts: why does bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?
> 
>Best regards,
>
>-- 
>Nitin Pawar
>
>-- 
>Nitin Pawar
> 