hive-user mailing list archives

From Bejoy Ks <bejoy...@yahoo.com>
Subject Re: [Marketing Mail] RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
Date Thu, 12 Apr 2012 11:51:52 GMT
Hi Anand

      Looks like there is some issue with the number of buckets. Do all of your partitions have the same number of buckets, and is the bucket count of the smaller table a multiple of that of the larger table?

Hive is supposed to build n hash tables if you have n buckets, but here just one hash table is being generated, so I suspect a plain map join is taking place instead.
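For reference, the bucket counts only line up for a bucketed map join when the two DDLs are compatible on the join key. A minimal sketch (table and column names here are hypothetical, not from this thread):

```sql
-- Bucketed map join requires the smaller table's bucket count to equal,
-- or evenly divide, the larger table's bucket count, on the same join key.
CREATE TABLE big_orders   (order_id INT, amt FLOAT)
  CLUSTERED BY (order_id) INTO 256 BUCKETS;
CREATE TABLE small_orders (order_id INT, fee FLOAT)
  CLUSTERED BY (order_id) INTO 64 BUCKETS;  -- 256 is a multiple of 64: OK
```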

Also, can you post the stack trace from your MR task logs as well as the Hive logs?

Regards
Bejoy KS



________________________________
 From: "Ladda, Anand" <lanand@microstrategy.com>
To: "user@hive.apache.org" <user@hive.apache.org> 
Sent: Thursday, April 12, 2012 2:45 AM
Subject: RE: [Marketing Mail] RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
 

 
Thanks Binh. This helps! When I run a job that accesses just a single partition of each table, it runs OK, but there isn't much benefit from a bucketed map join in that scenario. I tried increasing the number of partitions in the join, hoping bucketed map joins would show an improvement over a common join, but instead the jobs started failing with a Java heap space issue. Experts, any thoughts here? I have pasted the Hive CLI output below.
 
hive> select /*+ MAPJOIN(a12) */a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from Orderfactpartclust4 a12 join Orderdetailpartclust4 a11 on (a11.order_id = a12.order_id) where (a11.order_date <= '09-03-2008' and a12.order_date  <= '2008-09-03') group by a12.shipper_id;
Total MapReduce jobs = 2
Execution log at: /tmp/hdfs/hdfs_20120411170000_7e6249ec-a05a-4474-b7f0-b68b3da44650.log
2012-04-11 05:00:27     Starting to launch local task to process map join;      maximum memory = 1004929024
2012-04-11 05:00:30     Processing rows:        200000  Hashtable size: 199999  Memory usage:   80511368        rate:   0.08
2012-04-11 05:00:32     Processing rows:        300000  Hashtable size: 299999  Memory usage:   116901784       rate:   0.116
2012-04-11 05:00:34     Processing rows:        400000  Hashtable size: 399999  Memory usage:   162823040       rate:   0.162
2012-04-11 05:00:36     Processing rows:        500000  Hashtable size: 499999  Memory usage:   194695656       rate:   0.194
2012-04-11 05:00:38     Processing rows:        600000  Hashtable size: 599999  Memory usage:   232139096       rate:   0.231
2012-04-11 05:00:40     Processing rows:        700000  Hashtable size: 699999  Memory usage:   268303576       rate:   0.267
2012-04-11 05:00:43     Processing rows:        800000  Hashtable size: 799999  Memory usage:   322688448       rate:   0.321
2012-04-11 05:00:46     Processing rows:        900000  Hashtable size: 899999  Memory usage:   350289528       rate:   0.349
2012-04-11 05:00:49     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   387732920       rate:   0.386
2012-04-11 05:00:52     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   423889488       rate:   0.422
2012-04-11 05:00:55     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   461331544       rate:   0.459
2012-04-11 05:00:59     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   497489360       rate:   0.495
2012-04-11 05:01:02     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   534954712       rate:   0.532
2012-04-11 05:01:07     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   571084408       rate:   0.568
2012-04-11 05:01:12     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   642318584       rate:   0.639
2012-04-11 05:01:18     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   661462456       rate:   0.658
2012-04-11 05:01:23     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   698903360       rate:   0.695
2012-04-11 05:01:28     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   735061568       rate:   0.731
2012-04-11 05:01:33     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   784973632       rate:   0.781
2012-04-11 05:01:39     Processing rows:        2082319 Hashtable size: 2082319 Memory usage:   801950600       rate:   0.798
2012-04-11 05:01:39     Dump the hashtable into file: file:/tmp/hdfs/hive_2012-04-11_17-00-24_492_2227050132132538171/-local-10003/HashTable-Stage-1/MapJoin-0--.hashtable
2012-04-11 05:02:22     Upload 1 File to: file:/tmp/hdfs/hive_2012-04-11_17-00-24_492_2227050132132538171/-local-10003/HashTable-Stage-1/MapJoin-0--.hashtable File size: 118691974
2012-04-11 05:02:22     End of local task; Time Taken: 114.907 sec.
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201202131643_1362, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1362
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1362
2012-04-11 17:02:32,402 Stage-1 map = 0%,  reduce = 0%
2012-04-11 17:02:40,486 Stage-1 map = 2%,  reduce = 0%
2012-04-11 17:02:41,603 Stage-1 map = 6%,  reduce = 0%
2012-04-11 17:02:42,626 Stage-1 map = 8%,  reduce = 0%
2012-04-11 17:02:44,639 Stage-1 map = 9%,  reduce = 0%
2012-04-11 17:02:47,650 Stage-1 map = 6%,  reduce = 0%
2012-04-11 17:02:48,654 Stage-1 map = 2%,  reduce = 0%
2012-04-11 17:02:49,658 Stage-1 map = 1%,  reduce = 0%
2012-04-11 17:02:52,669 Stage-1 map = 0%,  reduce = 0%
2012-04-11 17:02:56,682 Stage-1 map = 2%,  reduce = 0%
2012-04-11 17:02:57,687 Stage-1 map = 5%,  reduce = 0%
2012-04-11 17:02:58,691 Stage-1 map = 7%,  reduce = 0%
2012-04-11 17:02:59,708 Stage-1 map = 8%,  reduce = 0%
2012-04-11 17:03:03,740 Stage-1 map = 6%,  reduce = 0%
2012-04-11 17:03:04,750 Stage-1 map = 4%,  reduce = 0%
2012-04-11 17:03:05,755 Stage-1 map = 3%,  reduce = 0%
2012-04-11 17:03:06,759 Stage-1 map = 2%,  reduce = 0%
2012-04-11 17:03:07,764 Stage-1 map = 1%,  reduce = 0%
2012-04-11 17:03:12,781 Stage-1 map = 2%,  reduce = 0%
2012-04-11 17:03:13,819 Stage-1 map = 5%,  reduce = 0%
2012-04-11 17:03:14,823 Stage-1 map = 6%,  reduce = 0%
2012-04-11 17:03:15,828 Stage-1 map = 7%,  reduce = 0%
2012-04-11 17:03:17,837 Stage-1 map = 8%,  reduce = 0%
2012-04-11 17:03:20,858 Stage-1 map = 5%,  reduce = 0%
2012-04-11 17:03:21,865 Stage-1 map = 4%,  reduce = 0%
2012-04-11 17:03:23,874 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202131643_1362 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
 
From: Binhnt22@viettel.com.vn [mailto:Binhnt22@viettel.com.vn] 
Sent: Tuesday, April 10, 2012 9:36 PM
To: user@hive.apache.org
Subject: [Marketing Mail] RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
 
Hi Ladda,
 
Your case is pretty simple: when you give a table an alias (a11, a12), you should use the alias in the MAPJOIN hint.
 
That means your SQL should look like:
 
select /*+ MAPJOIN(a11) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;
 
Best regards
Nguyen Thanh Binh (Mr)
 
From: Ladda, Anand [mailto:lanand@microstrategy.com] 
Sent: Wednesday, April 11, 2012 3:23 AM
To: user@hive.apache.org
Subject: RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
 
Hi Bejoy/Binh
I have been following this thread to better understand where a bucket map join would help, and it has been a great thread to follow. I have been struggling with this on my end as well.
 
I have two tables, one about 22GB in size (orderdetailpartclust2) and the other 1.5GB (orderfactpartclust2), all partitions combined, and I wanted to see the impact of different kinds of joins on one of the partitions of these tables.
 
For this analysis I created a version of each table partitioned on order_date and bucketed on order_id (the column I want to join on). The data was loaded from their non-partitioned counterparts, with the following parameters set to ensure that it lands in the right partitions and is bucketed correctly by Hive:
 
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
set hive.enforce.bucketing = true;
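Under those settings, the load from the non-partitioned counterparts could be sketched like this (the source table name `orderdetail` is an assumption; the column order follows the table description further down in this mail):

```sql
-- Dynamic-partition load; the partition column (order_date) must come
-- last in the SELECT list. With hive.enforce.bucketing=true, Hive runs
-- one reducer per bucket so each partition gets 256 bucket files.
INSERT OVERWRITE TABLE orderdetailpartclust2 PARTITION (order_date)
SELECT order_id, item_id, emp_id, promotion_id, customer_id,
       qty_sold, unit_price, unit_cost, discount, order_date
FROM orderdetail;
```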
 
However, when I run the following join query, I don't get a bucketed map-side join:
 
select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;
 
Below are the relevant pieces of information on each of these tables. Can you please take a look and see what I might be missing to get a map-side join? Is it failing because my tables are also partitioned?
 
 
1.       hive> describe formatted orderdetailpartclust2;
OK
# col_name              data_type               comment
 
order_id                int                     from deserializer
item_id                 int                     from deserializer
emp_id                  int                     from deserializer
promotion_id            int                     from deserializer
customer_id             int                     from deserializer
qty_sold                float                   from deserializer
unit_price              float                   from deserializer
unit_cost               float                   from deserializer
discount                float                   from deserializer
 
# Partition Information
# col_name              data_type               comment
 
order_date              string                  None
 
# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Thu Apr 05 17:01:22 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2
Table Type:             MANAGED_TABLE
Table Parameters:
        SORTBUCKETCOLSPREFIX    TRUE
        numFiles                19200
        numPartitions           75
        numRows                 0
        totalSize               22814162038
        transient_lastDdlTime   1333725153
 
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [order_id]
Sort Columns:           [Order(col:order_id, order:1)]
Storage Desc Params:
        escape.delim            \\
        field.delim             \t
        serialization.format    \t
Time taken: 3.255 seconds
2.       hive> describe formatted orderfactpartclust2;
OK
# col_name              data_type               comment
 
order_id                int                     from deserializer
emp_id                  int                     from deserializer
order_amt               float                   from deserializer
order_cost              float                   from deserializer
qty_sold                float                   from deserializer
freight                 float                   from deserializer
gross_dollar_sales      float                   from deserializer
ship_date               string                  from deserializer
rush_order              string                  from deserializer
customer_id             int                     from deserializer
pymt_type               int                     from deserializer
shipper_id              int                     from deserializer
 
# Partition Information
# col_name              data_type               comment
 
order_date              string                  None
 
# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Thu Apr 05 18:09:28 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2
Table Type:             MANAGED_TABLE
Table Parameters:
        SORTBUCKETCOLSPREFIX    TRUE
        numFiles                7680
        numPartitions           30
        numRows                 0
        totalSize               1528946078
        transient_lastDdlTime   1333722539
 
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [order_id]
Sort Columns:           [Order(col:order_id, order:1)]
Storage Desc Params:
        escape.delim            \\
        field.delim             \t
        serialization.format    \t
Time taken: 1.737 seconds
 
3.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2;
299867901   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=01-01-2008
.
.
.
311033139   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008
4.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008;
Found 256 items
1213444     hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008/000000_0
.
.
.
1213166     hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008/000255_0
-bash-4.1$
5.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2;
Found 30 items
50943109    hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-01
.
.
.
50902368    hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30
6.       bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30;
Found 256 items
198692      hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30/000000_0
.
.
.
198954      hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30/000255_0
 
7.       -bash-4.1$ cat hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<configuration>
 
<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
<!-- that are implied by Hadoop setup variables.                                                -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource).                                                                                 -->
 
<!-- Hive Execution Parameters -->
 
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- jdbc:derby:/hadoophome/metastore_db;create=true -->
  <value>jdbc:derby://hadoop010:1527/;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
 
<property>
  <name>hive.hwi.war.file</name>
  <value>/usr/lib/hive/lib/hive-hwi-0.7.0-cdh3u0.war</value>
  <description>This is the WAR file with the jsp content for Hive Web Interface</description>
</property>
 
</configuration>
 
8.       Performing Join
 
hive> set hive.optimize.bucketmapjoin=true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive> select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202131643_1294, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1294
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1294
2012-04-10 16:15:06,663 Stage-1 map = 0%,  reduce = 0%
2012-04-10 16:15:08,671 Stage-1 map = 1%,  reduce = 0%
2012-04-10 16:15:09,675 Stage-1 map = 3%,  reduce = 0%
2012-04-10 16:15:10,679 Stage-1 map = 4%,  reduce = 0%
2012-04-10 16:15:11,683 Stage-1 map = 5%,  reduce = 0%
2012-04-10 16:15:12,688 Stage-1 map = 7%,  reduce = 0%
2012-04-10 16:15:13,692 Stage-1 map = 8%,  reduce = 0%
2012-04-10 16:15:14,697 Stage-1 map = 10%,  reduce = 0%
2012-04-10 16:15:15,756 Stage-1 map = 12%,  reduce = 0%
2012-04-10 16:15:16,761 Stage-1 map = 13%,  reduce = 0%
2012-04-10 16:15:17,767 Stage-1 map = 14%,  reduce = 0%
2012-04-10 16:15:18,773 Stage-1 map = 16%,  reduce = 0%
2012-04-10 16:15:19,778 Stage-1 map = 17%,  reduce = 1%
2012-04-10 16:15:20,784 Stage-1 map = 18%,  reduce = 1%
2012-04-10 16:15:21,789 Stage-1 map = 20%,  reduce = 1%
2012-04-10 16:15:22,795 Stage-1 map = 21%,  reduce = 5%
2012-04-10 16:15:23,800 Stage-1 map = 23%,  reduce = 5%
2012-04-10 16:15:24,805 Stage-1 map = 24%,  reduce = 5%
2012-04-10 16:15:25,936 Stage-1 map = 25%,  reduce = 8%
2012-04-10 16:15:26,941 Stage-1 map = 27%,  reduce = 8%
2012-04-10 16:15:27,947 Stage-1 map = 28%,  reduce = 8%
2012-04-10 16:15:28,951 Stage-1 map = 30%,  reduce = 8%
2012-04-10 16:15:29,956 Stage-1 map = 31%,  reduce = 8%
2012-04-10 16:15:30,981 Stage-1 map = 32%,  reduce = 8%
2012-04-10 16:15:31,987 Stage-1 map = 34%,  reduce = 8%
2012-04-10 16:15:32,992 Stage-1 map = 35%,  reduce = 10%
2012-04-10 16:15:33,998 Stage-1 map = 37%,  reduce = 10%
2012-04-10 16:15:35,003 Stage-1 map = 38%,  reduce = 10%
2012-04-10 16:15:36,055 Stage-1 map = 40%,  reduce = 10%
2012-04-10 16:15:37,097 Stage-1 map = 42%,  reduce = 10%
2012-04-10 16:15:38,102 Stage-1 map = 43%,  reduce = 10%
2012-04-10 16:15:39,108 Stage-1 map = 44%,  reduce = 10%
2012-04-10 16:15:40,113 Stage-1 map = 46%,  reduce = 10%
2012-04-10 16:15:41,123 Stage-1 map = 47%,  reduce = 10%
2012-04-10 16:15:42,128 Stage-1 map = 49%,  reduce = 15%
2012-04-10 16:15:43,134 Stage-1 map = 50%,  reduce = 15%
2012-04-10 16:15:44,139 Stage-1 map = 53%,  reduce = 15%
2012-04-10 16:15:46,152 Stage-1 map = 54%,  reduce = 15%
2012-04-10 16:15:47,158 Stage-1 map = 57%,  reduce = 15%
2012-04-10 16:15:48,164 Stage-1 map = 58%,  reduce = 15%
2012-04-10 16:15:49,171 Stage-1 map = 60%,  reduce = 15%
2012-04-10 16:15:50,176 Stage-1 map = 61%,  reduce = 15%
2012-04-10 16:15:51,182 Stage-1 map = 63%,  reduce = 19%
2012-04-10 16:15:52,199 Stage-1 map = 65%,  reduce = 19%
2012-04-10 16:15:53,222 Stage-1 map = 66%,  reduce = 19%
2012-04-10 16:15:54,228 Stage-1 map = 68%,  reduce = 19%
2012-04-10 16:15:55,234 Stage-1 map = 70%,  reduce = 19%
2012-04-10 16:15:56,241 Stage-1 map = 71%,  reduce = 19%
2012-04-10 16:15:57,248 Stage-1 map = 73%,  reduce = 21%
2012-04-10 16:15:58,253 Stage-1 map = 75%,  reduce = 21%
2012-04-10 16:15:59,260 Stage-1 map = 76%,  reduce = 21%
2012-04-10 16:16:00,267 Stage-1 map = 79%,  reduce = 21%
2012-04-10 16:16:01,273 Stage-1 map = 80%,  reduce = 21%
2012-04-10 16:16:02,280 Stage-1 map = 81%,  reduce = 21%
2012-04-10 16:16:03,287 Stage-1 map = 83%,  reduce = 27%
2012-04-10 16:16:04,294 Stage-1 map = 84%,  reduce = 27%
2012-04-10 16:16:05,302 Stage-1 map = 86%,  reduce = 27%
2012-04-10 16:16:06,310 Stage-1 map = 87%,  reduce = 27%
2012-04-10 16:16:07,317 Stage-1 map = 90%,  reduce = 27%
2012-04-10 16:16:08,325 Stage-1 map = 91%,  reduce = 27%
2012-04-10 16:16:09,332 Stage-1 map = 92%,  reduce = 27%
2012-04-10 16:16:10,339 Stage-1 map = 94%,  reduce = 27%
2012-04-10 16:16:11,348 Stage-1 map = 95%,  reduce = 27%
2012-04-10 16:16:12,355 Stage-1 map = 97%,  reduce = 29%
2012-04-10 16:16:13,362 Stage-1 map = 99%,  reduce = 29%
2012-04-10 16:16:14,370 Stage-1 map = 100%,  reduce = 29%
2012-04-10 16:16:18,396 Stage-1 map = 100%,  reduce = 32%
2012-04-10 16:16:24,654 Stage-1 map = 100%,  reduce = 67%
2012-04-10 16:16:27,683 Stage-1 map = 100%,  reduce = 70%
2012-04-10 16:16:30,701 Stage-1 map = 100%,  reduce = 73%
2012-04-10 16:16:33,719 Stage-1 map = 100%,  reduce = 77%
2012-04-10 16:16:36,739 Stage-1 map = 100%,  reduce = 80%
2012-04-10 16:16:39,781 Stage-1 map = 100%,  reduce = 84%
2012-04-10 16:16:42,806 Stage-1 map = 100%,  reduce = 88%
2012-04-10 16:16:45,824 Stage-1 map = 100%,  reduce = 92%
2012-04-10 16:16:48,840 Stage-1 map = 100%,  reduce = 97%
2012-04-10 16:16:50,854 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202131643_1294
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202131643_1295, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1295
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1295
2012-04-10 16:16:56,693 Stage-2 map = 0%,  reduce = 0%
2012-04-10 16:17:02,716 Stage-2 map = 100%,  reduce = 0%
2012-04-10 16:17:12,759 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_201202131643_1295
OK
1       678832  67850   678832.0
2       1360529 135253  1360529.0
3       4784635 460994  4784635.0
Time taken: 131.748 seconds
hive>
From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Tuesday, April 10, 2012 11:44 AM
To: user@hive.apache.org
Subject: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
 
Hi Binh
      You are right; here both of your tables are about the same size, and loading 2GB of data into hash tables and then into temp files and so on takes some time. That overhead becomes negligible when, say, one table is 2GB and the other is 2TB; then you'll notice a wide difference in performance between a common join and a bucketed map join.
      If one of the tables is very small, a plain map join is good; if it is of moderate size, use a bucketed map join.
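That rule of thumb can be sketched as follows (settings only; the alias `small_t` is hypothetical, and the exact size threshold depends on your heap):

```sql
-- Small table fits comfortably in one in-memory hash table: plain map join.
SELECT /*+ MAPJOIN(small_t) */ ... ;

-- Small table is moderate, but both tables are bucketed on the join key:
-- enable the bucketed map join so only one bucket is hashed at a time.
SET hive.optimize.bucketmapjoin = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SELECT /*+ MAPJOIN(small_t) */ ... ;
```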
 
Regards
Bejoy KS
 

________________________________
 
From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Cc: 'Bejoy Ks' <bejoy_ks@yahoo.com> 
Sent: Tuesday, April 10, 2012 8:10 AM
Subject: RE: Why BucketJoinMap consume too much memory
 
Hi Bejoy,
 
It worked like a charm. Thank you very much. I really really appreciate your help.
 
This bucketed join should be used with one big table and one small table.
 
If both tables are big, the join takes much longer than a normal join.
 
Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622
 
From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Monday, April 09, 2012 9:49 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory
 
Hi Binh
     It is just an issue with the number of buckets. Your tables have just 8 buckets, since only 8 files appear in the storage directory. You might have just issued an ALTER TABLE on an existing table to make it bucketed. The workaround here is:
 
1) Wipe and reload the tables with hive.enforce.bucketing=true;
       Ensure your storage directory has as many files as the number of buckets. As per your table DDL you should see 256 files.
 
2) Enable hive.optimize.bucketmapjoin=true; and try the join again.
 
It should definitely work.
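The reload in step 1 could look something like this (a sketch; `ra_md_syn_staging` is a hypothetical copy of the existing data, since overwriting a table from itself mid-reload is not safe):

```sql
SET hive.enforce.bucketing = true;
-- Repopulate the bucketed table so Hive writes all 256 bucket files
INSERT OVERWRITE TABLE ra_md_syn
SELECT calling, total_duration, total_volume, total_charge
FROM ra_md_syn_staging;  -- hypothetical staging copy of the original data
```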
 
Regards
Bejoy KS
 

________________________________
 
From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Cc: 'Bejoy Ks' <bejoy_ks@yahoo.com> 
Sent: Monday, April 9, 2012 8:42 AM
Subject: RE: Why BucketJoinMap consume too much memory
 
Hi Bejoy,

Thank you for helping me. Here is the information 
 
1.      Describe Formatted ra_md_syn;
# col_name              data_type               comment
 
calling                 string                  None
total_duration          bigint                  None
total_volume            bigint                  None
total_charge            bigint                  None
 
# Detailed Table Information
Database:               default
Owner:                  hduser
CreateTime:             Thu Apr 05 09:48:29 ICT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://master:54310/user/hive/warehouse/ra_md_syn
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                8
        numPartitions           0
        numRows                 0
        rawDataSize             0
        totalSize               1872165483
        transient_lastDdlTime   1333595095
 
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [calling]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
 
2.      Describe Formatted ra_ocs_syn;
# col_name              data_type               comment
 
calling                 string                  None
total_duration          bigint                  None
total_volume            bigint                  None
total_charge            bigint                  None
 
# Detailed Table Information
Database:               default
Owner:                  hduser
CreateTime:             Thu Apr 05 09:48:24 ICT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://master:54310/user/hive/warehouse/ra_ocs_syn
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                8
        numPartitions           0
        numRows                 0
        rawDataSize             0
        totalSize               1872162225
        transient_lastDdlTime   1333595512
 
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [calling]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
 
3.      hadoop fs -du <hdfs location of ra_md_syn >
[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn
Found 8 items
280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1
280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0
274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1
274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1
269949415   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0
262439067   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0
205767721   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0
24516526    hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0
 
4.      hadoop fs -du <hdfs location of  ra_ocs_syn >
[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn
Found 8 items
314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0
314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0
304959363   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0
274374381   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1
264694474   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0
257498693   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1
100334604   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0
41022170    hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0
 
5.      hive-site.xml
[hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<configuration>
        <property>
                <name>hive.metastore.local</name>
                <value>true</value>
        </property>
        <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
        </property>
        <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        </property>
        <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hadoop</value>
        </property>
        <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hadoop</value>
        </property>
 
 
        <property>
          <name>hive.metastore.sasl.enabled</name>
          <value>true</value>
          <description>If true, the metastore thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description>
        </property>
 
        <property>
          <name>hive.metastore.kerberos.keytab.file</name>
          <value></value>
          <description>The path to the Kerberos Keytab file containing the metastore thrift server's service principal.</description>
        </property>
 
        <property>
          <name>hive.metastore.kerberos.principal</name>
          <value>hduser/admin@EXAMPLE.COM</value>
          <description>The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.</description>
        </property>
</configuration>
 
6.      Performing JOIN
hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
Total MapReduce jobs = 1
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/hduser/hduser_20120409013131_da40d787-597d-490c-8558-9d10ec11e916.log
2012-04-09 01:31:24     Starting to launch local task to process map join;      maximum memory = 1398145024
2012-04-09 01:31:28     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75405504        rate:   0.054
2012-04-09 01:31:29     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111540296       rate:   0.08
2012-04-09 01:31:32     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151640080       rate:   0.108
2012-04-09 01:31:35     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185503416       rate:   0.133
2012-04-09 01:31:37     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221503440       rate:   0.158
2012-04-09 01:31:42     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257484264       rate:   0.184
2012-04-09 01:31:47     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297678568       rate:   0.213
2012-04-09 01:31:52     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333678592       rate:   0.239
2012-04-09 01:31:57     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369678568       rate:   0.264
2012-04-09 01:32:03     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405678568       rate:   0.29
2012-04-09 01:32:09     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441678592       rate:   0.316
2012-04-09 01:32:15     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477678568       rate:   0.342
2012-04-09 01:32:23     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513678592       rate:   0.367
2012-04-09 01:32:29     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549678568       rate:   0.393
2012-04-09 01:32:35     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602455824       rate:   0.431
2012-04-09 01:32:45     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630067176       rate:   0.451
2012-04-09 01:32:53     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666067176       rate:   0.476
2012-04-09 01:33:01     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702067200       rate:   0.502
2012-04-09 01:33:09     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738067176       rate:   0.528
2012-04-09 01:33:20     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774254456       rate:   0.554
2012-04-09 01:33:29     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810067176       rate:   0.579
2012-04-09 01:33:38     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846568480       rate:   0.605
2012-04-09 01:33:49     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882096752       rate:   0.631
2012-04-09 01:33:59     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918821920       rate:   0.657
2012-04-09 01:34:15     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954134920       rate:   0.682
2012-04-09 01:34:26     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990067168       rate:   0.708
2012-04-09 01:34:38     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1027113288      rate:   0.735
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:597)
        at java.lang.StringBuilder.append(StringBuilder.java:212)
        at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:247)
        at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:232)
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:315)
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:310)
        at java.util.jar.Manifest.read(Manifest.java:178)
        at java.util.jar.Manifest.<init>(Manifest.java:52)
        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:167)
        at java.util.jar.JarFile.getManifest(JarFile.java:148)
        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:696)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:228)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at org.apache.hadoop.util.RunJar$1.run(RunJar.java:126)
Execution failed with exit status: 2
Obtaining error information
 
Task failed!
Task ID:
  Stage-3
 
Logs:
 
/tmp/hduser/hive.log
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask
7.      /tmp/hduser/hive.log
2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
2012-04-09 02:00:12,796 WARN  parse.SemanticAnalyzer (SemanticAnalyzer.java:genBodyPlan(5821)) - Common Gby keys:null
2012-04-09 02:09:02,356 ERROR exec.Task (SessionState.java:printError(380)) - Execution failed with exit status: 2
2012-04-09 02:09:02,357 ERROR exec.Task (SessionState.java:printError(380)) - Obtaining error information
2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) -
Task failed!
Task ID:
  Stage-3
 
Logs:
 
2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) - /tmp/hduser/hive.log
2012-04-09 02:09:02,359 ERROR exec.MapredLocalTask (MapredLocalTask.java:execute(228)) - Execution failed with exit status: 2
2012-04-09 02:09:02,377 ERROR ql.Driver (SessionState.java:printError(380)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask
 
 
Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622
 
From:Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Friday, April 06, 2012 11:33 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory
 
Hi Binh,
        From the information you provided, a bucketed map join should be possible. I'm clueless now, but I can make one more try if you could provide the output of the following:
 
1) Describe Formatted ra_md_syn;
2) Describe Formatted ra_ocs_syn;
 
3) hadoop fs -du <hdfs location of ra_ocs_syn>
4) hadoop fs -du <hdfs location of ra_md_syn>
 
5) Perform the join and paste the full console log along with the query (with all the properties set at the CLI)
 
6) Your hive-site.xml
 
@Alex
       You can use non-equality conditions in the WHERE clause; only the ON conditions need to be equality ones.
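To illustrate Bejoy's point with the tables from this thread — the equality predicate stays in ON, and the non-equality filter goes in WHERE (a sketch, not a query tested on this cluster):

```sql
-- Not supported: a non-equality predicate inside ON
--   ... JOIN ra_ocs_syn b ON (a.calling = b.calling AND a.total_volume <> b.total_volume)

-- Supported: equality in ON, the <> filter in WHERE
SELECT /*+ MAPJOIN(b) */ *
FROM ra_md_syn a
JOIN ra_ocs_syn b ON (a.calling = b.calling)
WHERE a.total_volume <> b.total_volume;
```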
 
 
Regards
Bejoy KS
 
 

________________________________
 
From:gemini alex <gemini5201314@gmail.com>
To: user@hive.apache.org 
Cc: Bejoy Ks <bejoy_ks@yahoo.com> 
Sent: Friday, April 6, 2012 12:36 PM
Subject: Re: Why BucketJoinMap consume too much memory
 
I guess the problem is that you can't use a <> predicate in a bucketed map join. Try:
select c.* from (
select /*+ MAPJOIN(b) */ a.calling calling, a.total_volume atotal_volume, b.total_volume btotal_volume from ra_md_syn a join ra_ocs_syn b
     on (a.calling = b.calling) ) c where c.atotal_volume <> c.btotal_volume;
 
 
 
On Friday, April 6, 2012 at 9:19 AM, binhnt22 <Binhnt22@viettel.com.vn> wrote:
Hi Bejoy,
 
Sorry for the late response. I will demonstrate everything again from the start to clarify some information.
 
I have 2 tables that are nearly identical. Both have the same structure, 65M records, and 2GB size (the same size).
hive> describe ra_md_syn;
OK
calling string
total_duration  bigint
total_volume    bigint
total_charge    bigint
 
Both of them were bucketed into 256 buckets on the ‘calling’ column (last time there were only 10 buckets; I increased the count as you suggested). I want to find all ‘calling’ values that exist in both tables but with a different ‘total_volume’.
The script as you knew:
 
hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
 
And the result was exactly as in my last email: a Java heap space error. With a total size of only 2GB and 256 buckets, I don't see how bucket size can be the issue here.
 
Please give me some advice; I would really appreciate it.
Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622
 
From:Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Thursday, April 05, 2012 7:23 PM

To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory
 
Hi Binh
 
    I was just checking your local map join log, and I noticed two things:
- the memory usage of one hash table has gone beyond 1GB;
- the number of rows processed is just ~2.9M.
 
It is possible that each bucket itself is too large to be loaded in memory: at 10 buckets, a 2GB table gives roughly 200MB of serialized data per bucket, and that can expand several-fold once deserialized into Java objects in a hash table.
 
As a workaround, and to nail down whether bucket size is the issue here, can you try increasing the number of buckets to 100 and doing a bucketed map join again?
 
Also, you mentioned the data size is 2GB; is that the compressed data size?
 
2012-04-05 10:41:07     Processing rows:        2,900,000 Hashtable size: 2899999 Memory usage:   1,062,065,576      rate:   0.76
 
Regards
Bejoy KS
 
 
 

________________________________
 
From:Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org 
Sent: Thursday, April 5, 2012 5:03 PM
Subject: Re: Why BucketJoinMap consume too much memory
 
Can you tell me the size of table b?
 
If you are doing bucketing and table b is still huge, then it will run into this problem.
On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:
Thank Nitin,
 
I tried, but no luck. Here's the hive log; please spend a little time to review it.
 
hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
Total MapReduce jobs = 1
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log
2012-04-05 10:37:45     Starting to launch local task to process map join;      maximum memory = 1398145024
2012-04-05 10:37:48     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75403880        rate:   0.054
2012-04-05 10:37:50     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111404664       rate:   0.08
2012-04-05 10:37:54     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151598960       rate:   0.108
2012-04-05 10:38:04     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185483368       rate:   0.133
2012-04-05 10:38:09     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221483392       rate:   0.158
2012-04-05 10:38:13     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257482640       rate:   0.184
2012-04-05 10:38:19     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297676944       rate:   0.213
2012-04-05 10:38:22     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333676968       rate:   0.239
2012-04-05 10:38:27     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369676944       rate:   0.264
2012-04-05 10:38:31     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405676968       rate:   0.29
2012-04-05 10:38:36     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441676944       rate:   0.316
2012-04-05 10:38:42     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477676944       rate:   0.342
2012-04-05 10:38:47     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513676968       rate:   0.367
2012-04-05 10:38:52     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549676944       rate:   0.393
2012-04-05 10:39:00     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602454200       rate:   0.431
2012-04-05 10:39:08     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630065552       rate:   0.451
2012-04-05 10:39:14     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666065552       rate:   0.476
2012-04-05 10:39:20     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702065552       rate:   0.502
2012-04-05 10:39:26     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738065576       rate:   0.528
2012-04-05 10:39:36     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774065552       rate:   0.554
2012-04-05 10:39:43     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810065552       rate:   0.579
2012-04-05 10:39:51     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846065576       rate:   0.605
2012-04-05 10:40:16     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882085136       rate:   0.631
2012-04-05 10:40:24     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918085208       rate:   0.657
2012-04-05 10:40:39     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954065544       rate:   0.682
2012-04-05 10:40:48     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990065568       rate:   0.708
2012-04-05 10:40:56     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1026065552      rate:   0.734
2012-04-05 10:41:07     Processing rows:        2900000 Hashtable size: 2899999 Memory usage:   1062065576      rate:   0.76
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
 
Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622
 
From:Nitin Pawar [mailto:nitinpawar432@gmail.com] 
Sent: Thursday, April 05, 2012 5:36 PM

To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory
 
Can you try adding these settings:
set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
 
I have tried bucketing with 1000 buckets on tables of more than 1TB of data; they go through fine.
 
 
On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:
Hi Bejoy,
 
Both my tables have 65M records (~1.8-1.9GB on Hadoop) and are bucketed on the ‘calling’ column into 10 buckets.
 
As you said, hive will load only 1 bucket, ~180-190MB, into memory. That can hardly blow the heap (1.3GB).
 
According to wiki, I set:
 
  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
  set hive.optimize.bucketmapjoin = true;
  set hive.optimize.bucketmapjoin.sortedmerge = true;
 
And run the following SQL
 
select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b 
on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
 
But it still created many hash tables and then threw a Java heap space error.
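For what it's worth, the sortedmerge setting above only takes effect if both tables were also sorted on the join key at creation time. A hedged DDL sketch (the column list is taken from the describe output earlier in the thread; this is not the thread's actual DDL):

```sql
-- Both tables would need CLUSTERED BY and SORTED BY on the join key,
-- with compatible bucket counts, for a sort-merge bucket map join.
CREATE TABLE ra_md_cdr_ggsn_synthetic (
  calling        STRING,
  total_duration BIGINT,
  total_volume   BIGINT,
  total_charge   BIGINT
)
CLUSTERED BY (calling) SORTED BY (calling) INTO 10 BUCKETS;
```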
 
Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622
 
From:Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org

Subject: Re: Why BucketJoinMap consume too much memory
 
Hi Amit
 
      Sorry for the delayed response, I had a terrible schedule. AFAIK, there are no flags that would let you move the hash table creation, compression, and load into tmp files away from the client node.
      From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1GB and this small table is 5GB: the job would blow up if a map task tried to load such a huge file into memory. A bucketed map join can help here. If the table is bucketed into, say, 100 buckets, each bucket may hold around 50MB of data, i.e. one tmp file would be just under 50MB; a mapper needs to load only the required buckets into memory and thus hardly runs into memory issues.
    Also on the client, the records are processed bucket by bucket and loaded into tmp files. So if your bucket size is larger than the heap size specified for your client, it'd throw an out-of-memory error.
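A couple of properties bound this client-side local task; a hedged sketch (the values shown are the usual defaults of that Hive era, given here only for illustration):

```sql
-- Largest table size Hive will consider for an automatic map join
set hive.mapjoin.smalltable.filesize = 25000000;
-- Fraction of the local task's heap the hash table may use before the task aborts
set hive.mapjoin.localtask.max.memory.usage = 0.90;
```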
 
Regards
Bejoy KS
 

________________________________
 
From:Amit Sharma <amitsharma1708@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com> 
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory
 
I am experiencing similar behavior in my queries. All the conditions for a bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that instead of a single hash table, multiple hash tables are created. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers via the distributed cache.

Can I find an example anywhere that shows a bucketed map join that does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit
On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:
Hi
    On first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work:
1) Both tables are bucketed on the join columns
2) The number of buckets in one table is a multiple of the number in the other
3) The tables have enough buckets
 
Note: If the data is large, say 1TB per table, and you have just a few buckets, say 100, each mapper may have to load >10GB. This would definitely blow your JVM. The bottom line is to ensure your mappers are not heavily loaded by the bucketed data distribution.
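The conditions above can be sketched as DDL; the table names, column list, and bucket counts here are hypothetical, chosen so that 64 is a multiple of 32:

```sql
CREATE TABLE big_tbl (calling STRING, total_volume BIGINT)
  CLUSTERED BY (calling) INTO 64 BUCKETS;
CREATE TABLE small_tbl (calling STRING, total_volume BIGINT)
  CLUSTERED BY (calling) INTO 32 BUCKETS;

set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing = true;

-- Each mapper then loads only the matching bucket(s) of small_tbl
SELECT /*+ MAPJOIN(b) */ *
FROM big_tbl a JOIN small_tbl b ON (a.calling = b.calling);
```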

Regards
Bejoy.K.S

________________________________
 
From:binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory
 
I have 2 tables, each with 6 million records, clustered into 10 buckets.
 
These tables are very simple, with 1 key column and 1 value column. All I want is the keys that exist in both tables but with different values.
 
A normal join did the trick, taking only 141 secs.
 
select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
 
I tried to use a bucketed map join by setting:   set hive.optimize.bucketmapjoin = true
 
select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where  a.total_volume <> b.total_volume;
 
2012-03-30 11:35:09     Starting to launch local task to process map join;      maximum memory = 1398145024
2012-03-30 11:35:12     Processing rows:        200000  Hashtable size: 199999  Memory usage:   86646704        rate:   0.062
2012-03-30 11:35:15     Processing rows:        300000  Hashtable size: 299999  Memory usage:   128247464       rate:   0.092
2012-03-30 11:35:18     Processing rows:        400000  Hashtable size: 399999  Memory usage:   174041744       rate:   0.124
2012-03-30 11:35:21     Processing rows:        500000  Hashtable size: 499999  Memory usage:   214140840       rate:   0.153
2012-03-30 11:35:25     Processing rows:        600000  Hashtable size: 599999  Memory usage:   255181504       rate:   0.183
2012-03-30 11:35:29     Processing rows:        700000  Hashtable size: 699999  Memory usage:   296744320       rate:   0.212
2012-03-30 11:35:35     Processing rows:        800000  Hashtable size: 799999  Memory usage:   342538616       rate:   0.245
2012-03-30 11:35:38     Processing rows:        900000  Hashtable size: 899999  Memory usage:   384138552       rate:   0.275
2012-03-30 11:35:45     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   425719576       rate:   0.304
2012-03-30 11:35:50     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   467319576       rate:   0.334
2012-03-30 11:35:56     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   508940504       rate:   0.364
2012-03-30 11:36:04     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   550521128       rate:   0.394
2012-03-30 11:36:09     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   592121128       rate:   0.424
2012-03-30 11:36:15     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   633720336       rate:   0.453
2012-03-30 11:36:22     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   692097568       rate:   0.495
2012-03-30 11:36:33     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   725308944       rate:   0.519
2012-03-30 11:36:40     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   766946424       rate:   0.549
2012-03-30 11:36:48     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   808527928       rate:   0.578
2012-03-30 11:36:55     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   850127928       rate:   0.608
2012-03-30 11:37:08     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   891708856       rate:   0.638
2012-03-30 11:37:16     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   933308856       rate:   0.668
2012-03-30 11:37:25     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   974908856       rate:   0.697
2012-03-30 11:37:34     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   1016529448      rate:   0.727
2012-03-30 11:37:43     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   1058129496      rate:   0.757
2012-03-30 11:37:58     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   1099708832      rate:   0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
 
My system has 4 PCs, each with a CPU E2180, 2GB RAM, and an 80GB HDD. One of them hosts the NameNode, JobTracker, and Hive Server, and all of them run a DataNode and TaskTracker.
 
In all nodes, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).
 
I want to ask you experts: why does the bucketed map join consume so much memory? Am I doing something wrong, or is my configuration bad?
 
Best regards,
 
 
 
 


 
-- 
Nitin Pawar


 
-- 
Nitin Pawar