hive-user mailing list archives

From binhnt22 <Binhn...@viettel.com.vn>
Subject RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
Date Wed, 11 Apr 2012 01:36:05 GMT
Hi Ladda,

 

Your case is pretty simple: once you give your tables aliases (a11, a12), you should use the alias in the MAPJOIN hint.

 

That means your SQL should look like this:

 

select /*+ MAPJOIN(a11) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;

 

Best regards

Nguyen Thanh Binh (Mr)

 

From: Ladda, Anand [mailto:lanand@microstrategy.com] 
Sent: Wednesday, April 11, 2012 3:23 AM
To: user@hive.apache.org
Subject: RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory

 

Hi Bejoy/Binh

Been following this thread to better understand where bucket map join would help, and it's been a great thread to follow. I have been struggling with this on my end as well.

 

I have two tables, one about 22GB in size (orderdetailpartclust2) and the other 1.5GB (orderfactpartclust2), all partitions combined, and I wanted to see the impact of different kinds of joins on one of the partitions of these tables.

 

For this analysis I created a partitioned (on order_date) and bucketed (on order_id, the column I want to join these tables on) version of each table. The data was loaded from their non-partitioned counterparts with the following parameters set, to ensure it lands in the right partitions and is bucketed correctly by Hive:

 

set hive.exec.dynamic.partition.mode=nonstrict;

set hive.exec.dynamic.partition=true;

SET hive.exec.max.dynamic.partitions=100000;

SET hive.exec.max.dynamic.partitions.pernode=100000;

set hive.enforce.bucketing = true;
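
For reference, a create/load sequence along the lines of the sketch below should reproduce this layout (the source table name orderdetail_np is just a placeholder for the non-partitioned counterpart; the column list is the one shown in the DESCRIBE output further down):

CREATE TABLE orderdetailpartclust2 (
  order_id INT, item_id INT, emp_id INT, promotion_id INT, customer_id INT,
  qty_sold FLOAT, unit_price FLOAT, unit_cost FLOAT, discount FLOAT)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 256 BUCKETS
STORED AS RCFILE;

-- with the dynamic-partition and bucketing settings above in effect
INSERT OVERWRITE TABLE orderdetailpartclust2 PARTITION (order_date)
SELECT order_id, item_id, emp_id, promotion_id, customer_id,
       qty_sold, unit_price, unit_cost, discount, order_date
FROM orderdetail_np;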

 

However, when I run the following join query, I don't get a bucketed map-side join:

 

select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;

 

Below are the relevant pieces of information on each of these tables. Can you please take a look and see what I might be missing to get map-side joins? Is it because my tables are also partitioned that this isn't working?

 

 

1.       hive> describe formatted orderdetailpartclust2;

OK

# col_name              data_type               comment

 

order_id                int                     from deserializer

item_id                 int                     from deserializer

emp_id                  int                     from deserializer

promotion_id            int                     from deserializer

customer_id             int                     from deserializer

qty_sold                float                   from deserializer

unit_price              float                   from deserializer

unit_cost               float                   from deserializer

discount                float                   from deserializer

 

# Partition Information

# col_name              data_type               comment

 

order_date              string                  None

 

# Detailed Table Information

Database:               default

Owner:                  hdfs

CreateTime:             Thu Apr 05 17:01:22 EDT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2

Table Type:             MANAGED_TABLE

Table Parameters:

        SORTBUCKETCOLSPREFIX    TRUE

        numFiles                19200

        numPartitions           75

        numRows                 0

        totalSize               22814162038

        transient_lastDdlTime   1333725153

 

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe

InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [order_id]

Sort Columns:           [Order(col:order_id, order:1)]

Storage Desc Params:

        escape.delim            \\

        field.delim             \t

        serialization.format    \t

Time taken: 3.255 seconds

2.       hive> describe formatted orderfactpartclust2;

OK

# col_name              data_type               comment

 

order_id                int                     from deserializer

emp_id                  int                     from deserializer

order_amt               float                   from deserializer

order_cost              float                   from deserializer

qty_sold                float                   from deserializer

freight                 float                   from deserializer

gross_dollar_sales      float                   from deserializer

ship_date               string                  from deserializer

rush_order              string                  from deserializer

customer_id             int                     from deserializer

pymt_type               int                     from deserializer

shipper_id              int                     from deserializer

 

# Partition Information

# col_name              data_type               comment

 

order_date              string                  None

 

# Detailed Table Information

Database:               default

Owner:                  hdfs

CreateTime:             Thu Apr 05 18:09:28 EDT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2

Table Type:             MANAGED_TABLE

Table Parameters:

        SORTBUCKETCOLSPREFIX    TRUE

        numFiles                7680

        numPartitions           30

        numRows                 0

        totalSize               1528946078

        transient_lastDdlTime   1333722539

 

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe

InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [order_id]

Sort Columns:           [Order(col:order_id, order:1)]

Storage Desc Params:

        escape.delim            \\

        field.delim             \t

        serialization.format    \t

Time taken: 1.737 seconds

 

3.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2;

299867901   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=01-01-2008

.

.

.

311033139   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008

4.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008;

Found 256 items

1213444     hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008/000000_0

.

.

.

1213166     hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008/000255_0

-bash-4.1$

5.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2;

Found 30 items

50943109    hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-01

.

.

.

50902368    hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30

6.       bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30;

Found 256 items

198692      hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30/000000_0

.

.

.

198954      hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30/000255_0

 

7.       -bash-4.1$ cat hive-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

<configuration>

 

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->

<!-- that are implied by Hadoop setup variables.                                                -->

<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->

<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->

<!-- resource).                                                                                 -->

 

<!-- Hive Execution Parameters -->

 

<property>

  <name>javax.jdo.option.ConnectionURL</name>

  <!-- jdbc:derby:/hadoophome/metastore_db;create=true -->

  <value>jdbc:derby://hadoop010:1527/;databaseName=metastore_db;create=true</value>

  <description>JDBC connect string for a JDBC metastore</description>

</property>

 

<property>

  <name>javax.jdo.option.ConnectionDriverName</name>

  <value>org.apache.derby.jdbc.EmbeddedDriver</value>

  <description>Driver class name for a JDBC metastore</description>

</property>

 

<property>

  <name>hive.hwi.war.file</name>

  <value>/usr/lib/hive/lib/hive-hwi-0.7.0-cdh3u0.war</value>

  <description>This is the WAR file with the jsp content for Hive Web Interface</description>

</property>

 

</configuration>

 

8.       Performing Join

 

hive> set hive.optimize.bucketmapjoin=true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

hive> select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;

Total MapReduce jobs = 2

Launching Job 1 out of 2

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapred.reduce.tasks=<number>

Starting Job = job_201202131643_1294, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1294

Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1294

2012-04-10 16:15:06,663 Stage-1 map = 0%,  reduce = 0%

2012-04-10 16:15:08,671 Stage-1 map = 1%,  reduce = 0%

2012-04-10 16:15:09,675 Stage-1 map = 3%,  reduce = 0%

2012-04-10 16:15:10,679 Stage-1 map = 4%,  reduce = 0%

2012-04-10 16:15:11,683 Stage-1 map = 5%,  reduce = 0%

2012-04-10 16:15:12,688 Stage-1 map = 7%,  reduce = 0%

2012-04-10 16:15:13,692 Stage-1 map = 8%,  reduce = 0%

2012-04-10 16:15:14,697 Stage-1 map = 10%,  reduce = 0%

2012-04-10 16:15:15,756 Stage-1 map = 12%,  reduce = 0%

2012-04-10 16:15:16,761 Stage-1 map = 13%,  reduce = 0%

2012-04-10 16:15:17,767 Stage-1 map = 14%,  reduce = 0%

2012-04-10 16:15:18,773 Stage-1 map = 16%,  reduce = 0%

2012-04-10 16:15:19,778 Stage-1 map = 17%,  reduce = 1%

2012-04-10 16:15:20,784 Stage-1 map = 18%,  reduce = 1%

2012-04-10 16:15:21,789 Stage-1 map = 20%,  reduce = 1%

2012-04-10 16:15:22,795 Stage-1 map = 21%,  reduce = 5%

2012-04-10 16:15:23,800 Stage-1 map = 23%,  reduce = 5%

2012-04-10 16:15:24,805 Stage-1 map = 24%,  reduce = 5%

2012-04-10 16:15:25,936 Stage-1 map = 25%,  reduce = 8%

2012-04-10 16:15:26,941 Stage-1 map = 27%,  reduce = 8%

2012-04-10 16:15:27,947 Stage-1 map = 28%,  reduce = 8%

2012-04-10 16:15:28,951 Stage-1 map = 30%,  reduce = 8%

2012-04-10 16:15:29,956 Stage-1 map = 31%,  reduce = 8%

2012-04-10 16:15:30,981 Stage-1 map = 32%,  reduce = 8%

2012-04-10 16:15:31,987 Stage-1 map = 34%,  reduce = 8%

2012-04-10 16:15:32,992 Stage-1 map = 35%,  reduce = 10%

2012-04-10 16:15:33,998 Stage-1 map = 37%,  reduce = 10%

2012-04-10 16:15:35,003 Stage-1 map = 38%,  reduce = 10%

2012-04-10 16:15:36,055 Stage-1 map = 40%,  reduce = 10%

2012-04-10 16:15:37,097 Stage-1 map = 42%,  reduce = 10%

2012-04-10 16:15:38,102 Stage-1 map = 43%,  reduce = 10%

2012-04-10 16:15:39,108 Stage-1 map = 44%,  reduce = 10%

2012-04-10 16:15:40,113 Stage-1 map = 46%,  reduce = 10%

2012-04-10 16:15:41,123 Stage-1 map = 47%,  reduce = 10%

2012-04-10 16:15:42,128 Stage-1 map = 49%,  reduce = 15%

2012-04-10 16:15:43,134 Stage-1 map = 50%,  reduce = 15%

2012-04-10 16:15:44,139 Stage-1 map = 53%,  reduce = 15%

2012-04-10 16:15:46,152 Stage-1 map = 54%,  reduce = 15%

2012-04-10 16:15:47,158 Stage-1 map = 57%,  reduce = 15%

2012-04-10 16:15:48,164 Stage-1 map = 58%,  reduce = 15%

2012-04-10 16:15:49,171 Stage-1 map = 60%,  reduce = 15%

2012-04-10 16:15:50,176 Stage-1 map = 61%,  reduce = 15%

2012-04-10 16:15:51,182 Stage-1 map = 63%,  reduce = 19%

2012-04-10 16:15:52,199 Stage-1 map = 65%,  reduce = 19%

2012-04-10 16:15:53,222 Stage-1 map = 66%,  reduce = 19%

2012-04-10 16:15:54,228 Stage-1 map = 68%,  reduce = 19%

2012-04-10 16:15:55,234 Stage-1 map = 70%,  reduce = 19%

2012-04-10 16:15:56,241 Stage-1 map = 71%,  reduce = 19%

2012-04-10 16:15:57,248 Stage-1 map = 73%,  reduce = 21%

2012-04-10 16:15:58,253 Stage-1 map = 75%,  reduce = 21%

2012-04-10 16:15:59,260 Stage-1 map = 76%,  reduce = 21%

2012-04-10 16:16:00,267 Stage-1 map = 79%,  reduce = 21%

2012-04-10 16:16:01,273 Stage-1 map = 80%,  reduce = 21%

2012-04-10 16:16:02,280 Stage-1 map = 81%,  reduce = 21%

2012-04-10 16:16:03,287 Stage-1 map = 83%,  reduce = 27%

2012-04-10 16:16:04,294 Stage-1 map = 84%,  reduce = 27%

2012-04-10 16:16:05,302 Stage-1 map = 86%,  reduce = 27%

2012-04-10 16:16:06,310 Stage-1 map = 87%,  reduce = 27%

2012-04-10 16:16:07,317 Stage-1 map = 90%,  reduce = 27%

2012-04-10 16:16:08,325 Stage-1 map = 91%,  reduce = 27%

2012-04-10 16:16:09,332 Stage-1 map = 92%,  reduce = 27%

2012-04-10 16:16:10,339 Stage-1 map = 94%,  reduce = 27%

2012-04-10 16:16:11,348 Stage-1 map = 95%,  reduce = 27%

2012-04-10 16:16:12,355 Stage-1 map = 97%,  reduce = 29%

2012-04-10 16:16:13,362 Stage-1 map = 99%,  reduce = 29%

2012-04-10 16:16:14,370 Stage-1 map = 100%,  reduce = 29%

2012-04-10 16:16:18,396 Stage-1 map = 100%,  reduce = 32%

2012-04-10 16:16:24,654 Stage-1 map = 100%,  reduce = 67%

2012-04-10 16:16:27,683 Stage-1 map = 100%,  reduce = 70%

2012-04-10 16:16:30,701 Stage-1 map = 100%,  reduce = 73%

2012-04-10 16:16:33,719 Stage-1 map = 100%,  reduce = 77%

2012-04-10 16:16:36,739 Stage-1 map = 100%,  reduce = 80%

2012-04-10 16:16:39,781 Stage-1 map = 100%,  reduce = 84%

2012-04-10 16:16:42,806 Stage-1 map = 100%,  reduce = 88%

2012-04-10 16:16:45,824 Stage-1 map = 100%,  reduce = 92%

2012-04-10 16:16:48,840 Stage-1 map = 100%,  reduce = 97%

2012-04-10 16:16:50,854 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_201202131643_1294

Launching Job 2 out of 2

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapred.reduce.tasks=<number>

Starting Job = job_201202131643_1295, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1295

Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1295

2012-04-10 16:16:56,693 Stage-2 map = 0%,  reduce = 0%

2012-04-10 16:17:02,716 Stage-2 map = 100%,  reduce = 0%

2012-04-10 16:17:12,759 Stage-2 map = 100%,  reduce = 100%

Ended Job = job_201202131643_1295

OK

1       678832  67850   678832.0

2       1360529 135253  1360529.0

3       4784635 460994  4784635.0

Time taken: 131.748 seconds

hive>

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Tuesday, April 10, 2012 11:44 AM
To: user@hive.apache.org
Subject: [Marketing Mail] Re: Why BucketJoinMap consume too much memory

 

Hi Binh

      You are right, here both of your tables are about the same size. Loading 2GB of data into hash tables and then into temp files and so on takes some time. That time becomes negligible if, say, one table were 2GB and the other 2TB; then you would notice the wide difference in performance between a common join and a bucketed map join.

      If one of the tables is very small, a plain map join is a good fit; if it is of moderate size, then a bucketed map join.
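
To illustrate the difference with the tables from this thread (a rough sketch; the query is the same, only the settings change):

-- moderate-sized table: bucketed map join, so each mapper loads only the matching bucket
set hive.optimize.bucketmapjoin = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select /*+ MAPJOIN(b) */ a.calling from ra_md_syn a join ra_ocs_syn b on (a.calling = b.calling);

-- truly small table: a plain map join (hive.optimize.bucketmapjoin left at its default of false) is enough
select /*+ MAPJOIN(b) */ a.calling from ra_md_syn a join ra_ocs_syn b on (a.calling = b.calling);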

 

Regards

Bejoy KS

 

  _____  

From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Cc: 'Bejoy Ks' <bejoy_ks@yahoo.com> 
Sent: Tuesday, April 10, 2012 8:10 AM
Subject: RE: Why BucketJoinMap consume too much memory

 

Hi Bejoy,

 

It worked like a charm. Thank you very much. I really really appreciate your help.

 

This bucket map join should be used with one big table and one small table.

 

If both tables are big, the join time can be much longer than a normal join.

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Monday, April 09, 2012 9:49 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh

     It is just an issue with the number of buckets. Your tables have just 8 buckets, as only 8 files are seen in the storage directory. You might have just issued an ALTER TABLE on an existing bucketed table. The workaround here is:

 

1) You need to wipe and reload the tables with hive.enforce.bucketing=true;

       Ensure your storage directory has as many files as the number of buckets. As per your table DDL you should see 256 files.

 

2) Enable hive.optimize.bucketmapjoin = true; and try doing the join again.

 

It should definitely work.
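
As a sketch of step 1 (ra_md_src and ra_ocs_src are only placeholders for wherever the raw data currently lives; the columns match your DESCRIBE output):

set hive.enforce.bucketing = true;
-- reload so that Hive writes one file per bucket (256 files per table)
insert overwrite table ra_md_syn
select calling, total_duration, total_volume, total_charge from ra_md_src;
insert overwrite table ra_ocs_syn
select calling, total_duration, total_volume, total_charge from ra_ocs_src;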

 

Regards

Bejoy KS

 

  _____  

From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Cc: 'Bejoy Ks' <bejoy_ks@yahoo.com> 
Sent: Monday, April 9, 2012 8:42 AM
Subject: RE: Why BucketJoinMap consume too much memory

 

Hi Bejoy,


Thank you for helping me. Here is the information 

 

1.      Describe Formatted ra_md_syn;

# col_name              data_type               comment

 

calling                 string                  None

total_duration          bigint                  None

total_volume            bigint                  None

total_charge            bigint                  None

 

# Detailed Table Information

Database:               default

Owner:                  hduser

CreateTime:             Thu Apr 05 09:48:29 ICT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://master:54310/user/hive/warehouse/ra_md_syn

Table Type:             MANAGED_TABLE

Table Parameters:

        numFiles                8

        numPartitions           0

        numRows                 0

        rawDataSize             0

        totalSize               1872165483

        transient_lastDdlTime   1333595095

 

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat:            org.apache.hadoop.mapred.TextInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [calling]

Sort Columns:           []

Storage Desc Params:

        serialization.format    1

 

2.      Describe Formatted ra_ocs_syn;

# col_name              data_type               comment

 

calling                 string                  None

total_duration          bigint                  None

total_volume            bigint                  None

total_charge            bigint                  None

 

# Detailed Table Information

Database:               default

Owner:                  hduser

CreateTime:             Thu Apr 05 09:48:24 ICT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://master:54310/user/hive/warehouse/ra_ocs_syn

Table Type:             MANAGED_TABLE

Table Parameters:

        numFiles                8

        numPartitions           0

        numRows                 0

        rawDataSize             0

        totalSize               1872162225

        transient_lastDdlTime   1333595512

 

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat:            org.apache.hadoop.mapred.TextInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [calling]

Sort Columns:           []

Storage Desc Params:

        serialization.format    1

 

3.      hadoop fs -du <hdfs location of ra_md_syn >

[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn

Found 8 items

280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1

280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0

274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1

274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1

269949415   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0

262439067   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0

205767721   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0

24516526    hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0

 

4.      hadoop fs -du <hdfs location of  ra_ocs_syn >

[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn

Found 8 items

314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0

314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0

304959363   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0

274374381   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1

264694474   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0

257498693   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1

100334604   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0

41022170    hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0

 

5.      hive-site.xml

[hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

<configuration>

        <property>

                <name>hive.metastore.local</name>

                <value>true</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionURL</name>

        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionDriverName</name>

        <value>com.mysql.jdbc.Driver</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionUserName</name>

        <value>hadoop</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionPassword</name>

        <value>hadoop</value>

        </property>

 

 

        <property>

          <name>hive.metastore.sasl.enabled</name>

          <value>true</value>

          <description>If true, the metastore thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description>

        </property>

 

        <property>

          <name>hive.metastore.kerberos.keytab.file</name>

          <value></value>

          <description>The path to the Kerberos Keytab file containing the metastore thrift server's service principal.</description>

        </property>

 

        <property>

          <name>hive.metastore.kerberos.principal</name>

          <value>hduser/admin@EXAMPLE.COM</value>

          <description>The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.</description>

        </property>

</configuration>

 

6.      Performing JOIN

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

Total MapReduce jobs = 1

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.

Execution log at: /tmp/hduser/hduser_20120409013131_da40d787-597d-490c-8558-9d10ec11e916.log

2012-04-09 01:31:24     Starting to launch local task to process map join;      maximum memory = 1398145024

2012-04-09 01:31:28     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75405504        rate:   0.054

2012-04-09 01:31:29     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111540296       rate:   0.08

2012-04-09 01:31:32     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151640080       rate:   0.108

2012-04-09 01:31:35     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185503416       rate:   0.133

2012-04-09 01:31:37     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221503440       rate:   0.158

2012-04-09 01:31:42     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257484264       rate:   0.184

2012-04-09 01:31:47     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297678568       rate:   0.213

2012-04-09 01:31:52     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333678592       rate:   0.239

2012-04-09 01:31:57     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369678568       rate:   0.264

2012-04-09 01:32:03     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405678568       rate:   0.29

2012-04-09 01:32:09     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441678592       rate:   0.316

2012-04-09 01:32:15     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477678568       rate:   0.342

2012-04-09 01:32:23     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513678592       rate:   0.367

2012-04-09 01:32:29     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549678568       rate:   0.393

2012-04-09 01:32:35     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602455824       rate:   0.431

2012-04-09 01:32:45     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630067176       rate:   0.451

2012-04-09 01:32:53     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666067176       rate:   0.476

2012-04-09 01:33:01     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702067200       rate:   0.502

2012-04-09 01:33:09     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738067176       rate:   0.528

2012-04-09 01:33:20     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774254456       rate:   0.554

2012-04-09 01:33:29     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810067176       rate:   0.579

2012-04-09 01:33:38     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846568480       rate:   0.605

2012-04-09 01:33:49     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882096752       rate:   0.631

2012-04-09 01:33:59     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918821920       rate:   0.657

2012-04-09 01:34:15     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954134920       rate:   0.682

2012-04-09 01:34:26     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990067168       rate:   0.708

2012-04-09 01:34:38     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1027113288     rate:   0.735

Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space

        at java.util.Arrays.copyOf(Arrays.java:2882)

        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)

        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:597)

        at java.lang.StringBuilder.append(StringBuilder.java:212)

        at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:247)

        at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:232)

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:315)

        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:310)

        at java.util.jar.Manifest.read(Manifest.java:178)

        at java.util.jar.Manifest.<init>(Manifest.java:52)

        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:167)

        at java.util.jar.JarFile.getManifest(JarFile.java:148)

        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:696)

        at java.net.URLClassLoader.defineClass(URLClassLoader.java:228)

        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)

        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)

        at java.security.AccessController.doPrivileged(Native Method)

        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)

        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

        at org.apache.hadoop.util.RunJar$1.run(RunJar.java:126)

Execution failed with exit status: 2

Obtaining error information

 

Task failed!

Task ID:

  Stage-3

 

Logs:

 

/tmp/hduser/hive.log

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

7.      /tmp/hduser/hive.log

2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.

2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.

2012-04-09 02:00:12,796 WARN  parse.SemanticAnalyzer (SemanticAnalyzer.java:genBodyPlan(5821)) - Common Gby keys:null

2012-04-09 02:09:02,356 ERROR exec.Task (SessionState.java:printError(380)) - Execution failed with exit status: 2

2012-04-09 02:09:02,357 ERROR exec.Task (SessionState.java:printError(380)) - Obtaining error information

2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) -

Task failed!

Task ID:

  Stage-3

 

Logs:

 

2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) - /tmp/hduser/hive.log

2012-04-09 02:09:02,359 ERROR exec.MapredLocalTask (MapredLocalTask.java:execute(228)) - Execution failed with exit status: 2

2012-04-09 02:09:02,377 ERROR ql.Driver (SessionState.java:printError(380)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

 

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Friday, April 06, 2012 11:33 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh,

        From the information you provided, a bucketed map join should be possible. I'm a bit clueless now, but I can still make one more try if you could provide the output of the following:

 

1) Describe Formatted ra_md_syn;

2) Describe Formatted ra_ocs_syn.

 

3) hadoop fs -du <hdfs location of ra_ocs_syn >

4) hadoop fs -du <hdfs location of  ra_md_syn >

 

5) perform the join and paste the full console log along with the query. (with all the properties set at CLI)

 

6) your hive-site.xml

 

@Alex

       You can use non-equality conditions in the WHERE clause. Only the ON conditions need to be equality ones.
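
For instance, with the tables from this thread (a sketch):

select /*+ MAPJOIN(b) */ a.calling, a.total_volume, b.total_volume
from ra_md_syn a join ra_ocs_syn b
on (a.calling = b.calling)                -- only the equality condition goes in ON
where a.total_volume <> b.total_volume;   -- the non-equality condition goes in WHERE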

 

 

Regards

Bejoy KS

 

 

  _____  

From: gemini alex <gemini5201314@gmail.com>
To: user@hive.apache.org 
Cc: Bejoy Ks <bejoy_ks@yahoo.com> 
Sent: Friday, April 6, 2012 12:36 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

I guess the problem is that you can't use a <> predicate in a bucket join; try:

select c.* from (

select /*+ MAPJOIN(b) */ a.calling calling, a.total_volume atotal_volume, b.total_volume btotal_volume from ra_md_syn a join ra_ocs_syn b

     on (a.calling = b.calling) ) c where c.atotal_volume <> c.btotal_volume;

 

 

 

On Apr 6, 2012 at 9:19 AM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Hi Bejoy,

 

Sorry for the late response. Let me go over the setup again to clarify some information.

 

I have 2 tables that are nearly identical. Both have the same structure, 65m records, and the same size of about 2GB.

hive> describe ra_md_syn;

OK

calling string

total_duration  bigint

total_volume    bigint

total_charge    bigint

 

Both of them were bucketed into 256 buckets on the ‘calling’ column (last time there were only 10 buckets; I increased the count as you suggested). I want to find every ‘calling’ that exists in both tables but with a different ‘total_volume’.

The script, as you know, is:

 

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

And the result was exactly as in my last email: a Java heap space error. With a total size of only 2GB and 256 buckets, I don't think bucket size can be the issue here.

 

Please give me some advice; I would really appreciate it.

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Thursday, April 05, 2012 7:23 PM


To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh

 

    I was just checking your local map join log, and I noticed two things:

- the memory usage by one hash table has gone beyond 1GB

- the number of rows processed is only about 2M

 

It is possible that each bucket itself is too large to be loaded into memory.

 

As a workaround, or to nail down whether bucket size is the issue here, can you try increasing the number of buckets to 100 and doing a bucketed map join again?

 

Also, you mentioned the data size is 2GB; is that the compressed data size?

 

2012-04-05 10:41:07     Processing rows:        2,900,000 Hashtable size: 2899999 Memory usage:   1,062,065,576      rate:   0.76

 

Regards

Bejoy KS

 

 

 

  _____  

From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org 
Sent: Thursday, April 5, 2012 5:03 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

Can you tell me the size of table b? 

 

If you are doing bucketing and the table b side is still huge, then you will hit this problem.

On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Thank Nitin,

 

I tried but no luck. Here’s the Hive log; please spend a little time looking at it.

 

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

Total MapReduce jobs = 1

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.

Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log

2012-04-05 10:37:45     Starting to launch local task to process map join;      maximum memory = 1398145024

2012-04-05 10:37:48     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75403880        rate:   0.054

2012-04-05 10:37:50     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111404664       rate:   0.08

2012-04-05 10:37:54     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151598960       rate:   0.108

2012-04-05 10:38:04     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185483368       rate:   0.133

2012-04-05 10:38:09     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221483392       rate:   0.158

2012-04-05 10:38:13     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257482640       rate:   0.184

2012-04-05 10:38:19     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297676944       rate:   0.213

2012-04-05 10:38:22     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333676968       rate:   0.239

2012-04-05 10:38:27     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369676944       rate:   0.264

2012-04-05 10:38:31     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405676968       rate:   0.29

2012-04-05 10:38:36     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441676944       rate:   0.316

2012-04-05 10:38:42     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477676944       rate:   0.342

2012-04-05 10:38:47     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513676968       rate:   0.367

2012-04-05 10:38:52     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549676944       rate:   0.393

2012-04-05 10:39:00     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602454200       rate:   0.431

2012-04-05 10:39:08     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630065552       rate:   0.451

2012-04-05 10:39:14     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666065552       rate:   0.476

2012-04-05 10:39:20     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702065552       rate:   0.502

2012-04-05 10:39:26     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738065576       rate:   0.528

2012-04-05 10:39:36     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774065552       rate:   0.554

2012-04-05 10:39:43     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810065552       rate:   0.579

2012-04-05 10:39:51     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846065576       rate:   0.605

2012-04-05 10:40:16     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882085136       rate:   0.631

2012-04-05 10:40:24     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918085208       rate:   0.657

2012-04-05 10:40:39     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954065544       rate:   0.682

2012-04-05 10:40:48     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990065568       rate:   0.708

2012-04-05 10:40:56     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1026065552      rate:   0.734

2012-04-05 10:41:07     Processing rows:        2900000 Hashtable size: 2899999 Memory usage:   1062065576      rate:   0.76

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Nitin Pawar [mailto:nitinpawar432@gmail.com] 
Sent: Thursday, April 05, 2012 5:36 PM


To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

can you try adding these settings 

set hive.enforce.bucketing=true;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

 

I have tried bucketing with 1000 buckets on tables of more than 1TB of data, and they go through fine.

 

 

On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Hi Bejoy,

 

Both my tables have 65m records (~1.8-1.9GB on Hadoop) and are bucketed on the ‘calling’ column into 10 buckets.

 

As you said, Hive will load only 1 bucket, ~180-190MB, into memory. That should hardly blow the heap (1.3GB).

 

According to the wiki, I set:

 

  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

  set hive.optimize.bucketmapjoin = true;

  set hive.optimize.bucketmapjoin.sortedmerge = true;

 

And run the following SQL

 

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b 

on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

But it still created many hash tables and then threw a Java heap space error.

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org


Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Amit

 

      Sorry for the delayed response; I had a terrible schedule. AFAIK, there are no flags that would let you move the hash table creation, compression, and loading into tmp files away from the client node.

      From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1GB and this small table is 5GB: it would blow up your job if the map tasks tried to pull such a huge file into memory. Bucketed map join can help here: if the table is bucketed, say into 100 buckets, then each bucket may hold around 50MB of data, i.e. one tmp file would be just under 50MB. A mapper then needs to load only the required buckets in memory and thus hardly runs into memory issues.

    Also, on the client, the records are processed bucket by bucket and loaded into tmp files. So if your bucket size is larger than the heap size specified for your client, it will throw an out-of-memory error.

 

Regards

Bejoy KS

 

  _____  

From: Amit Sharma <amitsharma1708@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com> 
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

I am experiencing similar behavior in my queries. All the conditions for bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that instead of a single hash table, multiple hash tables are created. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.

Can I find an example anywhere that shows the behavior of bucketed map join where it does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit

On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:

Hi
    On a first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work:
1) Both tables are bucketed on the join columns
2) The number of buckets in each table should be a multiple of the other (a small sketch follows the note below)
3) Ensure that the tables have enough buckets

Note: If the data is large, say 1TB per table, and you have just a few buckets, say 100, each mapper may have to load 10GB or more. This would definitely blow your JVM. The bottom line is to ensure your mappers are not heavily loaded by the bucketed data distribution.
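
A minimal sketch of condition 2, with hypothetical table names just to show compatible bucket counts (256 is a multiple of 64):

create table big_tbl (calling string, total_volume bigint)
clustered by (calling) into 256 buckets;
create table small_tbl (calling string, total_volume bigint)
clustered by (calling) into 64 buckets;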

Regards
Bejoy.K.S

  _____  

From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

 

I have 2 tables, each with 6 million records, clustered into 10 buckets.

 

These tables are very simple, with 1 key column and 1 value column; all I want is the keys that exist in both tables but with different values.

 

The normal join did the trick, taking only 141 secs.

 

select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

I tried to use bucket map join by setting:   set hive.optimize.bucketmapjoin = true

 

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

2012-03-30 11:35:09     Starting to launch local task to process map join;      maximum memory = 1398145024

2012-03-30 11:35:12     Processing rows:        200000  Hashtable size: 199999  Memory usage:   86646704        rate:   0.062

2012-03-30 11:35:15     Processing rows:        300000  Hashtable size: 299999  Memory usage:   128247464       rate:   0.092

2012-03-30 11:35:18     Processing rows:        400000  Hashtable size: 399999  Memory usage:   174041744       rate:   0.124

2012-03-30 11:35:21     Processing rows:        500000  Hashtable size: 499999  Memory usage:   214140840       rate:   0.153

2012-03-30 11:35:25     Processing rows:        600000  Hashtable size: 599999  Memory usage:   255181504       rate:   0.183

2012-03-30 11:35:29     Processing rows:        700000  Hashtable size: 699999  Memory usage:   296744320       rate:   0.212

2012-03-30 11:35:35     Processing rows:        800000  Hashtable size: 799999  Memory usage:   342538616       rate:   0.245

2012-03-30 11:35:38     Processing rows:        900000  Hashtable size: 899999  Memory usage:   384138552       rate:   0.275

2012-03-30 11:35:45     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   425719576       rate:   0.304

2012-03-30 11:35:50     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   467319576       rate:   0.334

2012-03-30 11:35:56     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   508940504       rate:   0.364

2012-03-30 11:36:04     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   550521128       rate:   0.394

2012-03-30 11:36:09     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   592121128       rate:   0.424

2012-03-30 11:36:15     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   633720336       rate:   0.453

2012-03-30 11:36:22     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   692097568       rate:   0.495

2012-03-30 11:36:33     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   725308944       rate:   0.519

2012-03-30 11:36:40     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   766946424       rate:   0.549

2012-03-30 11:36:48     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   808527928       rate:   0.578

2012-03-30 11:36:55     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   850127928       rate:   0.608

2012-03-30 11:37:08     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   891708856       rate:   0.638

2012-03-30 11:37:16     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   933308856       rate:   0.668

2012-03-30 11:37:25     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   974908856       rate:   0.697

2012-03-30 11:37:34     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   1016529448      rate:   0.727

2012-03-30 11:37:43     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   1058129496      rate:   0.757

2012-03-30 11:37:58     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   1099708832      rate:   0.787

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

 

My system has 4 PCs, each with an E2180 CPU, 2GB of RAM, and an 80GB HDD. One of them hosts the NameNode, JobTracker, and Hive Server, and all of them run a DataNode and TaskTracker.

 

On all nodes, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).

 

I want to ask you experts: why does bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?

 

Best regards,

 

 

 

 





 

-- 
Nitin Pawar





 

-- 
Nitin Pawar

 

 

 

 

 

