hive-user mailing list archives

From binhnt22 <Binhn...@viettel.com.vn>
Subject RE: Why BucketJoinMap consume too much memory
Date Tue, 10 Apr 2012 02:40:33 GMT
Hi Bejoy,

 

It worked like a charm. Thank you very much; I really appreciate your help.

 

This bucket map join should be used with 1 big table and 1 small table.

 

If both tables are big, the join time will be much longer than with a normal join.

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Monday, April 09, 2012 9:49 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh

     It is just an issue with the number of buckets. Your tables have just 8 buckets, as only 8 files are seen in the storage directory. You might have just issued an ALTER TABLE script on an existing bucketed table. The workaround here is:

 

1) You need to wipe and reload the tables with hive.enforce.bucketing=true;

       Ensure your storage directory has as many files as the number of buckets. As per your table DDL you should see 256 files.

 

2) Enable hive.optimize.bucketmapjoin = true; and try doing the join again.

 

It should definitely work.
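
A minimal sketch of that reload, with the DDL reconstructed from the DESCRIBE FORMATTED output below (the staging flow and the table name ra_md_syn_bucketed are illustrative assumptions, not the exact commands used in this thread):

set hive.enforce.bucketing = true;

-- recreate the table bucketed on 'calling' into 256 buckets, per the DDL
CREATE TABLE ra_md_syn_bucketed (
  calling        string,
  total_duration bigint,
  total_volume   bigint,
  total_charge   bigint)
CLUSTERED BY (calling) INTO 256 BUCKETS;

-- reloading through INSERT makes Hive write one file per bucket,
-- so the table directory should end up with 256 files
INSERT OVERWRITE TABLE ra_md_syn_bucketed
SELECT calling, total_duration, total_volume, total_charge
FROM ra_md_syn;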

 

Regards

Bejoy KS

 

  _____  

From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Cc: 'Bejoy Ks' <bejoy_ks@yahoo.com> 
Sent: Monday, April 9, 2012 8:42 AM
Subject: RE: Why BucketJoinMap consume too much memory





Hi Bejoy,


Thank you for helping me. Here is the information 

 

1.      Describe Formatted ra_md_syn;

# col_name              data_type               comment

 

calling                 string                  None

total_duration          bigint                  None

total_volume            bigint                  None

total_charge            bigint                  None

 

# Detailed Table Information

Database:               default

Owner:                  hduser

CreateTime:             Thu Apr 05 09:48:29 ICT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://master:54310/user/hive/warehouse/ra_md_syn

Table Type:             MANAGED_TABLE

Table Parameters:

        numFiles                8

        numPartitions           0

        numRows                 0

        rawDataSize             0

        totalSize               1872165483

        transient_lastDdlTime   1333595095

 

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat:            org.apache.hadoop.mapred.TextInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [calling]

Sort Columns:           []

Storage Desc Params:

        serialization.format    1

 

2.      Describe Formatted ra_ocs_syn;

# col_name              data_type               comment

 

calling                 string                  None

total_duration          bigint                  None

total_volume            bigint                  None

total_charge            bigint                  None

 

# Detailed Table Information

Database:               default

Owner:                  hduser

CreateTime:             Thu Apr 05 09:48:24 ICT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://master:54310/user/hive/warehouse/ra_ocs_syn

Table Type:             MANAGED_TABLE

Table Parameters:

        numFiles                8

        numPartitions           0

        numRows                 0

        rawDataSize             0

        totalSize               1872162225

        transient_lastDdlTime   1333595512

 

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat:            org.apache.hadoop.mapred.TextInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [calling]

Sort Columns:           []

Storage Desc Params:

        serialization.format    1

 

3.      hadoop fs -du <hdfs location of ra_md_syn >

[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn

Found 8 items

280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1

280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0

274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1

274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1

269949415   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0

262439067   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0

205767721   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0

24516526    hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0

 

4.      hadoop fs -du <hdfs location of  ra_ocs_syn >

[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn

Found 8 items

314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0

314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0

304959363   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0

274374381   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1

264694474   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0

257498693   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1

100334604   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0

41022170    hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0

 

5.      hive-site.xml

[hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

<configuration>

        <property>

                <name>hive.metastore.local</name>

                <value>true</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionURL</name>

        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionDriverName</name>

        <value>com.mysql.jdbc.Driver</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionUserName</name>

        <value>hadoop</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionPassword</name>

        <value>hadoop</value>

        </property>

 

 

        <property>

          <name>hive.metastore.sasl.enabled</name>

          <value>true</value>

          <description>If true, the metastore thrift interface will be secured with
SASL. Clients must authenticate with Kerberos.</description>

        </property>

 

        <property>

          <name>hive.metastore.kerberos.keytab.file</name>

          <value></value>

          <description>The path to the Kerberos Keytab file containing the metastore
thrift server's service principal.</description>

        </property>

 

        <property>

          <name>hive.metastore.kerberos.principal</name>

          <value>hduser/admin@EXAMPLE.COM</value>

          <description>The service principal for the metastore thrift server. The special
string _HOST will be replaced automatically with the correct host name.</description>

        </property>

</configuration>

 

6.      Performing JOIN

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

Total MapReduce jobs = 1

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter
in all the log4j.properties files.

Execution log at: /tmp/hduser/hduser_20120409013131_da40d787-597d-490c-8558-9d10ec11e916.log

2012-04-09 01:31:24     Starting to launch local task to process map join;      maximum memory = 1398145024

2012-04-09 01:31:28     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75405504        rate:   0.054

2012-04-09 01:31:29     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111540296       rate:   0.08

2012-04-09 01:31:32     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151640080       rate:   0.108

2012-04-09 01:31:35     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185503416       rate:   0.133

2012-04-09 01:31:37     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221503440       rate:   0.158

2012-04-09 01:31:42     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257484264       rate:   0.184

2012-04-09 01:31:47     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297678568       rate:   0.213

2012-04-09 01:31:52     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333678592       rate:   0.239

2012-04-09 01:31:57     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369678568       rate:   0.264

2012-04-09 01:32:03     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405678568       rate:   0.29

2012-04-09 01:32:09     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441678592       rate:   0.316

2012-04-09 01:32:15     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477678568       rate:   0.342

2012-04-09 01:32:23     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513678592       rate:   0.367

2012-04-09 01:32:29     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549678568       rate:   0.393

2012-04-09 01:32:35     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602455824       rate:   0.431

2012-04-09 01:32:45     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630067176       rate:   0.451

2012-04-09 01:32:53     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666067176       rate:   0.476

2012-04-09 01:33:01     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702067200       rate:   0.502

2012-04-09 01:33:09     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738067176       rate:   0.528

2012-04-09 01:33:20     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774254456       rate:   0.554

2012-04-09 01:33:29     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810067176       rate:   0.579

2012-04-09 01:33:38     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846568480       rate:   0.605

2012-04-09 01:33:49     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882096752       rate:   0.631

2012-04-09 01:33:59     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918821920       rate:   0.657

2012-04-09 01:34:15     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954134920       rate:   0.682

2012-04-09 01:34:26     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990067168       rate:   0.708

2012-04-09 01:34:38     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1027113288     rate:   0.735

Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space

        at java.util.Arrays.copyOf(Arrays.java:2882)

        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)

        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:597)

        at java.lang.StringBuilder.append(StringBuilder.java:212)

        at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:247)

        at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:232)

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:315)

        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:310)

        at java.util.jar.Manifest.read(Manifest.java:178)

        at java.util.jar.Manifest.<init>(Manifest.java:52)

        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:167)

        at java.util.jar.JarFile.getManifest(JarFile.java:148)

        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:696)

        at java.net.URLClassLoader.defineClass(URLClassLoader.java:228)

        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)

        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)

        at java.security.AccessController.doPrivileged(Native Method)

        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)

        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

        at org.apache.hadoop.util.RunJar$1.run(RunJar.java:126)

Execution failed with exit status: 2

Obtaining error information

 

Task failed!

Task ID:

  Stage-3

 

Logs:

 

/tmp/hduser/hive.log

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

7.      /tmp/hduser/hive.log

2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core"
requires "org.eclipse.core.resources" but it cannot be resolved.

2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core"
requires "org.eclipse.core.resources" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core"
requires "org.eclipse.core.runtime" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core"
requires "org.eclipse.core.runtime" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core"
requires "org.eclipse.text" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core"
requires "org.eclipse.text" but it cannot be resolved.

2012-04-09 02:00:12,796 WARN  parse.SemanticAnalyzer (SemanticAnalyzer.java:genBodyPlan(5821))
- Common Gby keys:null

2012-04-09 02:09:02,356 ERROR exec.Task (SessionState.java:printError(380)) - Execution failed
with exit status: 2

2012-04-09 02:09:02,357 ERROR exec.Task (SessionState.java:printError(380)) - Obtaining error
information

2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) -

Task failed!

Task ID:

  Stage-3

 

Logs:

 

2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) - /tmp/hduser/hive.log

2012-04-09 02:09:02,359 ERROR exec.MapredLocalTask (MapredLocalTask.java:execute(228)) - Execution
failed with exit status: 2

2012-04-09 02:09:02,377 ERROR ql.Driver (SessionState.java:printError(380)) - FAILED: Execution
Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

 

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Friday, April 06, 2012 11:33 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh,

        From the information you provided, a bucketed map join should be possible. I'm clueless now, but I can make one more try if you could provide me the output of the following:

 

1) Describe Formatted ra_md_syn;

2) Describe Formatted ra_ocs_syn;

 

3) hadoop fs -du <hdfs location of ra_ocs_syn >

4) hadoop fs -du <hdfs location of  ra_md_syn >

 

5) Perform the join and paste the full console log along with the query (with all the properties set at the CLI)

 

6) your hive-site.xml

 

@Alex

       You can use non-equality conditions in the WHERE clause. Only the ON conditions need to be equality ones.
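
For example, a sketch using this thread's tables (the point being that the equality stays in ON while the <> predicate moves to WHERE):

select /*+ MAPJOIN(b) */ *
from ra_md_syn a join ra_ocs_syn b
on (a.calling = b.calling)               -- equality condition: allowed in ON
where a.total_volume <> b.total_volume;  -- non-equality condition: keep it in WHERE

-- Hive of this era rejects a non-equality predicate inside ON, e.g.:
--   on (a.calling = b.calling and a.total_volume <> b.total_volume)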

 

 

Regards

Bejoy KS

 

 

  _____  

From: gemini alex <gemini5201314@gmail.com>
To: user@hive.apache.org 
Cc: Bejoy Ks <bejoy_ks@yahoo.com> 
Sent: Friday, April 6, 2012 12:36 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

I guess the problem is that you can't use the <> predicate in a bucket join; try:

select c.* from (

select /*+ MAPJOIN(b) */ a.calling calling, a.total_volume atotal_volume, b.total_volume btotal_volume from ra_md_syn a join ra_ocs_syn b

     on (a.calling = b.calling) ) c where c.atotal_volume <> c.btotal_volume;

 

 

 

On April 6, 2012 at 9:19 AM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Hi Bejoy,

 

Sorry for the late response. Let me go over everything again from the start to clear up some information.

 

I have 2 tables, nearly the same. Both have the same table structure, 65m records, and 2GB of data (the same size).

hive> describe ra_md_syn;

OK

calling string

total_duration  bigint

total_volume    bigint

total_charge    bigint

 

Both of them were bucketed into 256 buckets on the ‘calling’ column (last time there were only 10 buckets; I increased the count as you suggested). And I want to find every ‘calling’ that exists in both tables but with a different ‘total_volume’.

The script, as you know:

 

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

And the result was exactly as in my last email: a Java heap space error. With a total size of only 2GB and 256 buckets, I think it is impossible for bucket size to be the issue here.

 

Please give me some advice; I'd really appreciate it.

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Thursday, April 05, 2012 7:23 PM


To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh

 

    I was just checking your local map join log, and I noticed two things:

- the memory usage by one hash table has gone beyond 1GB.

- the number of rows processed is just 2M

 

It is possible that each bucket itself is too large to be loaded into memory.

 

As a workaround, and to nail down whether bucket size is the issue here, can you try increasing the number of buckets to 100 and doing a bucketed map join again?

 

Also, you mentioned the data size is 2GB; is that the compressed data size?

 

2012-04-05 10:41:07     Processing rows:        2,900,000 Hashtable size: 2899999 Memory usage:
  1,062,065,576      rate:   0.76

 

Regards

Bejoy KS

 

 

 

  _____  

From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org 
Sent: Thursday, April 5, 2012 5:03 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

Can you tell me the size of table b? 

 

If you are doing bucketing and the buckets of table b are still huge, then you will hit this problem.

On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Thank Nitin,

 

I tried, but no luck. Here’s the hive log; please spend a little time looking at it.

 

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

Total MapReduce jobs = 1

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter
in all the log4j.properties files.

Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log

2012-04-05 10:37:45     Starting to launch local task to process map join;      maximum memory
= 1398145024

2012-04-05 10:37:48     Processing rows:        200000  Hashtable size: 199999  Memory usage:
  75403880        rate:   0.054

2012-04-05 10:37:50     Processing rows:        300000  Hashtable size: 299999  Memory usage:
  111404664       rate:   0.08

2012-04-05 10:37:54     Processing rows:        400000  Hashtable size: 399999  Memory usage:
  151598960       rate:   0.108

2012-04-05 10:38:04     Processing rows:        500000  Hashtable size: 499999  Memory usage:
  185483368       rate:   0.133

2012-04-05 10:38:09     Processing rows:        600000  Hashtable size: 599999  Memory usage:
  221483392       rate:   0.158

2012-04-05 10:38:13     Processing rows:        700000  Hashtable size: 699999  Memory usage:
  257482640       rate:   0.184

2012-04-05 10:38:19     Processing rows:        800000  Hashtable size: 799999  Memory usage:
  297676944       rate:   0.213

2012-04-05 10:38:22     Processing rows:        900000  Hashtable size: 899999  Memory usage:
  333676968       rate:   0.239

2012-04-05 10:38:27     Processing rows:        1000000 Hashtable size: 999999  Memory usage:
  369676944       rate:   0.264

2012-04-05 10:38:31     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:
  405676968       rate:   0.29

2012-04-05 10:38:36     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:
  441676944       rate:   0.316

2012-04-05 10:38:42     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:
  477676944       rate:   0.342

2012-04-05 10:38:47     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:
  513676968       rate:   0.367

2012-04-05 10:38:52     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:
  549676944       rate:   0.393

2012-04-05 10:39:00     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:
  602454200       rate:   0.431

2012-04-05 10:39:08     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:
  630065552       rate:   0.451

2012-04-05 10:39:14     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:
  666065552       rate:   0.476

2012-04-05 10:39:20     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:
  702065552       rate:   0.502

2012-04-05 10:39:26     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:
  738065576       rate:   0.528

2012-04-05 10:39:36     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:
  774065552       rate:   0.554

2012-04-05 10:39:43     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:
  810065552       rate:   0.579

2012-04-05 10:39:51     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:
  846065576       rate:   0.605

2012-04-05 10:40:16     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:
  882085136       rate:   0.631

2012-04-05 10:40:24     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:
  918085208       rate:   0.657

2012-04-05 10:40:39     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:
  954065544       rate:   0.682

2012-04-05 10:40:48     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:
  990065568       rate:   0.708

2012-04-05 10:40:56     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:
  1026065552      rate:   0.734

2012-04-05 10:41:07     Processing rows:        2900000 Hashtable size: 2899999 Memory usage:
  1062065576      rate:   0.76

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Nitin Pawar [mailto:nitinpawar432@gmail.com] 
Sent: Thursday, April 05, 2012 5:36 PM


To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

can you try adding these settings:

set hive.enforce.bucketing=true;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

 

I have tried bucketing with 1000 buckets on tables with more than 1TB of data, and they go through fine.

 

 

On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Hi Bejoy,

 

Both my tables have 65m records (~1.8-1.9GB on Hadoop) and are bucketed on the ‘calling’ column into 10 buckets.

 

As you said, Hive should load only 1 bucket (~180-190MB) into memory. That is hardly enough to blow the heap (1.3GB).

 

According to the wiki, I set:

 

  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

  set hive.optimize.bucketmapjoin = true;

  set hive.optimize.bucketmapjoin.sortedmerge = true;

 

And run the following SQL

 

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic
b 

on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

But it still created many hash tables and then threw a Java heap space error.

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com] 
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org


Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Amit

 

      Sorry for the delayed response, I had a terrible schedule. AFAIK, there are no flags that would let you move the hash table creation, compression, and loading into tmp files away from the client node.

      From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1GB and this small table is 5GB; it'd blow up your job if the map tasks tried to pull such a huge file into memory. Bucketed map join can help here: if the table is bucketed into, say, 100 buckets, then each bucket may have around 50MB of data, i.e. one tmp file would be just under 50MB. Here a mapper needs to load only the required buckets into memory and thus hardly runs into memory issues.

    Also, on the client, the records are processed bucket by bucket and loaded into tmp files. So if your bucket size is larger than the heap size specified for your client, it'd throw an out-of-memory error.
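
A rough back-of-envelope check that follows from this (the numbers are illustrative): bucket size ≈ table size / number of buckets, so the 5GB table above split into 100 buckets gives ~50MB per bucket, while only 10 buckets would give ~500MB per bucket. Since the deserialized hash table can occupy several times the on-disk bucket size, each bucket should be kept well below the child JVM heap.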

 

Regards

Bejoy KS

 

  _____  

From: Amit Sharma <amitsharma1708@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com> 
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

I am experiencing similar behavior in my queries. All the conditions for bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that instead of a single hash table, multiple hash tables are created. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.

Can I find an example anywhere that shows bucketed map join behavior where it does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit

On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:

Hi
    At first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for bucketed map join to work:
1) Both tables are bucketed on the join columns
2) The number of buckets in each table should be a multiple of the other's
3) Ensure that the tables have enough buckets

Note: If the data is large, say 1TB per table, and you have just a few buckets, say 100 buckets, each mapper may have to load 10GB or more. This would definitely blow your JVM. The bottom line is to ensure your mappers are not heavily loaded by the bucketed data distribution.

Regards
Bejoy.K.S

  _____  

From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org 
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

 

I have 2 tables, each with 6 million records, clustered into 10 buckets.

 

These tables are very simple, with 1 key column and 1 value column; all I want is to get the keys that exist in both tables but with different values.

 

The normal join did the trick, taking only 141 secs.

 

select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling =
b.calling) where  a.total_volume <> b.total_volume;

 

I tried to use bucket map join by setting:   set hive.optimize.bucketmapjoin = true

 

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic
b on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

2012-03-30 11:35:09     Starting to launch local task to process map join;      maximum memory
= 1398145024

2012-03-30 11:35:12     Processing rows:        200000  Hashtable size: 199999  Memory usage:
  86646704        rate:   0.062

2012-03-30 11:35:15     Processing rows:        300000  Hashtable size: 299999  Memory usage:
  128247464       rate:   0.092

2012-03-30 11:35:18     Processing rows:        400000  Hashtable size: 399999  Memory usage:
  174041744       rate:   0.124

2012-03-30 11:35:21     Processing rows:        500000  Hashtable size: 499999  Memory usage:
  214140840       rate:   0.153

2012-03-30 11:35:25     Processing rows:        600000  Hashtable size: 599999  Memory usage:
  255181504       rate:   0.183

2012-03-30 11:35:29     Processing rows:        700000  Hashtable size: 699999  Memory usage:
  296744320       rate:   0.212

2012-03-30 11:35:35     Processing rows:        800000  Hashtable size: 799999  Memory usage:
  342538616       rate:   0.245

2012-03-30 11:35:38     Processing rows:        900000  Hashtable size: 899999  Memory usage:
  384138552       rate:   0.275

2012-03-30 11:35:45     Processing rows:        1000000 Hashtable size: 999999  Memory usage:
  425719576       rate:   0.304

2012-03-30 11:35:50     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:
  467319576       rate:   0.334

2012-03-30 11:35:56     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:
  508940504       rate:   0.364

2012-03-30 11:36:04     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:
  550521128       rate:   0.394

2012-03-30 11:36:09     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:
  592121128       rate:   0.424

2012-03-30 11:36:15     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:
  633720336       rate:   0.453

2012-03-30 11:36:22     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:
  692097568       rate:   0.495

2012-03-30 11:36:33     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:
  725308944       rate:   0.519

2012-03-30 11:36:40     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:
  766946424       rate:   0.549

2012-03-30 11:36:48     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:
  808527928       rate:   0.578

2012-03-30 11:36:55     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:
  850127928       rate:   0.608

2012-03-30 11:37:08     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:
  891708856       rate:   0.638

2012-03-30 11:37:16     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:
  933308856       rate:   0.668

2012-03-30 11:37:25     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:
  974908856       rate:   0.697

2012-03-30 11:37:34     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:
  1016529448      rate:   0.727

2012-03-30 11:37:43     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:
  1058129496      rate:   0.757

2012-03-30 11:37:58     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:
  1099708832      rate:   0.787

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

 

My system has 4 PCs, each with an E2180 CPU, 2GB RAM, and an 80GB HDD. One of them hosts the NameNode, JobTracker, and Hive Server, and all of them host a DataNode and TaskTracker.

 

On all nodes, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).

 

I want to ask you experts: why does bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?

 

Best regards,

 

 

 

 





 

-- 
Nitin Pawar





 

-- 
Nitin Pawar

 

 

 

 

