From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org
Date: Wed, 11 Apr 2012 08:36:05 +0700
Subject: RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory
Hi Ladda,

Your case is pretty simple: when you give the tables aliases (a11, a12), you should use the alias in the MAPJOIN hint.

That means your SQL should look like:

select /*+ MAPJOIN(a11) */ a12.shipper_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;

Best regards
Nguyen Thanh Binh (Mr)

From: Ladda, Anand [mailto:lanand@microstrategy.com]
Sent: Wednesday, April 11, 2012 3:23 AM
To: user@hive.apache.org
Subject: RE: [Marketing Mail] Re: Why BucketJoinMap consume too much memory

Hi Bejoy/Binh

I have been following this thread to better understand where bucket map join would help, and it has been a great thread to follow. I have been struggling with this on my end as well.

I have two tables, one of which is about 22GB (orderdetailpartclust2) in size and the other 1.5GB (orderfactpartclust2), all partitions combined, and I wanted to see the impact of different kinds of joins on one of the partitions of these tables.

I created a partitioned (order_date) and bucketed (on order_id, the column I want to join these tables on) version of each table for this analysis. Data was loaded from their non-partitioned counterparts, with the following parameters set to ensure that the data makes it into the right partitions and is bucketed correctly by Hive:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
set hive.enforce.bucketing = true;
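For reference, the DDL and load roughly correspond to the sketch below (reconstructed from the DESCRIBE FORMATTED output further down; the name of the non-partitioned source table is a placeholder):

CREATE TABLE orderdetailpartclust2 (
  order_id INT, item_id INT, emp_id INT, promotion_id INT, customer_id INT,
  qty_sold FLOAT, unit_price FLOAT, unit_cost FLOAT, discount FLOAT
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 256 BUCKETS
STORED AS RCFILE;

-- order_date is selected last so it drives the dynamic partition
-- (the SET statements above must already be in effect).
INSERT OVERWRITE TABLE orderdetailpartclust2 PARTITION (order_date)
SELECT order_id, item_id, emp_id, promotion_id, customer_id,
       qty_sold, unit_price, unit_cost, discount, order_date
FROM orderdetail_nonpart;   -- placeholder name for the non-partitioned source table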
However, when I try the following join query, I don't get any bucketed map-side join:

select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;

Below are the relevant pieces of information on each of these tables. Can you please take a look and help me see what I might be missing to get map-side joins? Is it because my tables are also partitioned that this isn't working?

1. hive> describe formatted orderdetailpartclust2;
OK
# col_name              data_type               comment

order_id                int                     from deserializer
item_id                 int                     from deserializer
emp_id                  int                     from deserializer
promotion_id            int                     from deserializer
customer_id             int                     from deserializer
qty_sold                float                   from deserializer
unit_price              float                   from deserializer
unit_cost               float                   from deserializer
discount                float                   from deserializer

# Partition Information
# col_name              data_type               comment

order_date              string                  None

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Thu Apr 05 17:01:22 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2
Table Type:             MANAGED_TABLE
Table Parameters:
        SORTBUCKETCOLSPREFIX    TRUE
        numFiles                19200
        numPartitions           75
        numRows                 0
        totalSize               22814162038
        transient_lastDdlTime   1333725153

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [order_id]
Sort Columns:           [Order(col:order_id, order:1)]
Storage Desc Params:
        escape.delim            \\
        field.delim             \t
        serialization.format    \t
Time taken: 3.255 seconds

2. hive> describe formatted orderfactpartclust2;
OK
# col_name              data_type               comment

order_id                int                     from deserializer
emp_id                  int                     from deserializer
order_amt               float                   from deserializer
order_cost              float                   from deserializer
qty_sold                float                   from deserializer
freight                 float                   from deserializer
gross_dollar_sales      float                   from deserializer
ship_date               string                  from deserializer
rush_order              string                  from deserializer
customer_id             int                     from deserializer
pymt_type               int                     from deserializer
shipper_id              int                     from deserializer

# Partition Information
# col_name              data_type               comment

order_date              string                  None

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Thu Apr 05 18:09:28 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2
Table Type:             MANAGED_TABLE
Table Parameters:
        SORTBUCKETCOLSPREFIX    TRUE
        numFiles                7680
        numPartitions           30
        numRows                 0
        totalSize               1528946078
        transient_lastDdlTime   1333722539

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [order_id]
Sort Columns:           [Order(col:order_id, order:1)]
Storage Desc Params:
        escape.delim            \\
        field.delim             \t
        serialization.format    \t
Time taken: 1.737 seconds

3. -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2;
299867901   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=01-01-2008
.
.
.
311033139   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008

4. -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008;
Found 256 items
1213444     hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008/000000_0
.
.
.
1213166     hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_date=09-30-2008/000255_0
-bash-4.1$

5. -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2;
Found 30 items
50943109    hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-01
.
.
.
50902368    hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30

6. bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30;
Found 256 items
198692      hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30/000000_0
.
.
.
198954      hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=2008-09-30/000255_0

7. -bash-4.1$ cat hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
<!-- that are implied by Hadoop setup variables.                                                -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource).                                                                                  -->

<!-- Hive Execution Parameters -->

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- jdbc:derby:/hadoophome/metastore_db;create=true -->
  <value>jdbc:derby://hadoop010:1527/;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>hive.hwi.war.file</name>
  <value>/usr/lib/hive/lib/hive-hwi-0.7.0-cdh3u0.war</value>
  <description>This is the WAR file with the jsp content for Hive Web Interface</description>
</property>

</configuration>

8. Performing Join

hive> set hive.optimize.bucketmapjoin=true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive> select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202131643_1294, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1294
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1294
2012-04-10 16:15:06,663 Stage-1 map = 0%,  reduce = 0%
2012-04-10 16:15:08,671 Stage-1 map = 1%,  reduce = 0%
2012-04-10 16:15:09,675 Stage-1 map = 3%,  reduce = 0%
2012-04-10 16:15:10,679 Stage-1 map = 4%,  reduce = 0%
2012-04-10 16:15:11,683 Stage-1 map = 5%,  reduce = 0%
2012-04-10 16:15:12,688 Stage-1 map = 7%,  reduce = 0%
2012-04-10 16:15:13,692 Stage-1 map = 8%,  reduce = 0%
2012-04-10 16:15:14,697 Stage-1 map = 10%,  reduce = 0%
2012-04-10 16:15:15,756 Stage-1 map = 12%,  reduce = 0%
2012-04-10 16:15:16,761 Stage-1 map = 13%,  reduce = 0%
2012-04-10 16:15:17,767 Stage-1 map = 14%,  reduce = 0%
2012-04-10 16:15:18,773 Stage-1 map = 16%,  reduce = 0%
2012-04-10 16:15:19,778 Stage-1 map = 17%,  reduce = 1%
2012-04-10 16:15:20,784 Stage-1 map = 18%,  reduce = 1%
2012-04-10 16:15:21,789 Stage-1 map = 20%,  reduce = 1%
2012-04-10 16:15:22,795 Stage-1 map = 21%,  reduce = 5%
2012-04-10 16:15:23,800 Stage-1 map = 23%,  reduce = 5%
2012-04-10 16:15:24,805 Stage-1 map = 24%,  reduce = 5%
2012-04-10 16:15:25,936 Stage-1 map = 25%,  reduce = 8%
2012-04-10 16:15:26,941 Stage-1 map = 27%,  reduce = 8%
2012-04-10 16:15:27,947 Stage-1 map = 28%,  reduce = 8%
2012-04-10 16:15:28,951 Stage-1 map = 30%,  reduce = 8%
2012-04-10 16:15:29,956 Stage-1 map = 31%,  reduce = 8%
2012-04-10 16:15:30,981 Stage-1 map = 32%,  reduce = 8%
2012-04-10 16:15:31,987 Stage-1 map = 34%,  reduce = 8%
2012-04-10 16:15:32,992 Stage-1 map = 35%,  reduce = 10%
2012-04-10 16:15:33,998 Stage-1 map = 37%,  reduce = 10%
2012-04-10 16:15:35,003 Stage-1 map = 38%,  reduce = 10%
2012-04-10 16:15:36,055 Stage-1 map = 40%,  reduce = 10%
2012-04-10 16:15:37,097 Stage-1 map = 42%,  reduce = 10%
2012-04-10 16:15:38,102 Stage-1 map = 43%,  reduce = 10%
2012-04-10 16:15:39,108 Stage-1 map = 44%,  reduce = 10%
2012-04-10 16:15:40,113 Stage-1 map = 46%,  reduce = 10%
2012-04-10 16:15:41,123 Stage-1 map = 47%,  reduce = 10%
2012-04-10 16:15:42,128 Stage-1 map = 49%,  reduce = 15%
2012-04-10 16:15:43,134 Stage-1 map = 50%,  reduce = 15%
2012-04-10 16:15:44,139 Stage-1 map = 53%,  reduce = 15%
2012-04-10 16:15:46,152 Stage-1 map = 54%,  reduce = 15%
2012-04-10 16:15:47,158 Stage-1 map = 57%,  reduce = 15%
2012-04-10 16:15:48,164 Stage-1 map = 58%,  reduce = 15%
2012-04-10 16:15:49,171 Stage-1 map = 60%,  reduce = 15%
2012-04-10 16:15:50,176 Stage-1 map = 61%,  reduce = 15%
2012-04-10 16:15:51,182 Stage-1 map = 63%,  reduce = 19%
2012-04-10 16:15:52,199 Stage-1 map = 65%,  reduce = 19%
2012-04-10 16:15:53,222 Stage-1 map = 66%,  reduce = 19%
2012-04-10 16:15:54,228 Stage-1 map = 68%,  reduce = 19%
2012-04-10 16:15:55,234 Stage-1 map = 70%,  reduce = 19%
2012-04-10 16:15:56,241 Stage-1 map = 71%,  reduce = 19%
2012-04-10 16:15:57,248 Stage-1 map = 73%,  reduce = 21%
2012-04-10 16:15:58,253 Stage-1 map = 75%,  reduce = 21%
2012-04-10 16:15:59,260 Stage-1 map = 76%,  reduce = 21%
2012-04-10 16:16:00,267 Stage-1 map = 79%,  reduce = 21%
2012-04-10 16:16:01,273 Stage-1 map = 80%,  reduce = 21%
2012-04-10 16:16:02,280 Stage-1 map = 81%,  reduce = 21%
2012-04-10 16:16:03,287 Stage-1 map = 83%,  reduce = 27%
2012-04-10 16:16:04,294 Stage-1 map = 84%,  reduce = 27%
2012-04-10 16:16:05,302 Stage-1 map = 86%,  reduce = 27%
2012-04-10 16:16:06,310 Stage-1 map = 87%,  reduce = 27%
2012-04-10 16:16:07,317 Stage-1 map = 90%,  reduce = 27%
2012-04-10 16:16:08,325 Stage-1 map = 91%,  reduce = 27%
2012-04-10 16:16:09,332 Stage-1 map = 92%,  reduce = 27%
2012-04-10 16:16:10,339 Stage-1 map = 94%,  reduce = 27%
2012-04-10 16:16:11,348 Stage-1 map = 95%,  reduce = 27%
2012-04-10 16:16:12,355 Stage-1 map = 97%,  reduce = 29%
2012-04-10 16:16:13,362 Stage-1 map = 99%,  reduce = 29%
2012-04-10 16:16:14,370 Stage-1 map = 100%,  reduce = 29%
2012-04-10 16:16:18,396 Stage-1 map = 100%,  reduce = 32%
2012-04-10 16:16:24,654 Stage-1 map = 100%,  reduce = 67%
2012-04-10 16:16:27,683 Stage-1 map = 100%,  reduce = 70%
2012-04-10 16:16:30,701 Stage-1 map = 100%,  reduce = 73%
2012-04-10 16:16:33,719 Stage-1 map = 100%,  reduce = 77%
2012-04-10 16:16:36,739 Stage-1 map = 100%,  reduce = 80%
2012-04-10 16:16:39,781 Stage-1 map = 100%,  reduce = 84%
2012-04-10 16:16:42,806 Stage-1 map = 100%,  reduce = 88%
2012-04-10 16:16:45,824 Stage-1 map = 100%,  reduce = 92%
2012-04-10 16:16:48,840 Stage-1 map = 100%,  reduce = 97%
2012-04-10 16:16:50,854 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202131643_1294
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202131643_1295, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1295
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1295
2012-04-10 16:16:56,693 Stage-2 map = 0%,  reduce = 0%
2012-04-10 16:17:02,716 Stage-2 map = 100%,  reduce = 0%
2012-04-10 16:17:12,759 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_201202131643_1295
OK
1       678832   67850    678832.0
2       1360529  135253   1360529.0
3       4784635  460994   4784635.0
Time taken: 131.748 seconds
hive>

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Tuesday, April 10, 2012 11:44 AM
To: user@hive.apache.org
Subject: [Marketing Mail] Re: Why BucketJoinMap consume too much memory

Hi Binh
You are right; here both of your tables are about the same size. Loading 2GB of data into hash tables and then into temp files and so on does take some time. That time becomes negligible if, say, one table were 2GB and the other 2TB; then you would notice the wide difference in performance between a common join and a bucketed map join.
If one of the tables is very small, a plain map join is good; if it is of moderate size, then a bucketed map join.

Regards
Bejoy KS

_____

From: binhnt22
To: user@hive.apache.org
Cc: 'Bejoy Ks'
Sent: Tuesday, April 10, 2012 8:10 AM
Subject: RE: Why BucketJoinMap consume too much memory

Hi Bejoy,

It worked like a charm. Thank you very much. I really, really appreciate your help.

This bucket join should be used with one big table and one small table.

If both tables are big, the join time is much longer than a normal join.

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Monday, April 09, 2012 9:49 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Hi Binh
It is just an issue with the number of buckets. Your tables effectively have just 8 buckets, as only 8 files are seen in the storage directory. You might have just issued an ALTER TABLE script on an existing, already-loaded table. The work-around here is:

1) Wipe and reload the tables with hive.enforce.bucketing=true; ensure your storage directory has as many files as the number of buckets. As per your table DDL you should see 256 files.

2) Enable hive.optimize.bucketmapjoin=true; and try doing the join again.

It should definitely work.

Regards
Bejoy KS
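For instance, the reload could look roughly like the sketch below (column list as in the DESCRIBE FORMATTED output that follows; the staging table name is a placeholder):

-- Recreate the table so the bucket metadata and the data agree, then reload
-- with bucketing enforced so Hive actually writes 256 bucket files.
DROP TABLE ra_md_syn;

CREATE TABLE ra_md_syn (
  calling        STRING,
  total_duration BIGINT,
  total_volume   BIGINT,
  total_charge   BIGINT
)
CLUSTERED BY (calling) INTO 256 BUCKETS;

SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE ra_md_syn
SELECT calling, total_duration, total_volume, total_charge
FROM ra_md_syn_staging;   -- placeholder for wherever the raw rows currently live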
_____

From: binhnt22
To: user@hive.apache.org
Cc: 'Bejoy Ks'
Sent: Monday, April 9, 2012 8:42 AM
Subject: RE: Why BucketJoinMap consume too much memory

Hi Bejoy,
Thank you for helping me. Here is the information:

1. Describe Formatted ra_md_syn;
# col_name              data_type               comment

calling                 string                  None
total_duration          bigint                  None
total_volume            bigint                  None
total_charge            bigint                  None

# Detailed Table Information
Database:               default
Owner:                  hduser
CreateTime:             Thu Apr 05 09:48:29 ICT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://master:54310/user/hive/warehouse/ra_md_syn
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                8
        numPartitions           0
        numRows                 0
        rawDataSize             0
        totalSize               1872165483
        transient_lastDdlTime   1333595095

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [calling]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1

2. Describe Formatted ra_ocs_syn;
# col_name              data_type               comment

calling                 string                  None
total_duration          bigint                  None
total_volume            bigint                  None
total_charge            bigint                  None

# Detailed Table Information
Database:               default
Owner:                  hduser
CreateTime:             Thu Apr 05 09:48:24 ICT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://master:54310/user/hive/warehouse/ra_ocs_syn
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                8
        numPartitions           0
        numRows                 0
        rawDataSize             0
        totalSize               1872162225
        transient_lastDdlTime   1333595512

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [calling]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1

3. hadoop fs -du
[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn
Found 8 items
280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1
280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0
274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1
274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1
269949415   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0
262439067   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0
205767721   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0
24516526    hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0

4. hadoop fs -du
[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn
Found 8 items
314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0
314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0
304959363   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0
274374381   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1
264694474   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0
257498693   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1
100334604   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0
41022170    hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0
5. hive-site.xml
[hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml

hive.metastore.local                     true
javax.jdo.option.ConnectionURL           jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName    com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName      hadoop
javax.jdo.option.ConnectionPassword      hadoop
hive.metastore.sasl.enabled              true
    If true, the metastore thrift interface will be secured with SASL. Clients must authenticate with Kerberos.
hive.metastore.kerberos.keytab.file
    The path to the Kerberos Keytab file containing the metastore thrift server's service principal.
hive.metastore.kerberos.principal        hduser/admin@EXAMPLE.COM
    The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.

6. Performing JOIN
hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
    > on (a.calling = b.calling) where a.total_volume <> b.total_volume;
Total MapReduce jobs = 1
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/hduser/hduser_20120409013131_da40d787-597d-490c-8558-9d10ec11e916.log
2012-04-09 01:31:24  Starting to launch local task to process map join;  maximum memory = 1398145024
2012-04-09 01:31:28  Processing rows: 200000   Hashtable size: 199999   Memory usage: 75405504    rate: 0.054
2012-04-09 01:31:29  Processing rows: 300000   Hashtable size: 299999   Memory usage: 111540296   rate: 0.08
2012-04-09 01:31:32  Processing rows: 400000   Hashtable size: 399999   Memory usage: 151640080   rate: 0.108
2012-04-09 01:31:35  Processing rows: 500000   Hashtable size: 499999   Memory usage: 185503416   rate: 0.133
2012-04-09 01:31:37  Processing rows: 600000   Hashtable size: 599999   Memory usage: 221503440   rate: 0.158
2012-04-09 01:31:42  Processing rows: 700000   Hashtable size: 699999   Memory usage: 257484264   rate: 0.184
2012-04-09 01:31:47  Processing rows: 800000   Hashtable size: 799999   Memory usage: 297678568   rate: 0.213
2012-04-09 01:31:52  Processing rows: 900000   Hashtable size: 899999   Memory usage: 333678592   rate: 0.239
2012-04-09 01:31:57  Processing rows: 1000000  Hashtable size: 999999   Memory usage: 369678568   rate: 0.264
2012-04-09 01:32:03  Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 405678568   rate: 0.29
2012-04-09 01:32:09  Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 441678592   rate: 0.316
2012-04-09 01:32:15  Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 477678568   rate: 0.342
2012-04-09 01:32:23  Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 513678592   rate: 0.367
2012-04-09 01:32:29  Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 549678568   rate: 0.393
2012-04-09 01:32:35  Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 602455824   rate: 0.431
2012-04-09 01:32:45  Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 630067176   rate: 0.451
2012-04-09 01:32:53  Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 666067176   rate: 0.476
2012-04-09 01:33:01  Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 702067200   rate: 0.502
2012-04-09 01:33:09  Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 738067176   rate: 0.528
2012-04-09 01:33:20  Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 774254456   rate: 0.554
2012-04-09 01:33:29  Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 810067176   rate: 0.579
2012-04-09 01:33:38  Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 846568480   rate: 0.605
2012-04-09 01:33:49  Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 882096752   rate: 0.631
2012-04-09 01:33:59  Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 918821920   rate: 0.657
2012-04-09 01:34:15  Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 954134920   rate: 0.682
2012-04-09 01:34:26  Processing rows: 2700000  Hashtable size: 2699999  Memory usage: 990067168   rate: 0.708
2012-04-09 01:34:38  Processing rows: 2800000  Hashtable size: 2799999  Memory usage: 1027113288  rate: 0.735
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:597)
        at java.lang.StringBuilder.append(StringBuilder.java:212)
        at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:247)
        at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:232)
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:315)
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:310)
        at java.util.jar.Manifest.read(Manifest.java:178)
        at java.util.jar.Manifest.<init>(Manifest.java:52)
        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:167)
        at java.util.jar.JarFile.getManifest(JarFile.java:148)
        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:696)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:228)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at org.apache.hadoop.util.RunJar$1.run(RunJar.java:126)
Execution failed with exit status: 2
Obtaining error information

Task failed!
Task ID:
  Stage-3

Logs:

/tmp/hduser/hive.log
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

7. /tmp/hduser/hive.log
2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
2012-04-09 02:00:12,796 WARN  parse.SemanticAnalyzer (SemanticAnalyzer.java:genBodyPlan(5821)) - Common Gby keys:null
2012-04-09 02:09:02,356 ERROR exec.Task (SessionState.java:printError(380)) - Execution failed with exit status: 2
2012-04-09 02:09:02,357 ERROR exec.Task (SessionState.java:printError(380)) - Obtaining error information
2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) - Task failed!
Task ID:
  Stage-3

Logs:

2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) - /tmp/hduser/hive.log
2012-04-09 02:09:02,359 ERROR exec.MapredLocalTask (MapredLocalTask.java:execute(228)) - Execution failed with exit status: 2
2012-04-09 02:09:02,377 ERROR ql.Driver (SessionState.java:printError(380)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Friday, April 06, 2012 11:33 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Hi Binh,
From the information you provided, a bucketed map join should be possible. I'm clueless now, but I can make one more try if you could provide the output of the following:

1) Describe Formatted ra_md_syn;
2) Describe Formatted ra_ocs_syn;

3) hadoop fs -du
4) hadoop fs -du

5) Perform the join and paste the full console log along with the query (with all the properties set at the CLI).

6) Your hive-site.xml

@Alex
You can use non-equality conditions in the where clause; only the ON conditions have to be equality ones.

Regards
Bejoy KS

_____

From: gemini alex
To: user@hive.apache.org
Cc: Bejoy Ks
Sent: Friday, April 6, 2012 12:36 PM
Subject: Re: Why BucketJoinMap consume too much memory

I guess the problem is that you can't use the <> predicate in a bucket join. Try:

select c.* from (
  select /*+ MAPJOIN(b) */ a.calling calling, a.total_volume atotal_volume, b.total_volume btotal_volume
  from ra_md_syn a join ra_ocs_syn b on (a.calling = b.calling)
) c
where c.atotal_volume <> c.btotal_volume;

On 6 April 2012, at 9:19 AM, binhnt22 wrote:

Hi Bejoy,

Sorry for the late response. I will demonstrate everything again from the start to make the information clear.

I have 2 tables that are nearly identical. Both have the same table structure, 65m records, and 2GB size.

hive> describe ra_md_syn;
OK
calling             string
total_duration      bigint
total_volume        bigint
total_charge        bigint

Both of them were bucketized into 256 buckets on the 'calling' column (last time it was only 10 buckets; I increased it as you suggested).
And I want to find all 'calling' values that exist in both tables but with different 'total_volume'. The script, as you know:

hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
    > on (a.calling = b.calling) where a.total_volume <> b.total_volume;

And the result was exactly as in my last email: a Java heap space error. With a total size of only 2GB and 256 buckets, I think bucket size cannot be the issue here.

Please give me some advice; I would really appreciate it.
Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Thursday, April 05, 2012 7:23 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Hi Binh

I was just checking your local map join log, and I noticed two things:
- the memory usage of one hash table has gone beyond 1G;
- the number of rows processed is just 2M.

It is possible that each bucket is itself too large to be loaded in memory.

As a work-around, or to nail down whether bucket size is the issue here, can you try increasing the number of buckets to 100 and doing a bucketed map join again?

Also, you mentioned the data size is 2GB; is that the compressed data size?

2012-04-05 10:41:07  Processing rows: 2,900,000  Hashtable size: 2899999  Memory usage: 1,062,065,576  rate: 0.76

Regards
Bejoy KS

_____

From: Nitin Pawar
To: user@hive.apache.org
Sent: Thursday, April 5, 2012 5:03 PM
Subject: Re: Why BucketJoinMap consume too much memory

Can you tell me the size of table b?

If you are doing bucketing and the b table is still huge, then it will run into this problem.

On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 wrote:
Thank Nitin,

I tried but no luck. Here's the hive log; please spend a little time to look at it.

hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.enforce.bucketing=true;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b
    > on (a.calling = b.calling) where a.total_volume <> b.total_volume;
Total MapReduce jobs = 1
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log
2012-04-05 10:37:45  Starting to launch local task to process map join;  maximum memory = 1398145024
2012-04-05 10:37:48  Processing rows: 200000   Hashtable size: 199999   Memory usage: 75403880    rate: 0.054
2012-04-05 10:37:50  Processing rows: 300000   Hashtable size: 299999   Memory usage: 111404664   rate: 0.08
2012-04-05 10:37:54  Processing rows: 400000   Hashtable size: 399999   Memory usage: 151598960   rate: 0.108
2012-04-05 10:38:04  Processing rows: 500000   Hashtable size: 499999   Memory usage: 185483368   rate: 0.133
2012-04-05 10:38:09  Processing rows: 600000   Hashtable size: 599999   Memory usage: 221483392   rate: 0.158
2012-04-05 10:38:13  Processing rows: 700000   Hashtable size: 699999   Memory usage: 257482640   rate: 0.184
2012-04-05 10:38:19  Processing rows: 800000   Hashtable size: 799999   Memory usage: 297676944   rate: 0.213
2012-04-05 10:38:22  Processing rows: 900000   Hashtable size: 899999   Memory usage: 333676968   rate: 0.239
2012-04-05 10:38:27  Processing rows: 1000000  Hashtable size: 999999   Memory usage: 369676944   rate: 0.264
2012-04-05 10:38:31  Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 405676968   rate: 0.29
2012-04-05 10:38:36  Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 441676944   rate: 0.316
2012-04-05 10:38:42  Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 477676944   rate: 0.342
2012-04-05 10:38:47  Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 513676968   rate: 0.367
2012-04-05 10:38:52  Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 549676944   rate: 0.393
2012-04-05 10:39:00  Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 602454200   rate: 0.431
2012-04-05 10:39:08  Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 630065552   rate: 0.451
2012-04-05 10:39:14  Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 666065552   rate: 0.476
2012-04-05 10:39:20  Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 702065552   rate: 0.502
2012-04-05 10:39:26  Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 738065576   rate: 0.528
2012-04-05 10:39:36  Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 774065552   rate: 0.554
2012-04-05 10:39:43  Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 810065552   rate: 0.579
2012-04-05 10:39:51  Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 846065576   rate: 0.605
2012-04-05 10:40:16  Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 882085136   rate: 0.631
2012-04-05 10:40:24  Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 918085208   rate: 0.657
2012-04-05 10:40:39  Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 954065544   rate: 0.682
2012-04-05 10:40:48  Processing rows: 2700000  Hashtable size: 2699999  Memory usage: 990065568   rate: 0.708
2012-04-05 10:40:56  Processing rows: 2800000  Hashtable size: 2799999  Memory usage: 1026065552  rate: 0.734
2012-04-05 10:41:07  Processing rows: 2900000  Hashtable size: 2899999  Memory usage: 1062065576  rate: 0.76
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

From: Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Thursday, April 05, 2012 5:36 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Can you try adding these settings?
set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

I have tried bucketing with 1000 buckets and with tables of more than 1TB; they do go through fine.

On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 wrote:
Hi Bejoy,

Both my tables have 65m records (~1.8-1.9GB on hadoop) and are bucketized on the 'calling' column into 10 buckets.

As you said, Hive will load only 1 bucket, ~180-190MB, into memory. That should hardly blow the heap (1.3GB).

According to the wiki, I set:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

And ran the following SQL:

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b
on (a.calling = b.calling) where a.total_volume <> b.total_volume;

But it still created many hash tables and then threw a Java heap space error.

Best regards
Nguyen Thanh Binh (Mr)
Cell phone: (+84)98.226.0622

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

Hi Amit

Sorry for the delayed response; I had a terrible schedule. AFAIK, there are no flags that would move the hash table creation, compression, and loading into tmp files away from the client node.
From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1GB and this small table is 5GB: trying to pull such a huge file into memory would blow up the map tasks. A bucketed map join can help here: if the table is bucketed, say into 100 buckets, each bucket may hold around 50MB of data, i.e. one tmp file would be just less than 50MB, and a mapper only needs to load the required buckets into memory, so it hardly runs into memory issues.
Also, on the client, the records are processed bucket by bucket and loaded into tmp files. So if your bucket size is larger than the heap size specified for your client, it will throw an out-of-memory error.

Regards
Bejoy KS
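Putting rough numbers on that reasoning (the figures are the illustrative ones above, not measurements from this thread):

-- Plain map join: the whole small table becomes one client-side hash table.
--   small table ~5 GB  vs  child JVM heap ~1 GB  ->  cannot fit, the task blows up.
-- Bucketed map join: only one bucket is hashed at a time.
--   5 GB / 100 buckets = ~50 MB per bucket  ->  fits comfortably in a 1 GB heap.
-- Rule of thumb: choose the bucket count so that (table size / number of buckets)
-- stays well below the task and client heap sizes.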
_____

From: Amit Sharma
To: user@hive.apache.org; Bejoy Ks
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory

I am experiencing similar behavior in my queries. All the conditions for a bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that instead of a single hash table, multiple hash tables are created. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.

Can I find an example anywhere which shows bucketed map join behavior where it does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit

On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks wrote:
Hi
On a first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work:
1) Both tables are bucketed on the join columns.
2) The number of buckets in each table should be a multiple of the other's.
3) Ensure that the tables have enough buckets.

Note: If the data is large, say 1TB per table, and you have just a few buckets, say 100, each mapper may have to load 10GB or more. This would definitely blow up your JVM. The bottom line is to ensure your mappers are not heavily loaded by the bucketed data distribution.

Regards
Bejoy.K.S
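A minimal sketch of a pair of tables and settings that satisfies these conditions (table names, columns and bucket counts are illustrative only):

-- Both tables bucketed on the join column; 256 and 64 are multiples of each
-- other, so a bucketed map join can pair the buckets up.
CREATE TABLE big_tbl   (calling STRING, total_volume BIGINT)
CLUSTERED BY (calling) INTO 256 BUCKETS;

CREATE TABLE small_tbl (calling STRING, total_volume BIGINT)
CLUSTERED BY (calling) INTO 64 BUCKETS;

SET hive.enforce.bucketing=true;       -- while loading both tables
SET hive.optimize.bucketmapjoin=true;  -- while running the join

SELECT /*+ MAPJOIN(small_tbl) */ *
FROM big_tbl b JOIN small_tbl s ON (b.calling = s.calling);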
_____

From: binhnt22
To: user@hive.apache.org
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

I have 2 tables, each with 6 million records, clustered into 10 buckets.

These tables are very simple, with 1 key column and 1 value column. All I want is to get the keys that exist in both tables but with different values.

A normal join did the trick and took only 141 secs:

select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume;

I tried to use a bucket map join by setting: set hive.optimize.bucketmapjoin = true

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume;

2012-03-30 11:35:09  Starting to launch local task to process map join;  maximum memory = 1398145024
2012-03-30 11:35:12  Processing rows: 200000   Hashtable size: 199999   Memory usage: 86646704    rate: 0.062
2012-03-30 11:35:15  Processing rows: 300000   Hashtable size: 299999   Memory usage: 128247464   rate: 0.092
2012-03-30 11:35:18  Processing rows: 400000   Hashtable size: 399999   Memory usage: 174041744   rate: 0.124
2012-03-30 11:35:21  Processing rows: 500000   Hashtable size: 499999   Memory usage: 214140840   rate: 0.153
2012-03-30 11:35:25  Processing rows: 600000   Hashtable size: 599999   Memory usage: 255181504   rate: 0.183
2012-03-30 11:35:29  Processing rows: 700000   Hashtable size: 699999   Memory usage: 296744320   rate: 0.212
2012-03-30 11:35:35  Processing rows: 800000   Hashtable size: 799999   Memory usage: 342538616   rate: 0.245
2012-03-30 11:35:38  Processing rows: 900000   Hashtable size: 899999   Memory usage: 384138552   rate: 0.275
2012-03-30 11:35:45  Processing rows: 1000000  Hashtable size: 999999   Memory usage: 425719576   rate: 0.304
2012-03-30 11:35:50  Processing rows: 1100000  Hashtable size: 1099999  Memory usage: 467319576   rate: 0.334
2012-03-30 11:35:56  Processing rows: 1200000  Hashtable size: 1199999  Memory usage: 508940504   rate: 0.364
2012-03-30 11:36:04  Processing rows: 1300000  Hashtable size: 1299999  Memory usage: 550521128   rate: 0.394
2012-03-30 11:36:09  Processing rows: 1400000  Hashtable size: 1399999  Memory usage: 592121128   rate: 0.424
2012-03-30 11:36:15  Processing rows: 1500000  Hashtable size: 1499999  Memory usage: 633720336   rate: 0.453
2012-03-30 11:36:22  Processing rows: 1600000  Hashtable size: 1599999  Memory usage: 692097568   rate: 0.495
2012-03-30 11:36:33  Processing rows: 1700000  Hashtable size: 1699999  Memory usage: 725308944   rate: 0.519
2012-03-30 11:36:40  Processing rows: 1800000  Hashtable size: 1799999  Memory usage: 766946424   rate: 0.549
2012-03-30 11:36:48  Processing rows: 1900000  Hashtable size: 1899999  Memory usage: 808527928   rate: 0.578
2012-03-30 11:36:55  Processing rows: 2000000  Hashtable size: 1999999  Memory usage: 850127928   rate: 0.608
2012-03-30 11:37:08  Processing rows: 2100000  Hashtable size: 2099999  Memory usage: 891708856   rate: 0.638
2012-03-30 11:37:16  Processing rows: 2200000  Hashtable size: 2199999  Memory usage: 933308856   rate: 0.668
2012-03-30 11:37:25  Processing rows: 2300000  Hashtable size: 2299999  Memory usage: 974908856   rate: 0.697
2012-03-30 11:37:34  Processing rows: 2400000  Hashtable size: 2399999  Memory usage: 1016529448  rate: 0.727
2012-03-30 11:37:43  Processing rows: 2500000  Hashtable size: 2499999  Memory usage: 1058129496  rate: 0.757
2012-03-30 11:37:58  Processing rows: 2600000  Hashtable size: 2599999  Memory usage: 1099708832  rate: 0.787
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

My system has 4 PCs, each with a CPU E2180, 2GB RAM and an 80GB HDD. One of them hosts the NameNode, JobTracker and Hive Server, and all of them host a DataNode and TaskTracker.

On every node I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).

I want to ask you experts: why does bucket map join consume so much memory? Am I doing something wrong, or is my configuration bad?

Best regards,

--
Nitin Pawar

Hi = Ladda,

 

Your = case is pretty simple, when you make table alias (a11, a12), you should use it = in the hint MAPJOIN

 

That=E2=80=99s mean your sql should be look like:

 

select /*+ MAPJOIN(a11) */ a12.shipper_id, count(1), count (distinct a11.customer_id), = sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on = (a11.order_id =3D a12.order_id) where (a11.order_date =3D '09-30-2008' and a12.order_date = =3D '2008-09-30') group by a12.shipper_id;

 

Best = regards

Nguyen Thanh Binh (Mr)

 

From:= Ladda, = Anand [mailto:lanand@microstrategy.com]
Sent: Wednesday, April 11, 2012 3:23 AM
To: user@hive.apache.org
Subject: RE: [Marketing Mail] Re: Why BucketJoinMap consume too = much memory

 

Hi Bejoy/Binh

Been following this thread to better understand where = bucket map join would help and it=E2=80=99s been a great thread to follow. I have = struggling with this on my end as well.

 

I have two tables one of which is about 22GB (orderdetailpartclust2) in size and the other is 1.5GB = (orderfactpartclust2) in size (all partitions combined) and I wanted to see the impact of = different kind of joins on one of the partitions of these table . =

 

I created a partitioned (order_date) and bucketed (on = order_id, on which I want to join these tables) version for these tables for this analysis. Data was loaded from their non-partitioned counterparts and = setting the following parameters to ensure that data makes it into the right = partitions and is bucketed correctly by Hive

 

set = hive.exec.dynamic.partition.mode=3Dnonstrict;

set = hive.exec.dynamic.partition=3Dtrue;

SET = hive.exec.max.dynamic.partitions=3D100000;

SET = hive.exec.max.dynamic.partitions.pernode=3D100000;

set hive.enforce.bucketing =3D = true;

 

However when I try to do the following join query, I = don=E2=80=99t get any bucketed map side join

 

select /*+ MAPJOIN(orderfactpartclust2) */ = a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = =3D a12.order_id) where (a11.order_date =3D '09-30-2008' and a12.order_date = =3D '2008-09-30') group by a12.shipper_id;

 

Below are the relevant pieces of information on each of = these tables. Can you please help take a look to see what I might be missing = to get map side joins. Is it because my tables are also partitioned that this = isn=E2=80=99t working?

 

 

1.       hive> describe formatted orderdetailpartclust2;

OK

# col_name           = ;   data_type          &nbs= p;    comment

 

order_id  &nb= sp;           &nbs= p; int           &nbs= p;         from deserializer

item_id  &nbs= p;            = ;  int           &nbs= p;         from deserializer

emp_id   = ;            =    int           &nbs= p;         from deserializer

promotion_id  = ;          int              &= nbsp;      from deserializer

customer_id  =            int           &nbs= p;         from deserializer

qty_sold  &nb= sp;           &nbs= p; float           &n= bsp;       from deserializer

unit_price  &= nbsp;           float           &n= bsp;       from deserializer

unit_cost  &n= bsp;            float        &n= bsp;          from deserializer

discount  &nb= sp;           &nbs= p; float           &n= bsp;       from deserializer

 

# Partition = Information

# col_name           = ;   data_type          &nbs= p;    comment

 

order_date  &= nbsp;           string           &= nbsp;      None

 

# Detailed Table = Information

Database:  &n= bsp;            default

Owner:   = ;            =    hdfs

CreateTime:  =            Thu Apr 05 17:01:22 EDT 2012

LastAccessTime: &n= bsp;       UNKNOWN

Protect Mode:           = None

Retention:  &= nbsp;           0

Location:  &n= bsp;            hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2

Table Type:           &n= bsp; MANAGED_TABLE

Table = Parameters:

    = ;    SORTBUCKETCOLSPREFIX    TRUE

    = ;    numFiles           = ;     19200

    = ;    numPartitions          = 75

        numRows     =             0

    = ;    totalSize          &nbs= p;    22814162038

    = ;    transient_lastDdlTime   1333725153

 

# Storage = Information

SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe

InputFormat:  = ;          org.apache.hadoop.hive.ql.io.RCFileInputFormat

OutputFormat: &nbs= p;         org.apache.hadoop.hive.ql.io.RCFileOutputFormat

Compressed:  =            No

Num Buckets:           = ; 256

Bucket Columns:         = [order_id]

Sort Columns:           [Order(col:order_id, order:1)]

Storage Desc = Params:

    = ;    escape.delim          &= nbsp; \\

    = ;    field.delim          &n= bsp;  \t

    = ;    serialization.format    \t

Time taken: 3.255 = seconds

2.       hive> describe formatted orderfactpartclust2;

OK

# col_name           = ;   data_type          &nbs= p;    comment

 

order_id  &nb= sp;           &nbs= p; int           &nbs= p;         from deserializer

emp_id   = ;            =    int           &nbs= p;         from deserializer

order_amt  &n= bsp;            float           &n= bsp;       from deserializer

order_cost  &= nbsp;           float           &n= bsp;       from deserializer

qty_sold  &nb= sp;           &nbs= p; float           &n= bsp;       from deserializer

freight  &nbs= p;            = ;  float           &n= bsp;       from deserializer

gross_dollar_sales = ;     float           &n= bsp;       from deserializer

ship_date  &n= bsp;            string           &= nbsp;      from deserializer

rush_order  &= nbsp;           string           &= nbsp;      from deserializer

customer_id  =            int           &nbs= p;         from deserializer

pymt_type  &n= bsp;            int           &nbs= p;         from deserializer

shipper_id  &= nbsp;           int           &nbs= p;         from deserializer

 

# Partition = Information

# col_name           = ;   data_type          &nbs= p;    comment

 

order_date  &= nbsp;           string           &= nbsp;      None

 

# Detailed Table = Information

Database:  &n= bsp;            default

Owner:   = ;            =    hdfs

CreateTime:  =            Thu Apr 05 18:09:28 EDT 2012

LastAccessTime: &n= bsp;       UNKNOWN

Protect Mode:           = None

Retention:  &= nbsp;           0

Location:  &n= bsp;            hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2<= /span>

Table Type:           &n= bsp; MANAGED_TABLE

Table = Parameters:

    = ;    SORTBUCKETCOLSPREFIX    TRUE

    = ;    numFiles           = ;     7680

    = ;    numPartitions          = 30

    = ;    numRows           =       0

    = ;    totalSize          &nbs= p;    1528946078

    = ;    transient_lastDdlTime   1333722539

 

# Storage = Information

SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe

InputFormat:  = ;          org.apache.hadoop.hive.ql.io.RCFileInputFormat

OutputFormat: &nbs= p;         org.apache.hadoop.hive.ql.io.RCFileOutputFormat

Compressed:  =            No

Num Buckets:           = ; 256

Bucket Columns:         = [order_id]

Sort Columns:           [Order(col:order_id, order:1)]

Storage Desc = Params:

    = ;    escape.delim          &= nbsp; \\

    = ;    field.delim          &n= bsp;  \t

    = ;    serialization.format    \t

Time taken: 1.737 = seconds

 

3.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2;

299867901   hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_dat= e=3D01-01-2008

.

=

.

=

.

=

311033139   = hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_dat= e=3D09-30-2008

4.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderdetailpartclust2/order_date=3D09-30-2008;<= /o:p>

Found 256 = items

1213444  &nbs= p;  hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_dat= e=3D09-30-2008/000000_0

.

=

.

=

.

=

1213166  &nbs= p;  hdfs://hadoop001:6931/user/hive/warehouse/orderdetailpartclust2/order_dat= e=3D09-30-2008/000255_0

-bash-4.1$

5.       -bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2;

Found 30 = items

50943109  &nb= sp; hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=3D= 2008-09-01

.

=

.

=

.

=

50902368  &nb= sp; hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=3D= 2008-09-30

6.       bash-4.1$ hadoop fs -du /user/hive/warehouse/orderfactpartclust2/order_date=3D2008-09-30;

Found 256 = items

198692   = ;   hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=3D= 2008-09-30/000000_0

.

=

.

=

.

=

198954   = ;   hdfs://hadoop001:6931/user/hive/warehouse/orderfactpartclust2/order_date=3D= 2008-09-30/000255_0

 

7.       -bash-4.1$ cat = hive-site.xml

<?xml version=3D"1.0"?>

<?xml-stylesheet type=3D"text/xsl" = href=3D"configuration.xsl"?>

 

<configuration>

 

<!-- Hive = Configuration can either be stored in this file or in the hadoop configuration = files  -->

<!-- that are = implied by Hadoop setup variables.          &nb= sp;           &nbs= p;            = ;            = -->

<!-- Aside from = Hadoop setup variables - this file is provided as a convenience so that Hive    -->

<!-- users do not = have to edit hadoop configuration files (that may be managed as a centralized = -->

<!-- resource).          &nb= sp;           &nbs= p;            = ;            =             &= nbsp;           &n= bsp;         -->

 

<!-- Hive Execution Parameters -->

 

<property>

  <name>javax.jdo.option.ConnectionURL</name>=

  <!-- jdbc:derby:/hadoophome/metastore_db;create=3Dtrue = -->

  <value>jdbc:derby://hadoop010:1527/;databaseName=3Dmetastore_db;cre= ate=3Dtrue</value>

  <description>JDBC connect string for a JDBC = metastore</description>

</property><= /o:p>

 

<property>

  <name>javax.jdo.option.ConnectionDriverName</name>=

  = <value>org.apache.derby.jdbc.EmbeddedDriver</value>

  <description>Driver class name for a JDBC = metastore</description>

</property><= /o:p>

 

<property>

  <name>hive.hwi.war.file</name>

  <value>/usr/lib/hive/lib/hive-hwi-0.7.0-cdh3u0.war</value>

  = <description>This is the WAR file with the jsp content for Hive Web = Interface</description>

</property><= /o:p>

 

</configuration><= o:p>

 

8.       Performing = Join

 

hive> set hive.optimize.bucketmapjoin=true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

hive> select /*+ MAPJOIN(orderfactpartclust2) */ a12.shipper_id, count(1), count (distinct a11.customer_id), sum(a11.qty_sold) from orderfactpartclust2 a12 join orderdetailpartclust2 a11 on (a11.order_id = a12.order_id) where (a11.order_date = '09-30-2008' and a12.order_date = '2008-09-30') group by a12.shipper_id;

Total MapReduce jobs = 2

Launching Job 1 out of 2

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapred.reduce.tasks=<number>

Starting Job = job_201202131643_1294, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1294

Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1294

2012-04-10 16:15:06,663 Stage-1 map = 0%,  reduce = 0%

2012-04-10 16:15:08,671 Stage-1 map = 1%,  reduce = 0%

2012-04-10 16:15:09,675 Stage-1 map = 3%,  reduce = 0%

2012-04-10 16:15:10,679 Stage-1 map = 4%,  reduce = 0%

2012-04-10 16:15:11,683 Stage-1 map = 5%,  reduce = 0%

2012-04-10 16:15:12,688 Stage-1 map = 7%,  reduce = 0%

2012-04-10 16:15:13,692 Stage-1 map = 8%,  reduce = 0%

2012-04-10 16:15:14,697 Stage-1 map = 10%,  reduce = 0%

2012-04-10 16:15:15,756 Stage-1 map = 12%,  reduce = 0%

2012-04-10 16:15:16,761 Stage-1 map = 13%,  reduce = 0%

2012-04-10 16:15:17,767 Stage-1 map = 14%,  reduce = 0%

2012-04-10 16:15:18,773 Stage-1 map = 16%,  reduce = 0%

2012-04-10 16:15:19,778 Stage-1 map = 17%,  reduce = 1%

2012-04-10 16:15:20,784 Stage-1 map = 18%,  reduce = 1%

2012-04-10 16:15:21,789 Stage-1 map = 20%,  reduce = 1%

2012-04-10 16:15:22,795 Stage-1 map = 21%,  reduce = 5%

2012-04-10 16:15:23,800 Stage-1 map = 23%,  reduce = 5%

2012-04-10 16:15:24,805 Stage-1 map = 24%,  reduce = 5%

2012-04-10 16:15:25,936 Stage-1 map = 25%,  reduce = 8%

2012-04-10 16:15:26,941 Stage-1 map = 27%,  reduce = 8%

2012-04-10 16:15:27,947 Stage-1 map = 28%,  reduce = 8%

2012-04-10 16:15:28,951 Stage-1 map = 30%,  reduce = 8%

2012-04-10 16:15:29,956 Stage-1 map = 31%,  reduce = 8%

2012-04-10 16:15:30,981 Stage-1 map = 32%,  reduce = 8%

2012-04-10 16:15:31,987 Stage-1 map = 34%,  reduce = 8%

2012-04-10 16:15:32,992 Stage-1 map = 35%,  reduce = 10%

2012-04-10 16:15:33,998 Stage-1 map = 37%,  reduce = 10%

2012-04-10 16:15:35,003 Stage-1 map = 38%,  reduce = 10%

2012-04-10 16:15:36,055 Stage-1 map = 40%,  reduce = 10%

2012-04-10 16:15:37,097 Stage-1 map = 42%,  reduce = 10%

2012-04-10 16:15:38,102 Stage-1 map = 43%,  reduce = 10%

2012-04-10 16:15:39,108 Stage-1 map = 44%,  reduce = 10%

2012-04-10 16:15:40,113 Stage-1 map = 46%,  reduce = 10%

2012-04-10 16:15:41,123 Stage-1 map = 47%,  reduce = 10%

2012-04-10 16:15:42,128 Stage-1 map = 49%,  reduce = 15%

2012-04-10 16:15:43,134 Stage-1 map = 50%,  reduce = 15%

2012-04-10 16:15:44,139 Stage-1 map = 53%,  reduce = 15%

2012-04-10 16:15:46,152 Stage-1 map = 54%,  reduce = 15%

2012-04-10 16:15:47,158 Stage-1 map = 57%,  reduce = 15%

2012-04-10 16:15:48,164 Stage-1 map = 58%,  reduce = 15%

2012-04-10 16:15:49,171 Stage-1 map = 60%,  reduce = 15%

2012-04-10 16:15:50,176 Stage-1 map = 61%,  reduce = 15%

2012-04-10 16:15:51,182 Stage-1 map = 63%,  reduce = 19%

2012-04-10 16:15:52,199 Stage-1 map = 65%,  reduce = 19%

2012-04-10 16:15:53,222 Stage-1 map = 66%,  reduce = 19%

2012-04-10 16:15:54,228 Stage-1 map = 68%,  reduce = 19%

2012-04-10 16:15:55,234 Stage-1 map = 70%,  reduce = 19%

2012-04-10 16:15:56,241 Stage-1 map = 71%,  reduce = 19%

2012-04-10 16:15:57,248 Stage-1 map = 73%,  reduce = 21%

2012-04-10 16:15:58,253 Stage-1 map = 75%,  reduce = 21%

2012-04-10 16:15:59,260 Stage-1 map = 76%,  reduce = 21%

2012-04-10 16:16:00,267 Stage-1 map = 79%,  reduce = 21%

2012-04-10 16:16:01,273 Stage-1 map = 80%,  reduce = 21%

2012-04-10 16:16:02,280 Stage-1 map = 81%,  reduce = 21%

2012-04-10 16:16:03,287 Stage-1 map = 83%,  reduce = 27%

2012-04-10 16:16:04,294 Stage-1 map = 84%,  reduce = 27%

2012-04-10 16:16:05,302 Stage-1 map = 86%,  reduce = 27%

2012-04-10 16:16:06,310 Stage-1 map = 87%,  reduce = 27%

2012-04-10 16:16:07,317 Stage-1 map = 90%,  reduce = 27%

2012-04-10 16:16:08,325 Stage-1 map = 91%,  reduce = 27%

2012-04-10 16:16:09,332 Stage-1 map = 92%,  reduce = 27%

2012-04-10 16:16:10,339 Stage-1 map = 94%,  reduce = 27%

2012-04-10 16:16:11,348 Stage-1 map = 95%,  reduce = 27%

2012-04-10 16:16:12,355 Stage-1 map = 97%,  reduce = 29%

2012-04-10 16:16:13,362 Stage-1 map = 99%,  reduce = 29%

2012-04-10 16:16:14,370 Stage-1 map = 100%,  reduce = 29%

2012-04-10 16:16:18,396 Stage-1 map = 100%,  reduce = 32%

2012-04-10 16:16:24,654 Stage-1 map = 100%,  reduce = 67%

2012-04-10 16:16:27,683 Stage-1 map = 100%,  reduce = 70%

2012-04-10 16:16:30,701 Stage-1 map = 100%,  reduce = 73%

2012-04-10 16:16:33,719 Stage-1 map = 100%,  reduce = 77%

2012-04-10 16:16:36,739 Stage-1 map = 100%,  reduce = 80%

2012-04-10 16:16:39,781 Stage-1 map = 100%,  reduce = 84%

2012-04-10 16:16:42,806 Stage-1 map = 100%,  reduce = 88%

2012-04-10 16:16:45,824 Stage-1 map = 100%,  reduce = 92%

2012-04-10 16:16:48,840 Stage-1 map = 100%,  reduce = 97%

2012-04-10 16:16:50,854 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_201202131643_1294

Launching Job 2 out of 2

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapred.reduce.tasks=<number>

Starting Job = job_201202131643_1295, Tracking URL = http://hadoop001:50030/jobdetails.jsp?jobid=job_201202131643_1295

Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hadoop001:6932 -kill job_201202131643_1295

2012-04-10 16:16:56,693 Stage-2 map = 0%,  reduce = 0%

2012-04-10 16:17:02,716 Stage-2 map = 100%,  reduce = 0%

2012-04-10 16:17:12,759 Stage-2 map = 100%,  reduce = 100%

Ended Job = job_201202131643_1295

OK

1       678832  67850   678832.0

2       1360529 135253  1360529.0

3       4784635 460994  4784635.0

Time taken: 131.748 seconds

hive>

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Tuesday, April 10, 2012 11:44 AM
To: user@hive.apache.org
Subject: [Marketing Mail] Re: Why BucketJoinMap consume too much memory

 

Hi Binh

      You are right; here both of your tables are of the same size. Loading 2GB of data into hash tables and then into temp files and so on would take some time. This time becomes negligible when, say, one table is 2GB and the other is 2TB; then you will notice the wide difference in performance between a common join and a bucketed map join.

      If one of the tables is small enough, a plain map join would be good; if it is of moderate size, then a bucketed map join.
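
For illustration only, a minimal sketch of the two variants (the table names t_big and t_small are hypothetical, not from this thread):

-- plain map join: the whole small table is built into one in-memory hash table
select /*+ MAPJOIN(t_small) */ t_big.k, t_small.v
from t_big join t_small on (t_big.k = t_small.k);

-- bucketed map join: with both tables bucketed on k, a mapper loads only the matching bucket
set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(t_small) */ t_big.k, t_small.v
from t_big join t_small on (t_big.k = t_small.k);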

 

Regards

Bejoy KS

 


From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org
Cc: 'Bejoy Ks' <bejoy_ks@yahoo.com>
Sent: Tuesday, April 10, 2012 8:10 AM
Subject: RE: Why BucketJoinMap consume too much memory

 

Hi Bejoy,

 

It worked like a charm. Thank you very much. I really, really appreciate your help.

 

This bucketed map join should be used with one big table and one small table.

 

If both tables are big, the join time would be much longer than a normal join.

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Monday, April 09, 2012 9:49 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh

     It is just an issue with the number of buckets. Your tables have only 8 buckets, as only 8 files are seen in the storage directory. You might have just issued an ALTER TABLE script on an existing bucketed table. The workaround here is:

 

1) You need to wipe and reload the tables with hive.enforce.bucketing=true;

       Ensure your storage directory has as many files as the number of buckets. As per your table DDL you should see 256 files (a sketch of such a reload follows below).

 

2) Enable hive.optimize.bucketmapjoin = true; and try doing the join again.

 

It should definitely work.
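
For reference, a minimal sketch of such a wipe-and-reload (the CREATE TABLE below is an assumption for illustration, since the original DDL was not posted in this thread, and ra_md_syn_bucketed is a hypothetical name):

-- recreate the table with the intended bucketing (DDL assumed, not the original)
CREATE TABLE ra_md_syn_bucketed (
  calling         string,
  total_duration  bigint,
  total_volume    bigint,
  total_charge    bigint)
CLUSTERED BY (calling) INTO 256 BUCKETS;

-- reload with bucketing enforced so Hive writes one file per bucket (256 files)
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE ra_md_syn_bucketed
SELECT calling, total_duration, total_volume, total_charge FROM ra_md_syn;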

 

Regards

Bejoy KS

 


From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org
Cc: 'Bejoy Ks' <bejoy_ks@yahoo.com>
Sent: Monday, April 9, 2012 8:42 AM
Subject: RE: Why BucketJoinMap consume too much memory

 

Hi Bejoy,


Thank you for helping me. Here is the information

 

1.      Describe Formatted ra_md_syn;

# col_name              data_type               comment

calling                 string                  None

total_duration          bigint                  None

total_volume            bigint                  None

total_charge            bigint                  None

# Detailed Table Information

Database:               default

Owner:                  hduser

CreateTime:             Thu Apr 05 09:48:29 ICT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://master:54310/user/hive/warehouse/ra_md_syn

Table Type:             MANAGED_TABLE

Table Parameters:

        numFiles                8

        numPartitions           0

        numRows                 0

        rawDataSize             0

        totalSize               1872165483

        transient_lastDdlTime   1333595095

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat:            org.apache.hadoop.mapred.TextInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [calling]

Sort Columns:           []

Storage Desc Params:

        serialization.format    1

 

2.      Describe Formatted ra_ocs_syn;

# col_name              data_type               comment

calling                 string                  None

total_duration          bigint                  None

total_volume            bigint                  None

total_charge            bigint                  None

# Detailed Table Information

Database:               default

Owner:                  hduser

CreateTime:             Thu Apr 05 09:48:24 ICT 2012

LastAccessTime:         UNKNOWN

Protect Mode:           None

Retention:              0

Location:               hdfs://master:54310/user/hive/warehouse/ra_ocs_syn

Table Type:             MANAGED_TABLE

Table Parameters:

        numFiles                8

        numPartitions           0

        numRows                 0

        rawDataSize             0

        totalSize               1872162225

        transient_lastDdlTime   1333595512

# Storage Information

SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat:            org.apache.hadoop.mapred.TextInputFormat

OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed:             No

Num Buckets:            256

Bucket Columns:         [calling]

Sort Columns:           []

Storage Desc Params:

        serialization.format    1

 

3.      hadoop fs -du <hdfs location of ra_md_syn >

[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn

Found 8 items

280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1

280371407   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0

274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1

274374970   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1

269949415   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0

262439067   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0

205767721   hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0

24516526    hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0

 

4.      hadoop fs -du <hdfs location of  ra_ocs_syn >

[hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn

Found 8 items

314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0

314639270   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0

304959363   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0

274374381   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1

264694474   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0

257498693   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1

100334604   hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0

41022170    hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0

 

5.      hive-site.xml

[hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

<configuration>

        <property>

                <name>hive.metastore.local</name>

                <value>true</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionURL</name>

        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionDriverName</name>

        <value>com.mysql.jdbc.Driver</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionUserName</name>

        <value>hadoop</value>

        </property>

        <property>

        <name>javax.jdo.option.ConnectionPassword</name>

        <value>hadoop</value>

        </property>

 

 

        <property>

          <name>hive.metastore.sasl.enabled</name>

          <value>true</value>

          <description>If true, the metastore thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description>

        </property>

 

        <property>

          <name>hive.metastore.kerberos.keytab.file</name>

          <value></value>

          <description>The path to the Kerberos Keytab file containing the metastore thrift server's service principal.</description>

        </property>

 

        <property>

          <name>hive.metastore.kerberos.principal</name>

          <value>hduser/admin@EXAMPLE.COM</value>

          <description>The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.</description>

        </property>

</configuration>

 

6.      Performing JOIN

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

Total MapReduce jobs = 1

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.

Execution log at: /tmp/hduser/hduser_20120409013131_da40d787-597d-490c-8558-9d10ec11e916.log

2012-04-09 01:31:24     Starting to launch local task to process map join;      maximum memory = 1398145024

2012-04-09 01:31:28     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75405504        rate:   0.054

2012-04-09 01:31:29     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111540296       rate:   0.08

2012-04-09 01:31:32     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151640080       rate:   0.108

2012-04-09 01:31:35     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185503416       rate:   0.133

2012-04-09 01:31:37     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221503440       rate:   0.158

2012-04-09 01:31:42     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257484264       rate:   0.184

2012-04-09 01:31:47     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297678568       rate:   0.213

2012-04-09 01:31:52     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333678592       rate:   0.239

2012-04-09 01:31:57     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369678568       rate:   0.264

2012-04-09 01:32:03     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405678568       rate:   0.29

2012-04-09 01:32:09     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441678592       rate:   0.316

2012-04-09 01:32:15     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477678568       rate:   0.342

2012-04-09 01:32:23     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513678592       rate:   0.367

2012-04-09 01:32:29     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549678568       rate:   0.393

2012-04-09 01:32:35     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602455824       rate:   0.431

2012-04-09 01:32:45     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630067176       rate:   0.451

2012-04-09 01:32:53     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666067176       rate:   0.476

2012-04-09 01:33:01     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702067200       rate:   0.502

2012-04-09 01:33:09     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738067176       rate:   0.528

2012-04-09 01:33:20     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774254456       rate:   0.554

2012-04-09 01:33:29     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810067176       rate:   0.579

2012-04-09 01:33:38     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846568480       rate:   0.605

2012-04-09 01:33:49     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882096752       rate:   0.631

2012-04-09 01:33:59     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918821920       rate:   0.657

2012-04-09 01:34:15     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954134920       rate:   0.682

2012-04-09 01:34:26     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990067168       rate:   0.708

2012-04-09 01:34:38     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1027113288      rate:   0.735

Exception in thread "Thread-2" = java.lang.OutOfMemoryError: Java heap space

        at java.util.Arrays.copyOf(Arrays.java:2882)

        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java= :100)

        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:597)

        at java.lang.StringBuilder.append(StringBuilder.java:212)

        at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:247)

        at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:232)<= /span>

Exception in thread "Thread-1" = java.lang.OutOfMemoryError: Java heap space

        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:315)

        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:310)

        at java.util.jar.Manifest.read(Manifest.java:178)

        at java.util.jar.Manifest.<init>(Manifest.java:52)

        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:167)

        at java.util.jar.JarFile.getManifest(JarFile.java:148)

        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:696)

        at java.net.URLClassLoader.defineClass(URLClassLoader.java:228)

        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)

        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)

        at java.security.AccessController.doPrivileged(Native Method)

        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)

        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

        at org.apache.hadoop.util.RunJar$1.run(RunJar.java:126)

Execution failed with exit status: 2

Obtaining error information

 

Task failed!

Task ID:

  Stage-3

 

Logs:

 

/tmp/hduser/hive.log

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

7.      /tmp/hduser/hive.log

2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.

2012-04-09 02:00:10,654 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.

2012-04-09 02:00:10,657 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.

2012-04-09 02:00:12,796 WARN  parse.SemanticAnalyzer (SemanticAnalyzer.java:genBodyPlan(5821)) - Common Gby keys:null

2012-04-09 02:09:02,356 ERROR exec.Task (SessionState.java:printError(380)) - Execution failed with exit status: 2

2012-04-09 02:09:02,357 ERROR exec.Task (SessionState.java:printError(380)) - Obtaining error information

2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) -

Task failed!

Task ID:

  Stage-3

 

Logs:

 

2012-04-09 02:09:02,358 ERROR exec.Task (SessionState.java:printError(380)) - /tmp/hduser/hive.log

2012-04-09 02:09:02,359 ERROR exec.MapredLocalTask (MapredLocalTask.java:execute(228)) - Execution failed with exit status: 2

2012-04-09 02:09:02,377 ERROR ql.Driver (SessionState.java:printError(380)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

 

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Friday, April 06, 2012 11:33 PM
To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh,

        From the information you provided, a bucketed map join should be possible. I'm clueless now, but I can make one more try if you could provide me the output of the following:

 

1) Describe Formatted ra_md_syn;

2) Describe Formatted ra_ocs_syn.

 

3) hadoop fs -du <hdfs location of ra_ocs_syn >

4) hadoop fs -du <hdfs location of  ra_md_syn >

 

5) perform the join and paste the full console log along with the query (with all the properties set at CLI)

 

6) your hive-site.xml

 

@Alex

       You can use non-equality conditions in the WHERE clause. Only the ON conditions should be equality ones.

 

 

Regards

Bejoy KS

 

 


From: gemini alex <gemini5201314@gmail.com>
To: user@hive.apache.org
Cc: Bejoy Ks <bejoy_ks@yahoo.com>
Sent: Friday, April 6, 2012 12:36 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

I guess the problem is that you can't use a <> predicate in a bucketed map join; try:

select c.* from (

select /*+ MAPJOIN(b) */ a.calling calling, a.total_volume atotal_volume, b.total_volume btotal_volume from ra_md_syn a join ra_ocs_syn b

     on (a.calling = b.calling) ) c where c.atotal_volume <> c.btotal_volume;

 

 

 

On April 6, 2012 at 9:19 AM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Hi Bejoy,

 

Sorry for the late response. I will demonstrate it over again to clarify some information.

 

I have 2 tables that are nearly the same. Both have the same table structure, 65m records, and 2GB size (same size).

hive> describe ra_md_syn;

OK

calling string

total_duration  bigint

total_volume    bigint

total_charge    bigint

 

Both of them were bucketed into 256 buckets on the 'calling' column (last time only 10 buckets; I increased it as you suggested). And I want to find all 'calling' values that exist in both tables but with a different 'total_volume'.

The script, as you know:

 

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

And the result was exactly as in my last email: a Java heap space error. With a total size of only 2GB and 256 buckets, I think bucket size cannot be the issue here.

 

Please give me some advice; I would really appreciate it.

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Thursday, April 05, 2012 7:23 PM


To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Binh

 

    I was just checking your local map join log, and I noticed two things:

- the memory usage by one hash table has gone beyond 1G.

- the number of rows processed is just 2M

 

It is possible that each bucket itself is too large to be loaded into memory.

 

As a workaround, or to nail down whether bucket size is the issue here, can you try increasing the number of buckets to 100 and doing a bucketed map join again?

 

Also, you mentioned the data size is 2GB; is that the compressed data size?

 

2012-04-05 10:41:07     Processing rows:        2,900,000 Hashtable size: 2899999 Memory usage:   1,062,065,576      rate:   0.76

 

Regards

Bejoy KS

 

 

 


From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 5, 2012 5:03 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

Can you tell me the size of table b? 

 

If you are doing bucketing and the size of table b is still huge, then it will hit this problem.

On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Thanks Nitin,

 

I tried but no luck. Here's the hive log; please spend a little time to view it.

 

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.enforce.bucketing=true;

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b

    > on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

Total MapReduce jobs = 1

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.

Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log

2012-04-05 10:37:45     Starting to launch local task to process map join;      maximum memory = 1398145024

2012-04-05 10:37:48     Processing rows:        200000  Hashtable size: 199999  Memory usage:   75403880        rate:   0.054

2012-04-05 10:37:50     Processing rows:        300000  Hashtable size: 299999  Memory usage:   111404664       rate:   0.08

2012-04-05 10:37:54     Processing rows:        400000  Hashtable size: 399999  Memory usage:   151598960       rate:   0.108

2012-04-05 10:38:04     Processing rows:        500000  Hashtable size: 499999  Memory usage:   185483368       rate:   0.133

2012-04-05 10:38:09     Processing rows:        600000  Hashtable size: 599999  Memory usage:   221483392       rate:   0.158

2012-04-05 10:38:13     Processing rows:        700000  Hashtable size: 699999  Memory usage:   257482640       rate:   0.184

2012-04-05 10:38:19     Processing rows:        800000  Hashtable size: 799999  Memory usage:   297676944       rate:   0.213

2012-04-05 10:38:22     Processing rows:        900000  Hashtable size: 899999  Memory usage:   333676968       rate:   0.239

2012-04-05 10:38:27     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   369676944       rate:   0.264

2012-04-05 10:38:31     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   405676968       rate:   0.29

2012-04-05 10:38:36     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   441676944       rate:   0.316

2012-04-05 10:38:42     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   477676944       rate:   0.342

2012-04-05 10:38:47     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   513676968       rate:   0.367

2012-04-05 10:38:52     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   549676944       rate:   0.393

2012-04-05 10:39:00     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   602454200       rate:   0.431

2012-04-05 10:39:08     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   630065552       rate:   0.451

2012-04-05 10:39:14     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   666065552       rate:   0.476

2012-04-05 10:39:20     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   702065552       rate:   0.502

2012-04-05 10:39:26     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   738065576       rate:   0.528

2012-04-05 10:39:36     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   774065552       rate:   0.554

2012-04-05 10:39:43     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   810065552       rate:   0.579

2012-04-05 10:39:51     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   846065576       rate:   0.605

2012-04-05 10:40:16     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   882085136       rate:   0.631

2012-04-05 10:40:24     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   918085208       rate:   0.657

2012-04-05 10:40:39     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   954065544       rate:   0.682

2012-04-05 10:40:48     Processing rows:        2700000 Hashtable size: 2699999 Memory usage:   990065568       rate:   0.708

2012-04-05 10:40:56     Processing rows:        2800000 Hashtable size: 2799999 Memory usage:   1026065552      rate:   0.734

2012-04-05 10:41:07     Processing rows:        2900000 Hashtable size: 2899999 Memory usage:   1062065576      rate:   0.76

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Thursday, April 05, 2012 5:36 PM


To: user@hive.apache.org
Subject: Re: Why BucketJoinMap consume too much memory

 

Can you try adding these settings:

set hive.enforce.bucketing=true;

hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

 

I have tried bucketing with 1000 buckets on tables of more than 1TB of data; they do go through fine.

 

 

On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <Binhnt22@viettel.com.vn> wrote:

Hi Bejoy,

 

Both my tables have 65m records (~1.8-1.9GB on hadoop) and are bucketed on the 'calling' column into 10 buckets.

 

As you said, hive will load only 1 bucket, ~180-190MB, into memory. That is hardly enough to blow the heap (1.3GB).

 

According to the wiki, I set:

 

  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

  set hive.optimize.bucketmapjoin = true;

  set hive.optimize.bucketmapjoin.sortedmerge = true;

 

And ran the following SQL:

 

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b

on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

But it still created many hash tables and then threw a Java heap space error.
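
One note on the third setting: as far as I know, hive.optimize.bucketmapjoin.sortedmerge only kicks in when both tables are also sorted on the join column and have the same number of buckets. A minimal sketch of the DDL that variant would need (the table name here is hypothetical, not from this thread):

-- sort-merge bucket map join needs tables bucketed AND sorted on the join key
CREATE TABLE ra_md_syn_smb (
  calling         string,
  total_duration  bigint,
  total_volume    bigint,
  total_charge    bigint)
CLUSTERED BY (calling) SORTED BY (calling) INTO 256 BUCKETS;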

 

Best regards

Nguyen Thanh Binh (Mr)

Cell phone: (+84)98.226.0622

 

From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Thursday, April 05, 2012 3:07 PM
To: user@hive.apache.org


Subject: Re: Why BucketJoinMap consume too much memory

 

Hi Amit

 

      Sorry for the delayed response; I had a terrible schedule. AFAIK, there are no flags that would move the hash table creation, compression, and loading into tmp files away from the client node.

      From my understanding, if you use a map-side join, the small table as a whole is converted into a hash table and compressed into a tmp file. Say your child JVM size is 1GB and this small table is 5GB; it would blow up your job if a map task tried to pull such a huge file into memory. A bucketed map join can help here: if the table is bucketed into, say, 100 buckets, then each bucket holds around 50MB of data, i.e. one tmp file would be just under 50MB. A mapper then needs to load only the required buckets into memory and thus hardly runs into memory issues.

    Also, on the client, the records are processed bucket by bucket and loaded into tmp files. So if your bucket size is larger than the heap size specified for your client, it would throw an out-of-memory error.
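
A minimal sketch of the client-side knobs usually involved here; the values are illustrative assumptions, not settings taken from this thread:

-- heap for the client-side local task comes from the client JVM,
-- e.g. export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (as mentioned elsewhere in this thread)
-- fraction of that heap the local map-join task may use for its hash table before it aborts itself
set hive.mapjoin.localtask.max.memory.usage=0.90;
-- load only the matching buckets rather than the whole small table
set hive.optimize.bucketmapjoin=true;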

 

Regards

Bejoy KS

 


From: Amit Sharma <amitsharma1708@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com>
Sent: Tuesday, April 3, 2012 11:06 PM
Subject: Re: Why BucketJoinMap consume too much memory

 

I am experiencing similar behavior in my queries. All the conditions for a bucketed map join are met, and the only difference in execution when I set the hive.optimize.bucketmapjoin flag to true is that instead of a single hash table, multiple hash tables are created. All the hash tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using the distributed cache.

Can I find an example anywhere that shows the behavior of a bucketed map join where it does not create the hash tables on the client itself? If so, is there a flag for it?

Thanks,
Amit

On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:

Hi
    At first look, it seems like a plain map join is happening in your case rather than a bucketed map join. The following conditions need to hold for a bucketed map join to work (a DDL sketch follows below):
1) Both the tables are bucketed on the join columns
2) The number of buckets in each table should be multiples of each other
3) Ensure that the tables have a large enough number of buckets

Note: If the data is large, say 1TB per table, and you have just a few buckets, say 100, each mapper may have to load 10GB or more. This would definitely blow your JVM. The bottom line: ensure your mappers are not heavily loaded by the bucketed data distribution.
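
A minimal sketch of DDL that satisfies those conditions (table and column names are hypothetical; 256 and 64 are chosen only because one is a multiple of the other):

-- both tables bucketed on the join column, bucket counts multiples of each other
CREATE TABLE fact_tbl (k string, v bigint) CLUSTERED BY (k) INTO 256 BUCKETS;
CREATE TABLE dim_tbl  (k string, v bigint) CLUSTERED BY (k) INTO 64 BUCKETS;

-- load both with bucketing enforced, then run the join with bucketed map join enabled
SET hive.enforce.bucketing=true;
SET hive.optimize.bucketmapjoin=true;
SELECT /*+ MAPJOIN(dim_tbl) */ f.k, d.v
FROM fact_tbl f JOIN dim_tbl d ON (f.k = d.k);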

Regards
Bejoy.K.S


From: binhnt22 <Binhnt22@viettel.com.vn>
To: user@hive.apache.org
Sent: Saturday, March 31, 2012 6:46 AM
Subject: Why BucketJoinMap consume too much memory

 

I have 2 tables, each with 6 million records, clustered into 10 buckets.

 

These tables are very simple, with 1 key column and 1 value column; all I want is to get the keys that exist in both tables but with a different value.

 

The normal join did the trick; it took only 141 secs.

 

select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

I tried to use bucketed map join by setting:   set hive.optimize.bucketmapjoin = true

 

select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where  a.total_volume <> b.total_volume;

 

2012-03-30 11:35:09     Starting to launch local task to process map join;      maximum memory = 1398145024

2012-03-30 11:35:12     Processing rows:        200000  Hashtable size: 199999  Memory usage:   86646704        rate:   0.062

2012-03-30 11:35:15     Processing rows:        300000  Hashtable size: 299999  Memory usage:   128247464       rate:   0.092

2012-03-30 11:35:18     Processing rows:        400000  Hashtable size: 399999  Memory usage:   174041744       rate:   0.124

2012-03-30 11:35:21     Processing rows:        500000  Hashtable size: 499999  Memory usage:   214140840       rate:   0.153

2012-03-30 11:35:25     Processing rows:        600000  Hashtable size: 599999  Memory usage:   255181504       rate:   0.183

2012-03-30 11:35:29     Processing rows:        700000  Hashtable size: 699999  Memory usage:   296744320       rate:   0.212

2012-03-30 11:35:35     Processing rows:        800000  Hashtable size: 799999  Memory usage:   342538616       rate:   0.245

2012-03-30 11:35:38     Processing rows:        900000  Hashtable size: 899999  Memory usage:   384138552       rate:   0.275

2012-03-30 11:35:45     Processing rows:        1000000 Hashtable size: 999999  Memory usage:   425719576       rate:   0.304

2012-03-30 11:35:50     Processing rows:        1100000 Hashtable size: 1099999 Memory usage:   467319576       rate:   0.334

2012-03-30 11:35:56     Processing rows:        1200000 Hashtable size: 1199999 Memory usage:   508940504       rate:   0.364

2012-03-30 11:36:04     Processing rows:        1300000 Hashtable size: 1299999 Memory usage:   550521128       rate:   0.394

2012-03-30 11:36:09     Processing rows:        1400000 Hashtable size: 1399999 Memory usage:   592121128       rate:   0.424

2012-03-30 11:36:15     Processing rows:        1500000 Hashtable size: 1499999 Memory usage:   633720336       rate:   0.453

2012-03-30 11:36:22     Processing rows:        1600000 Hashtable size: 1599999 Memory usage:   692097568       rate:   0.495

2012-03-30 11:36:33     Processing rows:        1700000 Hashtable size: 1699999 Memory usage:   725308944       rate:   0.519

2012-03-30 11:36:40     Processing rows:        1800000 Hashtable size: 1799999 Memory usage:   766946424       rate:   0.549

2012-03-30 11:36:48     Processing rows:        1900000 Hashtable size: 1899999 Memory usage:   808527928       rate:   0.578

2012-03-30 11:36:55     Processing rows:        2000000 Hashtable size: 1999999 Memory usage:   850127928       rate:   0.608

2012-03-30 11:37:08     Processing rows:        2100000 Hashtable size: 2099999 Memory usage:   891708856       rate:   0.638

2012-03-30 11:37:16     Processing rows:        2200000 Hashtable size: 2199999 Memory usage:   933308856       rate:   0.668

2012-03-30 11:37:25     Processing rows:        2300000 Hashtable size: 2299999 Memory usage:   974908856       rate:   0.697

2012-03-30 11:37:34     Processing rows:        2400000 Hashtable size: 2399999 Memory usage:   1016529448      rate:   0.727

2012-03-30 11:37:43     Processing rows:        2500000 Hashtable size: 2499999 Memory usage:   1058129496      rate:   0.757

2012-03-30 11:37:58     Processing rows:        2600000 Hashtable size: 2599999 Memory usage:   1099708832      rate:   0.787

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

 

My system has 4 PCs, each with a CPU E2180, 2GB RAM, and an 80GB HDD. One of them contains the NameNode, JobTracker, and Hive Server, and all of them contain a DataNode and TaskTracker.

 

On all nodes, I set export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~1.3GB heap).

 

I want to ask you experts: why does the bucketed map join consume so much memory? Am I doing something wrong, or is my configuration bad?

 

Best regards,

 

 

 

 



 

--
Nitin Pawar



 

--
Nitin Pawar

 

 

 

 

 
