Date: Mon, 30 Mar 2015 22:01:53 +0000 (UTC)
From: Sergio Peña (JIRA)
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-10149) Shuffle Hive data before storing in Parquet

     [ https://issues.apache.org/jira/browse/HIVE-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Peña updated HIVE-10149:
-------------------------------
    Description: 
Hive can run into OOM (out-of-memory) exceptions when writing many dynamic partitions to Parquet, because it keeps too many files open at once and Parquet buffers an entire row group of data in memory for each open file. To avoid this in ORC, HIVE-6455 shuffles the data for each partition so that only one file is open at a time. We need to extend this support to Parquet, and possibly to the MR and Spark planners.

Steps to reproduce:

1.
Create a table and load some data that contains many partitions (file 'data.txt' attached to this ticket).

{code}
hive> create table t1_stage(id bigint, rdate string) row format delimited fields terminated by ' ';
hive> load data local inpath 'data.txt' into table t1_stage;
{code}

2. Create a Parquet table, and insert the partitioned data from the t1_stage table.

{noformat}
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table t1_part(id bigint) partitioned by (rdate string) stored as parquet;
hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
Query ID = sergio_20150330163713_db3afe74-d1c7-4f0d-a8f1-f2137ddb64a4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1427748520315_0006, Tracking URL = http://victory:8088/proxy/application_1427748520315_0006/
Kill Command = /opt/local/hadoop/bin/hadoop job -kill job_1427748520315_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-30 16:37:19,065 Stage-1 map = 0%, reduce = 0%
2015-03-30 16:37:43,947 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1427748520315_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1427748520315_0006_m_000000 (and more) from job job_1427748520315_0006

Task with the most failures(4):
-----
Task ID:
  task_1427748520315_0006_m_000000

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1427748520315_0006&tipid=task_1427748520315_0006_m_000000
-----
Diagnostic Messages for this Task:
Error: Java heap space

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
{noformat}


was: (previous revision of the description; identical except that the attachment was described as "attached a file with sample data" rather than by name)
> Shuffle Hive data before storing in Parquet
> -------------------------------------------
>
>                 Key: HIVE-10149
>                 URL: https://issues.apache.org/jira/browse/HIVE-10149
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 1.1.0
>            Reporter: Sergio Peña
>         Attachments: data.txt
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
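The "Java heap space" failure above follows directly from the buffering model described in this ticket: each open Parquet partition writer holds up to one full row group in memory, so a single task writing N partitions needs roughly N times the row-group size in heap. A back-of-envelope sketch of that arithmetic (illustrative only; 128 MB is Parquet's customary default row-group size, and the partition count below is hypothetical, not taken from this ticket):

```python
# Rough estimate of write-buffer memory for one task that keeps a Parquet
# writer open per dynamic partition. Assumption: each open writer buffers
# up to one full row group; 128 MB is a common Parquet default.
ROW_GROUP_BYTES = 128 * 1024 * 1024

def estimated_buffer_bytes(open_partition_writers: int,
                           row_group_bytes: int = ROW_GROUP_BYTES) -> int:
    """Rough lower bound on buffered bytes held by one task."""
    return open_partition_writers * row_group_bytes

# A hypothetical single mapper that has seen 50 distinct rdate values:
mb = estimated_buffer_bytes(50) // (1024 * 1024)
print(mb)  # 6400 -- megabytes of buffer space, far beyond a typical mapper heap
```

Shuffling rows by the partition key before they reach the writers, as HIVE-6455 does for ORC (or manually, by appending `distribute by rdate` to the insert query), leaves each task with a single open writer, collapsing this footprint to roughly one row group.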