Date: Mon, 30 Mar 2015 22:01:53 +0000 (UTC)
From: Sergio Peña (JIRA)
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-10149) Shuffle Hive data before storing in Parquet

     [ https://issues.apache.org/jira/browse/HIVE-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Peña updated HIVE-10149:
-------------------------------
    Description: 
Hive can run into OOM (out-of-memory) exceptions when writing many dynamic partitions to Parquet, because it keeps too many files open at once and Parquet buffers an entire row group of data in memory for each open file. To avoid this in ORC, HIVE-6455 shuffles the data for each partition so that only one file is open at a time. We need to extend this support to Parquet, and possibly to the MR and Spark planners.

Steps to reproduce:

1.
Create a table and load some data that contains many partitions (file 'data.txt' attached to this ticket).

{code}
hive> create table t1_stage(id bigint, rdate string) row format delimited fields terminated by ' ';
hive> load data local inpath 'data.txt' into table t1_stage;
{code}

2. Create a Parquet table, and insert the partitioned data from the t1_stage table.

{noformat}
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table t1_part(id bigint) partitioned by (rdate string) stored as parquet;
hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
Query ID = sergio_20150330163713_db3afe74-d1c7-4f0d-a8f1-f2137ddb64a4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1427748520315_0006, Tracking URL = http://victory:8088/proxy/application_1427748520315_0006/
Kill Command = /opt/local/hadoop/bin/hadoop job -kill job_1427748520315_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-30 16:37:19,065 Stage-1 map = 0%, reduce = 0%
2015-03-30 16:37:43,947 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1427748520315_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1427748520315_0006_m_000000 (and more) from job job_1427748520315_0006

Task with the most failures(4):
-----
Task ID:
  task_1427748520315_0006_m_000000

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1427748520315_0006&tipid=task_1427748520315_0006_m_000000
-----
Diagnostic Messages for this Task:
Error: Java heap space

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
{noformat}


was: (previous revision of the description; identical except that the attachment was described as "attached a file with sample data" rather than by name)
> Shuffle Hive data before storing in Parquet
> -------------------------------------------
>
>                 Key: HIVE-10149
>                 URL: https://issues.apache.org/jira/browse/HIVE-10149
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 1.1.0
>            Reporter: Sergio Peña
>         Attachments: data.txt
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
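The "Java heap space" failure above follows directly from the buffering model described in this ticket: each open Parquet partition writer holds up to one full row group in memory, so a single task writing N partitions needs roughly N times the row-group size in heap. A back-of-envelope sketch of that arithmetic (illustrative only; 128 MB is Parquet's customary default row-group size, and the partition count below is hypothetical, not taken from this ticket):

```python
# Rough estimate of write-buffer memory for one task that keeps a Parquet
# writer open per dynamic partition. Assumption: each open writer buffers
# up to one full row group; 128 MB is a common Parquet default.
ROW_GROUP_BYTES = 128 * 1024 * 1024

def estimated_buffer_bytes(open_partition_writers: int,
                           row_group_bytes: int = ROW_GROUP_BYTES) -> int:
    """Rough lower bound on buffered bytes held by one task."""
    return open_partition_writers * row_group_bytes

# A hypothetical single mapper that has seen 50 distinct rdate values:
mb = estimated_buffer_bytes(50) // (1024 * 1024)
print(mb)  # 6400 -- megabytes of buffer space, far beyond a typical mapper heap
```

Shuffling rows by the partition key before they reach the writers, as HIVE-6455 does for ORC (or manually, by appending `distribute by rdate` to the insert query), leaves each task with a single open writer, collapsing this footprint to roughly one row group.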