hive-user mailing list archives

From "Xu, Cheng A" <cheng.a...@intel.com>
Subject RE: [Hive] Slow Loading Data Process with Parquet over 30k Partitions
Date Tue, 14 Apr 2015 05:40:58 GMT
Hi Tianqi,
Could you attach hive.log for more detailed information?
+Sergio

Yours,
Ferdinand Xu

From: Tianqi Tong [mailto:ttong@brightedge.com]
Sent: Friday, April 10, 2015 1:34 AM
To: user@hive.apache.org
Subject: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

Hello Hive,
I'm a developer using Hive to process TB-level data, and I'm having some difficulty loading
the data into a table.
I have 2 tables now:

-- table_1:
CREATE EXTERNAL TABLE `table_1`(
  `keyword` string,
  `domain` string,
  `url` string
  )
PARTITIONED BY (yearmonth INT, partition1 STRING)
STORED AS RCFILE;

-- table_2:
CREATE EXTERNAL TABLE `table_2`(
  `keyword` string,
  `domain` string,
  `url` string
  )
PARTITIONED BY (yearmonth INT, partition2 STRING)
STORED AS PARQUET;

I'm doing an INSERT OVERWRITE into table_2 from a SELECT over table_1 with dynamic partitioning,
and the number of partitions grows dramatically from 1,500 to 40k (because I want to partition
on a different column).
The MapReduce job itself finished fine.
However, the process got stuck at "Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)",
and I've been waiting for hours.

Is this expected when we have 40k partitions?

--------------------------------------------------------------
Refs - Here are the parameters that I used:
export HADOOP_HEAPSIZE=16384
set PARQUET_FILE_SIZE=268435456;
set parquet.block.size=268435456;
set dfs.blocksize=268435456;
set parquet.compression=SNAPPY;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=500000;
SET hive.exec.max.dynamic.partitions.pernode=50000;
SET hive.exec.max.created.files=1000000;


Thank you very much!
Tianqi Tong
