spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From linxi zeng <>
Subject Re: spark sql job create too many files in HDFS when doing insert overwrite hive table
Date Thu, 28 Apr 2016 10:09:49 GMT
BTW, I have created a JIRA task to follow this issue:

2016-04-28 18:08 GMT+08:00 linxi zeng <>:

> Hi,
> Recently, we often encounter problems using spark sql for inserting data
> into a partition table (ex.: insert overwrite table $output_table
> partition(dt) select xxx from tmp_table).
> After the spark job start running on yarn, *the app will create too many
> files (ex. 200w+, or even 1000w+), which will make HDFS under enormous
> pressure*.
> We found that the num of files created by spark job is depending on the
> partition num of hive table that will be inserted and the num of spark sql
> partitions.
> *files_num = hive_table_partions_num * spark_sql_partitions_num*.
> We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions)
> >= 1000, and the hive_table_partions_num is very small under normal
> circumstances, but it will turn out to be more than 2000 when we input a
> wrong field as the partion field unconsciously, which will make the
> files_num >= 1000 * 2000 = 200w.
> There is a configuration parameter in hive that can limit the maximum
> number of dynamic partitions allowed to be created in each mapper/reducer
> named *hive.exec.max.dynamic.partitions.pernode*, but this conf parameter
> did't work when we use hiveContext.
> Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make
> the files_num be smaller, but it will affect the concurrency.
> Can we create configuration parameters to  limit the maximum number of
> files allowed to be create by each task or limit the
> spark_sql_partitions_num without affect the concurrency?

View raw message