spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Owen (JIRA)" <>
Subject [jira] [Resolved] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table
Date Wed, 17 May 2017 13:36:04 GMT


Sean Owen resolved SPARK-14974.
    Resolution: Not A Problem

> spark sql job create too many files in HDFS when doing insert overwrite hive table
> ----------------------------------------------------------------------------------
>                 Key: SPARK-14974
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: zenglinxi
>            Priority: Minor
> Recently, we often encounter problems using spark sql for inserting data into a partition
table (ex.: insert overwrite table $output_table partition(dt) select xxx from tmp_table).
> After the spark job start running on yarn, the app will create too many files (ex. 2,000,000,
or even 10,000,000), which will make HDFS under enormous pressure.
> We found that the num of files created by spark job is depending on the partition num
of hive table that will be inserted and the num of spark sql partitions. 
> files_num = hive_table_partions_num *  spark_sql_partitions_num.
> We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 1000,
and the hive_table_partions_num is very small under normal circumstances, but it will turn
out to be more than 2000 when we input a wrong field as the partion field unconsciously, which
will make the files_num >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in hive that can limit the maximum number of dynamic
partitions allowed to be created in each mapper/reducer named hive.exec.max.dynamic.partitions.pernode,
but this conf parameter did't work when we use hiveContext.
> Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the files_num
be smaller, but it will affect the concurrency.
> Can we create configuration parameters to  limit the maximum number of files allowed
to be create by each task or limit the spark_sql_partitions_num without affect the concurrency?

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message