spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
Date Wed, 28 Nov 2018 10:04:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26164:
------------------------------------

    Assignee: Apache Spark

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-26164
>                 URL: https://issues.apache.org/jira/browse/SPARK-26164
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Cheng Su
>            Assignee: Apache Spark
>            Priority: Minor
>
> Problem:
> Currently, Spark always requires a local sort on the partition/bucket columns before
> writing to an output table [1]. The disadvantage is that the sort may waste reserved
> CPU time on the executor due to spill. Hive does not require a local sort before
> writing the output table [2], and we saw a performance regression when migrating Hive
> workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping from file path to output writer.
> When a row is written to a new file path, we create a new output writer; otherwise,
> we re-use the existing output writer for that path (the main change should be in
> FileFormatDataWriter.scala). This is very similar to what Hive does in [2].
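>
> A minimal sketch of the writer re-use idea (the OutputWriter trait and all names here
> are simplified stand-ins for illustration, not the actual FileFormatDataWriter code):
>
>   import scala.collection.mutable
>
>   // Simplified stand-in for Spark's output writer; only what the sketch needs.
>   trait OutputWriter {
>     def write(row: String): Unit
>     def close(): Unit
>   }
>
>   // Keeps one open writer per output file path, so rows for different
>   // partitions/buckets can arrive in any order without an up-front sort.
>   class ConcurrentWritersSketch(newWriter: String => OutputWriter) {
>     private val writers = mutable.Map.empty[String, OutputWriter]
>
>     def writeRow(path: String, row: String): Unit = {
>       // Re-use the writer if this path was seen before, otherwise open a new one.
>       val writer = writers.getOrElseUpdate(path, newWriter(path))
>       writer.write(row)
>     }
>
>     // All writers stay open until the task finishes, which is where the extra
>     // executor memory cost comes from.
>     def closeAll(): Unit = {
>       writers.values.foreach(_.close())
>       writers.clear()
>     }
>   }
>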
> The new behavior (i.e. avoiding the sort by keeping multiple output writers) consumes
> more executor memory than the current behavior (i.e. only one output writer open),
> because multiple output writers need to be open at the same time. We can therefore add
> a config to switch between the current and new behavior.
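>
> A rough sketch of how such a switch could be read (the config key below is made up for
> illustration only; the real SQLConf entry would be introduced with the patch):
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder().master("local[*]").getOrCreate()
>
>   // Hypothetical config key, for illustration only.
>   val useConcurrentWriters =
>     spark.conf.get("spark.sql.hypothetical.concurrentOutputWriters", "false").toBoolean
>
>   if (useConcurrentWriters) {
>     // New behavior: keep one open writer per output file, skip the local sort.
>   } else {
>     // Current behavior: sort rows by partition/bucket columns, one open writer.
>   }
>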
>  
> [1]: spark FileFormatWriter.scala - [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

