hudi-commits mailing list archives

From "Udit Mehrotra (Jira)" <j...@apache.org>
Subject [jira] [Created] (HUDI-656) Write Performance - Driver spends too much time creating Parquet DataSource after writes
Date Thu, 05 Mar 2020 01:18:00 GMT
Udit Mehrotra created HUDI-656:
----------------------------------

             Summary: Write Performance - Driver spends too much time creating Parquet DataSource after writes
                 Key: HUDI-656
                 URL: https://issues.apache.org/jira/browse/HUDI-656
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Performance, Spark Integration
            Reporter: Udit Mehrotra


h2. Problem Statement

We have noticed this performance bottleneck at EMR, and it has also been reported here:
[https://github.com/apache/incubator-hudi/issues/1371]

For writes through the DataSource API, Hudi uses [this|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L85]
createRelation to create the Spark relation. It first uses HoodieSparkSqlWriter to write the
dataframe, and afterwards tries to [return|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L92]
a relation by creating one through the Parquet data source [here|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L72]
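
For reference, the write path has roughly the following shape (a simplified sketch of the linked code, with signatures abbreviated, not the exact source):
{code:scala}
// Simplified sketch of DefaultSource's write-side createRelation:
override def createRelation(sqlContext: SQLContext,
                            mode: SaveMode,
                            parameters: Map[String, String],
                            df: DataFrame): BaseRelation = {
  // 1. Perform the actual write:
  HoodieSparkSqlWriter.write(sqlContext, mode, parameters, df)
  // 2. Then build a full Parquet relation over the base path, which is where
  //    the InMemoryFileIndex listing and HoodieROTablePathFilter cost is paid:
  createRelation(sqlContext, parameters, df.schema)
}
{code}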

In the process of creating this Parquet data source, Spark creates an *InMemoryFileIndex* [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L371],
which performs a file listing of the base path. While the listing itself is [parallelized|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L289],
the filter we pass, *HoodieROTablePathFilter*, is applied [sequentially|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L294]
on the driver to every one of the thousands of files returned by the listing. Spark does not
parallelize this step, and it takes a long time, most likely because of the filter's logic,
so the driver spends all of that time just filtering. We have seen this take 10-12 minutes
for as few as 50 partitions in S3, and all of that time is spent after the write has finished.
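
The relevant shape of the Spark 2.4.4 code is roughly the following (a paraphrase of the linked lines, not the exact source):
{code:scala}
import org.apache.hadoop.fs.{FileStatus, PathFilter}

// Paraphrased shape of the filtering step in Spark 2.4.4's InMemoryFileIndex:
// the leaf-file listing itself runs as a distributed Spark job, but the
// user-supplied PathFilter (HoodieROTablePathFilter in our case) is applied
// afterwards, one FileStatus at a time, on the driver.
def applyPathFilter(leafStatuses: Seq[FileStatus], filter: PathFilter): Seq[FileStatus] =
  if (filter != null) leafStatuses.filter(f => filter.accept(f.getPath))
  else leafStatuses
{code}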

Solving this will significantly reduce write time across all kinds of writes. The time is
essentially wasted, because we do not actually have to return a usable relation after the
write: Spark never uses it anyway [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L45],
and the write path returns an empty set of rows.
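
For reference, the run method of SaveIntoDataSourceCommand in Spark 2.4.4 is essentially:
{code:scala}
// The BaseRelation returned by createRelation is discarded, and the command
// itself yields no rows.
override def run(sparkSession: SparkSession): Seq[Row] = {
  dataSource.createRelation(
    sparkSession.sqlContext, mode, options, Dataset.ofRows(sparkSession, query))
  Seq.empty[Row]
}
{code}
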
h2. Proposed Solution

The proposal is to return an empty Spark relation after the write, cutting out all of the
unnecessary time spent creating a Parquet relation that never gets used.
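
A minimal sketch of what this could look like (the class name and exact wiring are hypothetical, not a committed design):
{code:scala}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: a relation that only carries the written dataframe's
// schema and owns no files, so constructing it triggers no listing and no
// path filtering.
class HoodieEmptyRelation(override val sqlContext: SQLContext,
                          override val schema: StructType) extends BaseRelation
{code}
DefaultSource's write-side createRelation would then end with {{new HoodieEmptyRelation(sqlContext, df.schema)}} instead of calling into the Parquet data source.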


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
