hudi-commits mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-656) Write Performance - Driver spends too much time creating Parquet DataSource after writes
Date Tue, 10 Mar 2020 23:19:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-656:
--------------------------------
    Labels: pull-request-available  (was: )

> Write Performance - Driver spends too much time creating Parquet DataSource after writes
> ----------------------------------------------------------------------------------------
>
>                 Key: HUDI-656
>                 URL: https://issues.apache.org/jira/browse/HUDI-656
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Spark Integration
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>
> h2. Problem Statement
> We have noticed this performance bottleneck at EMR, and it has been reported here as well: [https://github.com/apache/incubator-hudi/issues/1371]
> For writes through the DataSource API, Hudi uses [this|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L85] to create the Spark relation. It uses HoodieSparkSqlWriter to write the dataframe, and afterwards it tries to [return|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L92] a relation by creating it through the parquet data source [here|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L72].
> In the process of creating this parquet data source, Spark creates an *InMemoryFileIndex* [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L371], as part of which it performs a file listing of the base path. While the listing itself is [parallelized|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L289], the filter we pass, *HoodieROTablePathFilter*, is applied [sequentially|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L294] on the driver side to all of the thousands of files returned by the listing. This part is not parallelized by Spark, and it takes a long time, probably because of the filter's logic, leaving the driver to do nothing but filter. We have seen this take 10-12 minutes for just 50 partitions on S3, and all of this time is spent after the write itself has finished.
> Solving this will significantly reduce the writing time across all sorts of writes. This time is essentially wasted, because we do not actually have to return a relation after the write: Spark never uses the returned relation either way [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L45], and the writing process returns an empty set of rows.
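For context, the Spark 2.4.4 command linked above boils down to roughly the following (a paraphrased fragment of SaveIntoDataSourceCommand, not a verbatim or self-contained copy): the relation returned by createRelation is never bound to anything, and the command always yields an empty row set.

```scala
// Paraphrase of Spark 2.4.4's SaveIntoDataSourceCommand.run (see the link above).
// The BaseRelation produced by createRelation is discarded immediately, so any
// work done to build it (e.g. the InMemoryFileIndex listing) is pure overhead.
override def run(sparkSession: SparkSession): Seq[Row] = {
  dataSource.createRelation(
    sparkSession.sqlContext, mode, options, Dataset.ofRows(sparkSession, query))
  Seq.empty[Row] // the write command itself returns no rows to the caller
}
```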
> h2. Proposed Solution
> The proposal is to return an empty Spark relation after the write, which cuts out all of this unnecessary time spent creating a parquet relation that never gets used.
>
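A minimal sketch of what such an empty relation could look like (an illustration against the Spark DataSource V1 API, not the actual Hudi patch; the class name is hypothetical): a BaseRelation that carries the written dataframe's schema but scans no data, so no file listing and no HoodieROTablePathFilter invocation ever happens.

```scala
// Hypothetical sketch: an empty relation to return from DefaultSource after a
// write, instead of building a parquet relation that Spark discards anyway.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

class EmptyWriteRelation(val sqlContext: SQLContext,
                         override val schema: StructType)
  extends BaseRelation with TableScan {
  // No base-path listing and no path filtering: the scan is a no-op.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}
```

Since SaveIntoDataSourceCommand never consumes the relation, returning a stub like this is behaviorally equivalent for the write path while skipping the driver-side listing and filtering entirely.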



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
