hudi-commits mailing list archives

From "ASF GitHub Bot (Jira)" <>
Subject [jira] [Updated] (HUDI-656) Write Performance - Driver spends too much time creating Parquet DataSource after writes
Date Tue, 10 Mar 2020 23:19:00 GMT


ASF GitHub Bot updated HUDI-656:
    Labels: pull-request-available  (was: )

> Write Performance - Driver spends too much time creating Parquet DataSource after writes
> ----------------------------------------------------------------------------------------
>                 Key: HUDI-656
>                 URL:
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Spark Integration
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.0
> h2. Problem Statement
> We have noticed this performance bottleneck at EMR, and it has been reported by others as well.
> For writes through the DataSource API, Hudi first uses HoodieSparkSqlWriter to write the dataframe, and then tries to return a relation by creating it through the Parquet data source.
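> A minimal sketch of that flow, assuming Spark's internal DataSource API; the class and method shapes are simplified and do not match the Hudi source exactly:
> {code:scala}
> import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
> import org.apache.spark.sql.execution.datasources.DataSource
> import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
>
> class DefaultSource extends CreatableRelationProvider {
>   override def createRelation(
>       sqlContext: SQLContext,
>       mode: SaveMode,
>       parameters: Map[String, String],
>       df: DataFrame): BaseRelation = {
>     // Step 1: the actual Hudi write (HoodieSparkSqlWriter is Hudi's writer).
>     HoodieSparkSqlWriter.write(sqlContext, mode, parameters, df)
>     // Step 2: build a brand-new Parquet relation over the base path.
>     // Resolving it constructs an InMemoryFileIndex, which is where the
>     // expensive listing and path filtering described below happen.
>     DataSource(
>       sparkSession = sqlContext.sparkSession,
>       className = "parquet",
>       paths = Seq(parameters("path")),
>       userSpecifiedSchema = Some(df.schema)
>     ).resolveRelation()
>   }
> }
> {code}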
> In the process of creating this Parquet data source, Spark creates an *InMemoryFileIndex*, as part of which it performs a file listing of the base path. While the listing itself is parallelized, the filter we pass, *HoodieROTablePathFilter*, is applied sequentially on the driver side to all of the thousands of files returned by the listing. Spark does not parallelize this step, and it takes a long time, most likely because of the filter's logic, leaving the driver to do nothing but filter. We have seen this take 10-12 minutes for just 50 partitions in S3, all of it after the writing has finished.
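> To illustrate the bottleneck (a simplified picture, not Spark's actual InMemoryFileIndex code): the listing fans out across the cluster, but the path filter runs in a plain driver-side loop:
> {code:scala}
> import org.apache.hadoop.fs.{FileStatus, PathFilter}
>
> // listedFiles: all leaf files gathered (in parallel) under the base path.
> // pathFilter:  e.g. HoodieROTablePathFilter.
> def applyFilterOnDriver(
>     listedFiles: Seq[FileStatus],
>     pathFilter: PathFilter): Seq[FileStatus] = {
>   // accept() may itself hit the file system (e.g. to read Hudi metadata),
>   // so calling it once per file, serially, over thousands of S3 objects
>   // dominates the post-write driver time described above.
>   listedFiles.filter(f => pathFilter.accept(f.getPath))
> }
> {code}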
> Solving this will significantly reduce writing time across all sorts of writes. This time is essentially wasted, because we do not really have to return a relation after the write: Spark never uses the returned relation anyway, and the write path returns an empty set of rows.
> h2. Proposed Solution
> The proposal is to return an empty Spark relation after the write, which cuts out all of the unnecessary time spent creating a Parquet relation that never gets used.
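> A minimal sketch of the proposed fix, assuming any BaseRelation can be returned here (the name HoodieEmptyRelation is hypothetical): a relation that only carries the written schema, so no Parquet data source and no InMemoryFileIndex are ever created:
> {code:scala}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.sources.BaseRelation
> import org.apache.spark.sql.types.StructType
>
> // Carries just enough to satisfy the DataSource contract. It is never
> // scanned, since Spark discards the relation returned from a write.
> class HoodieEmptyRelation(
>     override val sqlContext: SQLContext,
>     override val schema: StructType) extends BaseRelation
>
> // In createRelation, after HoodieSparkSqlWriter.write(...):
> //   new HoodieEmptyRelation(sqlContext, df.schema)
> {code}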
