hudi-commits mailing list archives

From "Feichi Feng (Jira)" <>
Subject [jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions
Date Mon, 23 Mar 2020 18:14:00 GMT


Feichi Feng commented on HUDI-724:

Hi [~vbalaji], is there anything else I need to address for the PR? 

> Parallelize GetSmallFiles For Partitions
> ----------------------------------------
>                 Key: HUDI-724
>                 URL:
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Feichi Feng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: gap.png, nogapAfterImprovement.png
>   Original Estimate: 48h
>          Time Spent: 0.5h
>  Remaining Estimate: 47.5h
> When writing data, a gap was observed between Spark stages. Tracking down where the
time was spent on the Spark driver showed it was the get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it uses a normal
for-loop to get the list of small files for all partitions that the job is going to load
data into, and the process is very slow when there are many partitions to go through. While
the operation is running on the Spark driver process, all other worker nodes sit idle
waiting for tasks.
> Those partitions don't affect each other, so the get-small-files operations
can be parallelized. The change I made is to pass the JavaSparkContext to the UpsertPartitioner,
create an RDD of the partitions, and distribute the get-small-files operations across multiple
executors.
> Screenshots attached showing:
> the gap without the improvement
> the Spark stage with the improvement (no gap)
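The core observation above is that per-partition small-file lookups are independent, so they can run concurrently instead of in a driver-side for-loop. A minimal sketch of that idea using Java parallel streams (the actual PR distributes the work via JavaSparkContext and an RDD of partition paths; the `smallFilesIn` helper and partition names here are hypothetical stand-ins, not Hudi APIs):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SmallFileLookup {

    // Hypothetical stand-in for the per-partition small-file scan,
    // which in Hudi would hit the filesystem/timeline for each partition.
    static List<String> smallFilesIn(String partition) {
        return Arrays.asList(partition + "/f1", partition + "/f2");
    }

    // Sequential version: the original pattern, one partition at a
    // time on a single thread (analogous to the driver-side for-loop).
    static Map<String, List<String>> sequential(List<String> partitions) {
        return partitions.stream()
            .collect(Collectors.toMap(Function.identity(), SmallFileLookup::smallFilesIn));
    }

    // Parallelized version: the independent lookups run concurrently,
    // mirroring the idea of fanning the work out across executors.
    static Map<String, List<String>> parallel(List<String> partitions) {
        return partitions.parallelStream()
            .collect(Collectors.toConcurrentMap(Function.identity(), SmallFileLookup::smallFilesIn));
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2020/03/01", "2020/03/02");
        // Both versions produce the same mapping; only the wall-clock
        // cost differs when the per-partition scan is slow.
        System.out.println(parallel(partitions).get("2020/03/01"));
    }
}
```

With Spark, the same fan-out would look roughly like `jsc.parallelize(partitions).map(p -> getSmallFiles(p)).collect()`, so the filesystem listings run on executors while the driver only gathers the results.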

This message was sent by Atlassian Jira
