hudi-commits mailing list archives

From "Feichi Feng (Jira)" <>
Subject [jira] [Created] (HUDI-724) Parallelize GetSmallFiles For Partitions
Date Thu, 19 Mar 2020 23:32:00 GMT
Feichi Feng created HUDI-724:

             Summary: Parallelize GetSmallFiles For Partitions
                 Key: HUDI-724
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Performance, Writer Core
            Reporter: Feichi Feng
         Attachments: gap.png, nogapAfterImprovement.png

When writing data, a gap was observed between Spark stages. Tracking down where the time
was spent on the Spark driver showed that it went to the get-small-files operation for partitions.

When the UpsertPartitioner is created and tries to assign insert records, it uses a plain
for-loop to get the list of small files for every partition that the load is going to write
data to, and this is very slow when there are a lot of partitions to go through. While
the operation runs on the Spark driver process, all the worker nodes sit idle
waiting for tasks.

The partitions don't affect each other, so the get-small-files operations
can be parallelized. The change I made is to pass the JavaSparkContext to the UpsertPartitioner,
create an RDD for the partitions, and eventually send the get-small-files operations to multiple
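
To illustrate the idea, here is a minimal sketch, not the actual Hudi patch. It contrasts a
sequential per-partition loop with a parallel version of the same lookup. To stay runnable
without a cluster it uses Java parallel streams as a local stand-in for a Spark RDD; the class
SmallFilesSketch and the method getSmallFiles are hypothetical names, and the lookup body is
a placeholder rather than Hudi's real file listing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SmallFilesSketch {

    // Hypothetical placeholder for the per-partition lookup. A real implementation
    // would list the partition's files and keep those under a size threshold.
    static List<String> getSmallFiles(String partitionPath) {
        return Arrays.asList(partitionPath + "/file-a", partitionPath + "/file-b");
    }

    // Sequential version: one partition after another on a single thread,
    // comparable to a plain for-loop running on the driver.
    static Map<String, List<String>> sequential(List<String> partitionPaths) {
        Map<String, List<String>> result = new HashMap<>();
        for (String partitionPath : partitionPaths) {
            result.put(partitionPath, getSmallFiles(partitionPath));
        }
        return result;
    }

    // Parallel version: the lookups are independent, so they can run concurrently.
    // Assumption: with Spark, the same mapping would be applied to an RDD built
    // from the partition paths via the JavaSparkContext, then collected as a map.
    static Map<String, List<String>> parallel(List<String> partitionPaths) {
        return partitionPaths.parallelStream()
                .distinct() // toMap rejects duplicate keys
                .collect(Collectors.toMap(Function.identity(),
                                          SmallFilesSketch::getSmallFiles));
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2020/03/17", "2020/03/18", "2020/03/19");
        System.out.println(parallel(partitions).equals(sequential(partitions)));
    }
}
```

Both versions return the same map from partition path to small files; only the scheduling of
the independent lookups differs, which is why the work can be moved off a single driver loop.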

This message was sent by Atlassian Jira
