hudi-commits mailing list archives

From "Balaji Varadarajan (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions
Date Fri, 20 Mar 2020 15:29:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063454#comment-17063454 ]

Balaji Varadarajan commented on HUDI-724:
-----------------------------------------

[~uditme]: The PR looks good to me too. Regarding the speedup with the timeline server, the
cache loading (file listing) does support concurrency. Can you provide the stage times with the
embedded server turned on? My belief is that you should see reduced time taken in getSmallFiles(),
as the cache would have been populated during the bloom index lookup, and the bloom index lookup
calls are also parallelized. So we need to understand why you are not seeing considerable improvement.
If we can have the cache loading time optimized and caching enabled, it would avoid the redundant
listing calls made during the upsert call.

> Parallelize GetSmallFiles For Partitions
> ----------------------------------------
>
>                 Key: HUDI-724
>                 URL: https://issues.apache.org/jira/browse/HUDI-724
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Feichi Feng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>          Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between Spark stages. Tracking down where the
> time was spent on the Spark driver showed that it was the get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it uses a plain
> for-loop to get the list of small files for every partition that the load is going to write
> data to, and the process is very slow when there are a lot of partitions to go through. While
> the operation is running on the Spark driver process, all the worker nodes sit idle
> waiting for tasks.
> The partitions don't affect each other, so the get-small-files operations
> can be parallelized. The change I made is to pass the JavaSparkContext to the UpsertPartitioner,
> create an RDD for the partitions, and eventually send the get-small-files operations to multiple
> tasks.
>
> Screenshots attached for:
> - the gap without the improvement
> - the Spark stage with the improvement (no gap)
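
The change described in the issue above (replacing a driver-side for-loop with independent per-partition lookups) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: it uses a plain Java parallel stream as a stand-in for the Spark RDD mentioned in the issue, so the work is spread over local threads rather than executors. The class name, the method names and the lookup function are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: these names are hypothetical and are not Hudi APIs.
final class SmallFilesSketch {

    // Driver-side loop, as described in the issue: one partition at a time.
    static Map<String, List<String>> sequential(
            List<String> partitionPaths, Function<String, List<String>> getSmallFiles) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String partitionPath : partitionPaths) {
            result.put(partitionPath, getSmallFiles.apply(partitionPath));
        }
        return result;
    }

    // The lookups are independent, so they can be fanned out. A parallel stream
    // is used here as a local stand-in; the issue does the fan-out with an RDD.
    static Map<String, List<String>> parallel(
            List<String> partitionPaths, Function<String, List<String>> getSmallFiles) {
        return partitionPaths.parallelStream()
                .distinct() // avoid duplicate-key errors when a path repeats
                .collect(Collectors.toConcurrentMap(Function.identity(), getSmallFiles));
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2020/03/18", "2020/03/19", "2020/03/20");
        // Stand-in for the real file listing: pretend each partition has one small file.
        Function<String, List<String>> lookup =
                p -> Collections.singletonList(p + "/small-file.parquet");
        // Both variants should produce the same partition-to-files mapping.
        System.out.println(parallel(partitions, lookup).equals(sequential(partitions, lookup)));
    }
}
```

The sketch only works because each lookup depends on nothing but its own partition path, which is the property the issue relies on.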



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
