hudi-commits mailing list archives

From "Yanjia Gary Li (Jira)" <j...@apache.org>
Subject [jira] [Closed] (HUDI-315) Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
Date Fri, 28 Feb 2020 01:59:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanjia Gary Li closed HUDI-315.
-------------------------------
    Resolution: Won't Fix

> Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-315
>                 URL: https://issues.apache.org/jira/browse/HUDI-315
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Yanjia Gary Li
>            Priority: Major
>
> https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1

> In Hudi, there are two places where we need to obtain statistics on the input data:
> - HoodieBloomIndex: to know which partitions need to be loaded and checked against
> (whether this is still needed with the timeline server enabled is a separate question)
> - Workload profile: to get a sense of the number of updates and inserts to each partition/file group
> Both of these issue their own groupBy or shuffle computation today. This can be avoided
> using an accumulator.
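
As an illustration of the idea, here is a minimal sketch (not Hudi's actual
code) of a Spark 2.x AccumulatorV2 that tallies per-partition counts as a side
effect of an action the write pipeline already runs, so no dedicated groupBy
or shuffle is needed. The PartitionStatsAccumulator name, the `records` RDD,
and its `partitionPath` field are illustrative assumptions, not Hudi APIs.

    import scala.collection.mutable
    import org.apache.spark.util.AccumulatorV2

    // Accumulates (partitionPath, count) pairs into per-partition totals.
    class PartitionStatsAccumulator
        extends AccumulatorV2[(String, Long), Map[String, Long]] {
      private val counts = mutable.Map.empty[String, Long]

      override def isZero: Boolean = counts.isEmpty

      override def copy(): PartitionStatsAccumulator = {
        val acc = new PartitionStatsAccumulator
        acc.counts ++= counts
        acc
      }

      override def reset(): Unit = counts.clear()

      // Called on executors, once per record.
      override def add(v: (String, Long)): Unit =
        counts(v._1) = counts.getOrElse(v._1, 0L) + v._2

      // Called on the driver to fold in each task's partial result.
      override def merge(
          other: AccumulatorV2[(String, Long), Map[String, Long]]): Unit =
        other.value.foreach { case (k, v) =>
          counts(k) = counts.getOrElse(k, 0L) + v
        }

      override def value: Map[String, Long] = counts.toMap
    }

    // Usage: register the accumulator, then piggyback on an existing action
    // (spark and records are assumed to be in scope).
    val acc = new PartitionStatsAccumulator
    spark.sparkContext.register(acc, "partitionStats")
    records.foreach { r => acc.add((r.partitionPath, 1L)) }
    val statsPerPartition: Map[String, Long] = acc.value

Note that Spark applies accumulator updates exactly once per action, so the
counts are reliable when collected inside foreach as above; updates made
inside transformations could double-count on task retries.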



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
