Date: Sat, 21 Mar 2020 01:07:00 +0000 (UTC)
From: "Feichi Feng (Jira)" <jira@apache.org>
To: commits@hudi.apache.org
Message-ID: <JIRA.13292800.1584660711000.102852.1584752820391@Atlassian.JIRA>
In-Reply-To: <JIRA.13292800.1584660711000@Atlassian.JIRA>
References: <JIRA.13292800.1584660711000@Atlassian.JIRA> <JIRA.13292800.1584660711363@jira-he-de>
Subject: [jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions


    [ https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063697#comment-17063697 ]

Feichi Feng commented on HUDI-724:
----------------------------------

Hi [~vbalaji], the number of partitions touched depends on our data. In the "nogapAfterImprovement" screenshot I attached, it looks like >= 45 partitions were touched (the parallelism was hard-coded at that time). We do have a lot of files per partition (I randomly checked one and it has 6000+ files, probably because the record size was not set correctly).

> Parallelize GetSmallFiles For Partitions
> ----------------------------------------
>
>                 Key: HUDI-724
>                 URL: https://issues.apache.org/jira/browse/HUDI-724
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Feichi Feng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>          Time Spent: 0.5h
>  Remaining Estimate: 47.5h
>
> When writing data, a gap was observed between Spark stages. Tracking down where the time was spent on the Spark driver showed that it was in the get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it uses a plain for-loop to get the list of small files for all partitions that the load is going to write data to, and the process is very slow when there are a lot of partitions to go through. While the operation is running on the Spark driver process, all the worker nodes are sitting idle waiting for tasks.
> These partitions don't affect each other, so the get-small-files operations can be parallelized. The change I made is to pass the JavaSparkContext to the UpsertPartitioner, create an RDD for the partitions, and eventually send the get-small-files operations to multiple tasks.
>
> Screenshots attached for:
> the gap without the improvement
> the Spark stage with the improvement (no gap)
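
For readers who want a concrete picture of the idea in the description above, here is a minimal sketch. It is not the actual patch. The description says the change creates an RDD from the JavaSparkContext; to stay self-contained, this sketch swaps in a plain Java parallel stream instead. The class name SmallFilesLookup, the method names, and the fake lookup function are all hypothetical and exist only for illustration. It assumes each per-partition lookup is independent of the others and returns a non-null list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only: not Hudi code, and not the Spark-based change itself.
public class SmallFilesLookup {

    // Baseline: one lookup after another on a single thread,
    // analogous to the driver-side for-loop described in the issue.
    static Map<String, List<String>> sequential(
            List<String> partitions,
            Function<String, List<String>> getSmallFiles) {
        Map<String, List<String>> result = new HashMap<>();
        for (String partition : partitions) {
            result.put(partition, getSmallFiles.apply(partition));
        }
        return result;
    }

    // Parallel variant: the lookups are independent, so they can run concurrently.
    // A parallel stream stands in here for distributing the work as Spark tasks.
    static Map<String, List<String>> parallel(
            List<String> partitions,
            Function<String, List<String>> getSmallFiles) {
        return partitions.parallelStream()
                .distinct() // toConcurrentMap rejects duplicate keys
                .collect(Collectors.toConcurrentMap(p -> p, getSmallFiles));
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2020/03/19", "2020/03/20", "2020/03/21");
        // Fake lookup standing in for the real, slow per-partition file listing.
        Function<String, List<String>> fakeLookup =
                p -> Collections.singletonList(p + "/small-file-0.parquet");
        // Both strategies should produce the same partition-to-files mapping.
        System.out.println(sequential(partitions, fakeLookup)
                .equals(parallel(partitions, fakeLookup)));
    }
}
```

Both methods return the same mapping; only the scheduling differs. In the Spark version described above, the work would additionally move off the driver and onto the worker nodes that were otherwise sitting idle.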

