From: Wei-Chiu Chuang
Date: Fri, 3 Jan 2020 22:01:10 +0800
Subject: Re: Reminder: Hadoop Storage Online Meetup tomorrow (Hadoop 2->3 upgrade)
To: Hadoop Common, Hdfs-dev

Hi,

It was a well-attended session, with more than 40 attendees joining! Thanks to Fei Hui for giving us such a great talk. Here's the summary for your reference:
https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit?usp=sharing

01/02/2020

Didi talked about their large-scale HDFS cluster upgrade experience.
Slides: https://drive.google.com/open?id=1iwJ1asalYfgnOCBuE-RfeG-NpSocjIcy

Didi studied the two upgrade approaches in the community documentation, express upgrade and rolling upgrade, and selected rolling upgrade. The upgrade covered the HDFS server side only; clients stay on Hadoop 2.7 because applications such as Hive and Spark do not support Hadoop 3 yet. ZooKeeper was not upgraded. Didi practiced the upgrade + downgrade more than 10 times before doing it for real.

Didi's largest cluster has 5 federated namespaces and 10+ thousand nodes. The upgrade took a month: the JournalNodes took one week, the NameNodes two weeks, and the DataNodes one week.

During an upgrade, HDFS does not clean up trash. Because the upgrade window was a month long, trash became a concern: it could exhaust all available space. Didi has a (script?) to clean trash daily (a sketch of such a job appears below, after the configuration fragment).

A problem was encountered which may not be related: clients were occasionally unable to close files. Solution: they reviewed the DataNode logs and found that blocks were not being reported in time, because deleting blocks took too long. Two parameters were changed to address the issue: increase dfs.client.block.write.locateFollowingBlock.retries, and reduce dfs.block.invalidate.limit (from the default 1000 to 500); an illustrative config fragment follows below. Didi believes the new upstream change HDFS-14997 can alleviate this issue.
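As a purely illustrative aid (not from the talk), here is what that change could look like in hdfs-site.xml. Only the invalidate limit of 500 comes from the summary above; the retry value shown is a made-up example (the stock default is 5). Note that the retry setting is read by the HDFS client, while the invalidate limit is read by the NameNode:

    <!-- Hypothetical hdfs-site.xml fragment; only the value 500 comes from the talk. -->
    <property>
      <!-- Client side: retries used when allocating the next block and when
           completing (closing) a file whose last block is not yet fully
           reported. Default is 5; the value 10 here is illustrative. -->
      <name>dfs.client.block.write.locateFollowingBlock.retries</name>
      <value>10</value>
    </property>
    <property>
      <!-- NameNode side: maximum number of blocks sent to a DataNode for
           deletion per heartbeat. Smaller batches keep block deletion from
           delaying block reports. -->
      <name>dfs.block.invalidate.limit</name>
      <value>500</value>
    </property>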
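On the trash cleanup: the summary does not say what Didi's daily script actually does, nor whether "trash" here means the user .Trash directories or the block trash that DataNodes keep for rollback during a rolling upgrade. Purely as a sketch of the user-trash interpretation, a daily cron entry could force trash checkpoint expiry with the standard shell (in a federated cluster it would need to run once per namespace, and -expunge only affects the calling user's trash):

    # Hypothetical cron entry, runs at 03:00 daily as the relevant HDFS user.
    # `hdfs dfs -expunge` deletes trash checkpoints older than fs.trash.interval
    # and checkpoints the current trash, bounding how much space trash can hold.
    0 3 * * * hdfs dfs -expunge >> /var/log/hdfs-trash-expunge.log 2>&1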
Timeline:
May 2019: verified that the plan was sound.
July: trial run on a 100-node cluster; rolling upgrade completed successfully.
Oct: 300+ node cluster rolling upgrade completed.
Nov: 10-thousand-node cluster rolling upgrade completed.

Offline testing:
Ran the full Spark, Hive and Hadoop test sets, and verified that the upgrade/downgrade has no impact.
Reviewed the 4000+ patches between Hadoop 2.7 and 3.2 to make sure there are no incompatible changes.
Authored 40+ internal wikis to document the process.

Future:
Didi is interested in Ozone to address the small-file problem.
They want to incorporate the Consistent Read from Standby feature to increase NameNode RPC performance.
Finally, DataNode upgrade is hard; they will look into HDFS Maintenance Mode to make it easier in the future.

This was HDFS-only upgrade work; a YARN upgrade is planned for the second half of 2020. Since the main purpose was to use EC (erasure coding) to reduce space usage, Didi ported the EC client-side code to its Hadoop 2.7 clients, and those clients can read/write EC blocks!

On Wed, Jan 1, 2020 at 7:42 PM Wei-Chiu Chuang wrote:

> Hi,
> This is a gentle reminder for tomorrow's online meetup. Fei Hui from DiDi
> is going to give a presentation about DiDi's Hadoop 2 -> Hadoop 3 upgrade
> experience.
>
> We will extend this session to 1 hour. Fei will speak in Mandarin and I
> will help translate, so non-Mandarin speakers feel free to join!
>
> Time/Date:
> Jan 1 10pm (US west coast, PST) / Jan 2 2pm (Beijing, China, CST) /
> Jan 2 11:30am (India, IST) / Jan 2 3pm (Tokyo, Japan, JST)
>
> Join Zoom Meeting
> https://cloudera.zoom.us/j/880548968
>
> One tap mobile
> +16465588656,,880548968# US (New York)
> +17207072699,,880548968# US
>
> Dial by your location
> +1 646 558 8656 US (New York)
> +1 720 707 2699 US
> 877 853 5257 US Toll-free
> 888 475 4499 US Toll-free
>
> Meeting ID: 880 548 968
> Find your local number: https://zoom.us/u/acaGRDfMVl