Date: Mon, 8 Jul 2019 19:51:00 +0000 (UTC)
From: "Daryn Sharp (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.13242132.1561715566000.603426.1562615460126@Atlassian.JIRA>
In-Reply-To: <JIRA.13242132.1561715566000@Atlassian.JIRA>
References: <JIRA.13242132.1561715566000@Atlassian.JIRA> <JIRA.13242132.1561715566050@jira-lw-us.apache.org>
Subject: [jira] [Commented] (HDFS-14617) Improve fsimage load time by
 writing sub-sections to the fsimage index
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880666#comment-16880666 ]

Daryn Sharp commented on HDFS-14617:
------------------------------------

Was asked to take a look at this. I think this can be done with no image format incompatibility and minor changes.

How about the image-reading thread just adds the inodes to a queue for a thread pool to process? Perhaps a single thread consuming the queue will be sufficient, since it will avoid synchronization overhead.
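
The queue idea above could look roughly like the following. This is only a minimal sketch under stated assumptions, not NameNode code: the class name QueuedInodeLoader, the load method, the END_OF_SECTION sentinel, and the use of plain inode ids in place of parsed inode records are all hypothetical. The reading thread only enqueues, and a single consumer thread is the only one that touches the in-memory structure, so that structure needs no locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueuedInodeLoader {
    // Hypothetical sentinel telling the consumer that the section is finished.
    private static final long END_OF_SECTION = -1L;

    // The calling thread plays the image reader: it only enqueues inode ids.
    // A single consumer thread applies them, so the list needs no locking.
    static List<Long> load(long[] inodeIds, int queueCapacity) {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(queueCapacity);
        List<Long> namespace = new ArrayList<>(); // written only by the consumer

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    long id = queue.take();
                    if (id == END_OF_SECTION) {
                        return;
                    }
                    namespace.add(id); // stand-in for adding the inode to the namespace
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "inode-consumer");
        consumer.start();

        try {
            for (long id : inodeIds) {
                queue.put(id); // blocks when the queue is full (back-pressure)
            }
            queue.put(END_OF_SECTION);
            consumer.join(); // join() makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return namespace;
    }

    public static void main(String[] args) {
        System.out.println(load(new long[] {1, 2, 3, 4, 5}, 2));
    }
}
```

Because there is one producer and one consumer on a FIFO queue, inodes are applied in the order they were read, and the bounded queue keeps the reader from running far ahead of the consumer.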

> Improve fsimage load time by writing sub-sections to the fsimage index
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14617
>                 URL: https://issues.apache.org/jira/browse/HDFS-14617
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-14617.001.patch
>
>
> Loading an fsimage is basically a single-threaded process. The current fsimage is written out in sections, e.g. iNode, iNode_Directory, Snapshots, Snapshot_Diff etc. Then at the end of the file, an index is written that contains the offset and length of each section. The image loader code uses this index to initialize an input stream to read and process each section. It is important that one section is fully loaded before another is started, as the next section depends on the results of the previous one.
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the index. That way, a given section would effectively be split into several sections, e.g.:
> {code:java}
>    inode_section offset 10 length 1000
>      inode_sub_section offset 10 length 500
>      inode_sub_section offset 510 length 500
>
>    inode_dir_section offset 1010 length 1000
>      inode_dir_sub_section offset 1010 length 500
>      inode_dir_sub_section offset 1510 length 500
> {code}
> Here you can see we still have the original section index, but then we also have sub-section entries that cover the entire section. Then a processor can either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections, and then based on the total inodes in memory, it will create that many sub-sections per major image section. I think the only sections worth doing this for are inode, inode_reference, inode_dir and snapshot_diff. All others tend to be fairly small in practice.
> 3. If there are fewer than some threshold of inodes (e.g. 10M) then don't bother with the sub-sections, as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading' and a 'number of threads' where it uses the sub-sections, or if not enabled falls back to the existing logic to read the entire section in serial.
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of concept of this change working, allowing just inode and inode_dir to be loaded in parallel, but I believe inode_reference and snapshot_diff can be made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads   1     2     3     4
> --------------------------------
> inodes    448   290   226   189
> inode_dir 326   211   170   161
> Total     927   651   535   488 (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve the load time of the two sections. With the patch in HDFS-13694 it would take a further 100 seconds off the run time, going from 927 seconds to 388, which is a significant improvement. Adding more threads beyond 4 has diminishing returns as there are some synchronized points in the loading code to protect the in-memory structures.
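
For illustration, the sub-section index entries described in the quoted proposal can be derived from a section's offset and length as in the sketch below. This is an assumption-laden example, not the attached patch: SubSectionIndex and split are hypothetical names, and it divides the section by bytes, whereas the proposal sizes sub-sections from the inode count and a real writer would have to cut at record boundaries.

```java
public class SubSectionIndex {
    // Splits one section into `target` contiguous sub-sections that together
    // cover the whole section. Each result row is {offset, length}.
    static long[][] split(long sectionOffset, long sectionLength, int target) {
        if (target < 1 || sectionLength < 0) {
            throw new IllegalArgumentException("target must be >= 1 and length >= 0");
        }
        long[][] subSections = new long[target][2];
        long base = sectionLength / target;
        long remainder = sectionLength % target; // spread over the first entries
        long offset = sectionOffset;
        for (int i = 0; i < target; i++) {
            long length = base + (i < remainder ? 1 : 0);
            subSections[i][0] = offset;
            subSections[i][1] = length;
            offset += length; // the next sub-section starts where this one ends
        }
        return subSections;
    }

    public static void main(String[] args) {
        // inode_dir_section from the quoted example: offset 1010, length 1000
        for (long[] s : split(1010, 1000, 2)) {
            System.out.println("inode_dir_sub_section offset " + s[0] + " length " + s[1]);
        }
        // prints:
        // inode_dir_sub_section offset 1010 length 500
        // inode_dir_sub_section offset 1510 length 500
    }
}
```

Each sub-section begins where the previous one ends, so a loader can hand one entry to each thread and still account for every byte of the parent section.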

