Date: Thu, 22 Oct 2015 00:52:27 +0000 (UTC)
From: "Yi Liu (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968277#comment-14968277 ]

Yi Liu edited comment on HDFS-9053 at 10/22/15 12:51 AM:
---------------------------------------------------------

Thanks [~jingzhao]!

{quote}
My concern is if the number of directories is limited, then maybe increasing 44 bytes per directory is fine. E.g., for a big cluster with >100M files/directories, if we only have 1M directories, then we only increase the heap size by 44 MB. Compared with the total heap size this may be just <0.1% increase.
{quote}

The 44-byte increase per directory was the old implementation. In the latest patch, each directory grows by only *8* bytes, so the per-directory cost is quite small. The new approach also uses the B-Tree only; it does NOT switch between ArrayList and B-Tree. Please see the {{007}} patch.

I know some users have more than 500M files, and some directories contain very large numbers of files. Such large directories are usually the most frequently accessed, so, like the shortest stave in the "barrel principle", they become the bottleneck of the NameNode.
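To make the 8-byte figure concrete: with the B-Tree-only design, a small directory's tree degenerates to a single root node wrapping a sorted array, so lookups stay a plain binary search and the extra cost versus a bare ArrayList is roughly one extra reference. The sketch below is a hypothetical illustration of that degenerate case (all names are invented, split/merge logic is omitted), not code from the attached patches:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical sketch of the "B-Tree only" idea: while a directory is
 * small, the whole tree is one root node wrapping a sorted array, so
 * the overhead versus a bare sorted ArrayList is about one extra
 * reference per directory. Real split/merge logic is omitted here.
 */
public class TinyBTreeRoot<E extends Comparable<E>> {
  // For a small directory, this sorted list is the entire "tree".
  private final List<E> elements = new ArrayList<>();

  public boolean insert(E e) {
    int idx = Collections.binarySearch(elements, e);
    if (idx >= 0) {
      return false;            // duplicate child name
    }
    elements.add(-idx - 1, e); // keep the array sorted
    return true;
  }

  public E search(E key) {
    int idx = Collections.binarySearch(elements, key);
    return idx >= 0 ? elements.get(idx) : null;
  }

  public static void main(String[] args) {
    TinyBTreeRoot<String> dir = new TinyBTreeRoot<>();
    dir.insert("b");
    dir.insert("a");
    System.out.println(dir.search("a")); // prints: a
  }
}
{code}

Once the element count exceeds the node's maximum degree, a real B-Tree splits the root, which is what keeps huge directories away from the single multi-MB array problem described in the issue below.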
> Support large directories efficiently using B-Tree
> --------------------------------------------------
>
>                 Key: HDFS-9053
>                 URL: https://issues.apache.org/jira/browse/HDFS-9053
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Critical
>         Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, HDFS-9053.007.patch
>
>
> This is a long-standing issue that we have tried to improve in the past. Currently we use an ArrayList for the children under a directory, and the children are kept ordered in the list. Search is O(log n), but insertion/deletion is O(n), because it forces re-allocations and copies of the backing array, so for a large directory these operations are expensive. If the children grow to 1M entries, the ArrayList resizes to >1M capacity and thus needs >1M * 8 bytes = 8 MB of contiguous heap memory (a reference is 8 bytes on a 64-bit system/JVM); this easily causes full GC in an HDFS cluster where the NameNode heap is already heavily used. To recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because of re-allocations and copies of big arrays.
> # Dynamically allocating several MB of contiguous heap memory that will be long-lived can easily cause full GC problems.
> # Even if most children are removed later, the directory INode still occupies the same amount of heap memory, since the ArrayList never shrinks.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but uses a B-Tree to solve the problem, as suggested by [~shv].
> So the target of this JIRA is to implement a low-memory-footprint B-Tree and use it to replace the ArrayList.
> If the number of elements is not large (less than the maximum degree of a B-Tree node), the B-Tree has only one root node, which contains an array for the elements. If the size grows large enough, the root splits automatically, and if elements are removed, B-Tree nodes can merge automatically (see more: https://en.wikipedia.org/wiki/B-tree). This solves the above 3 issues.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
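To make issues 1 and 3 above concrete, here is a small self-contained benchmark sketch (hypothetical illustration, not one of the attached patches) showing the O(n) arraycopy cost of keeping a large sorted ArrayList, and that its capacity does not shrink after removals:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedArrayListCost {
  public static void main(String[] args) {
    List<Integer> children = new ArrayList<>();
    int n = 100_000; // total cost grows quadratically with n

    long start = System.nanoTime();
    // Worst case for a sorted list: every new child sorts first, so
    // each insert shifts all existing entries with an internal arraycopy.
    for (int i = n; i > 0; i--) {
      int idx = Collections.binarySearch(children, i);
      children.add(-idx - 1, i); // O(n) shift per insert
    }
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println(n + " sorted inserts took " + elapsedMs + " ms");

    // Issue 3: clearing children does not give back the grown backing
    // array; an ArrayList keeps its capacity until trimToSize() is
    // called explicitly, so the directory INode stays just as big.
    children.subList(0, n - 10).clear();
    System.out.println("size after removals: " + children.size());
  }
}
{code}

A B-Tree with a bounded node degree avoids both effects: inserts touch only one small node array, and removing elements lets nodes merge and be garbage-collected.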