hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tsz Wo Nicholas Sze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
Date Fri, 09 Oct 2015 01:51:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949749#comment-14949749

Tsz Wo Nicholas Sze commented on HDFS-9053:

Hi [~hitliuyi],

44 bytes overhead is not small especially when most of the directories fall in this case.
 Just like that we did very hard but only able to save 40 bytes per replica in HDFS-8859.

I agree that the current BTree implementation has an advantage that it can shrink.  We may
also implement our a shrinkable array list.  The shrinkable feature does not seem a big deal
since this is not a common case.

> ... we may need write a class to wrap the two data structures, ...

Wrapping is not needed.  The field in INodeDirectory is List<INode> children which may
refer to either an ArrayList or a BTreeList.  We may replace the list at runtime.

For the new B-Tree implementation.  I am also worry about the potiental bugs and the risk.
 If there is a bug in B-Tree, it is possible to lose one or more sub trees and, as a result,
lose a lot of data.  ArrayList is already well-tested.  Anyway, we need more tests for the
B-Tree, especially some long running random tests.

> Support large directories efficiently using B-Tree
> --------------------------------------------------
>                 Key: HDFS-9053
>                 URL: https://issues.apache.org/jira/browse/HDFS-9053
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Critical
>         Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 (BTree).patch,
HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, HDFS-9053.004.patch
> This is a long standing issue, we were trying to improve this in the past.  Currently
we use an ArrayList for the children under a directory, and the children are ordered in the
list, for insert/delete, the time complexity is O\(n), (the search is O(log n), but insertion/deleting
causes re-allocations and copies of arrays), for large directory, the operations are expensive.
 If the children grow to 1M size, the ArrayList will resize to > 1M capacity, so need >
1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) continuous heap memory,
it easily causes full GC in HDFS cluster where namenode heap memory is already highly used.
 I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because re-allocations
and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be long-lived can
easily cause full GC problem.
> # Even most children are removed later, but the directory INode still occupies same size
heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to solve the problem
suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and use it to
replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree node), the
B-Tree only has one root node which contains an array for the elements. And if the size grows
large enough, it will split automatically, and if elements are removed, then B-Tree nodes
can merge automatically (see more: https://en.wikipedia.org/wiki/B-tree).  It will solve the
above 3 issues.

This message was sent by Atlassian JIRA

View raw message