hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yi Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
Date Tue, 29 Sep 2015 04:17:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598

Yi Liu commented on HDFS-9053:

Thanks Jing for the further review. 
Yeah, sorry for the delay.
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, [~szetszwo]! also
welcome [~kihwal] and anyone who can review this.

For insert/delete, the running time is O( n) for ArrayList.
Yeah, if we count the re-allocations and copies of big arrays, I just mean the search time
complexity before we do insert/delete.

How about memory usage? One reason to use ArrayList instead of a more complicated data structure
is to save memory. It seems to me that using B-Tree increases memory usage in general. It
increases memory usage dramatically in some worst cases such as the average branching factor
of the tree is small (this may also be a common case).
That's a great question here.  I have given few description in the my second comment above:
I would give a detailed explanation:
Generally say, the B-Tree in the patch increases *very few* memory usage, and can be ignored:
# *For normal and small size directories*, let's say it's (INode) children size is not large
and then B-Tree only contains one node. Currently the default degree of B-Tree used in INodeDirectory
is 1024, so the max degree is 2047, that means one B-Tree node can at most contain 2047 elements.
And the children of the B-Tree root node is null, so the increased memory usage compared to
ArrayList is: one object overhead + few variables, total increase about (12+4+4+4+4+4) = 32
# *For large directories*, let's say it's (INode) children size is larger than the max degree
and then B-Tree contains more than one nodes. For any B-Tree node except root, it will contains
at least min degree size elements, in our case, it's 1023, furthermore, we expand elements
of B-Tree node same as ArrayList, so it's the same as ArrayList at this point, typically the
worst case for B-Tree from memory point view is all elements are in-order, in this case, when
we split a B-Tree node, we allocate the elements size of memory and do expanding, but no more
elements are added later in the split left B-Tree node, and there is a waste of 1/3 size of
elements array memory, but actually it's almost the same as the worst cases even for ArrayList
if there is no/few elements added, but in practice, the elements are not in order.  Now back
to the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: one object
overhead + one element pointer in the father + 2 variables, so total increase about (12 +
4+4+4) = 24 bytes for every new added B-Tree node, but one B-Tree node contains at least (degree
- 1) elements, and at most (2* degree - 1) elements.  Another small increased memory is the
object overhead of the children in the B-Tree inner node.

In conclusion, the increased memory used for B-Tree compared to ArrayList is very small, for
small/normal directories, it's only 32 bytes overhead which is even less than memory of one
block in NN, for large directories, besides additional increased 12 bytes memory for the variables
in B-Tree , it will increase about 24 bytes memory for every new added B-Tree node which can
contain 1023 ~ 2047 INodes in the directory in our case.  We can ignore these few memory overhead
for a directory. 

And last, the benefits of B-Tree is great and obvious as described in the JIRA description,
also in the patch, I consider the memory overhead carefully while implementing the B-Tree.

Please let me know if you have further questions. Thanks a lot.

> Support large directories efficiently using B-Tree
> --------------------------------------------------
>                 Key: HDFS-9053
>                 URL: https://issues.apache.org/jira/browse/HDFS-9053
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Critical
>         Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 (BTree).patch,
HDFS-9053.001.patch, HDFS-9053.002.patch
> This is a long standing issue, we were trying to improve this in the past.  Currently
we use an ArrayList for the children under a directory, and the children are ordered in the
list, for insert/delete/search, the time complexity is O(log n), but insertion/deleting causes
re-allocations and copies of big arrays, so the operations are costly.  For example, if the
children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > 1M *
4bytes = 4M continuous heap memory, it easily causes full GC in HDFS cluster where namenode
heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because re-allocations
and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be long-lived can
easily cause full GC problem.
> # Even most children are removed later, but the directory INode still occupies same size
heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to solve the problem
suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and use it to
replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree node), the
B-Tree only has one root node which contains an array for the elements. And if the size grows
large enough, it will split automatically, and if elements are removed, then B-Tree nodes
can merge automatically (see more: https://en.wikipedia.org/wiki/B-tree).  It will solve the
above 3 issues.

This message was sent by Atlassian JIRA

View raw message