hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6658) Namenode memory optimization - Block replicas list
Date Thu, 19 Mar 2015 01:47:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368324#comment-14368324 ]

Colin Patrick McCabe commented on HDFS-6658:

I didn't get a chance to look at the patch, but [~clamb] and I read the design doc in detail.

If I understand this correctly, the goals of this patch are better GC behavior and slightly
smaller memory consumption for large heaps.  The patch accomplishes this by using the most
memory-efficient structures Java has to offer: primitive arrays.  For smaller heaps (<32 GB),
though, the patch is actually a small regression in heap memory size.
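To make the trade-off concrete, here is a minimal sketch (hypothetical classes, not the actual HDFS-6658 code): a replica entry stored as an object reference costs 8 bytes once compressed oops is off (heaps above ~32 GB), while a primitive int index into a shared storage table costs 4 bytes regardless of heap size.

```java
// Hypothetical sketch (not the actual HDFS-6658 classes) of the trade-off:
// replica entries held as object references vs. as primitive int indexes into
// a shared storage table. Without compressed oops (heaps > 32 GB), each
// reference costs 8 bytes while an int index costs 4, which is where the
// savings on large heaps come from.
public class ReplicaListSketch {
    /** Approximate bytes per replica entry for each representation. */
    public static int bytesPerEntry(boolean useIntIndexes, boolean compressedOops) {
        if (useIntIndexes) return 4;     // primitive int index
        return compressedOops ? 4 : 8;   // object reference
    }

    /** Primitive-index style: one int per replica slot, -1 meaning "empty". */
    public static int[] newReplicaSlots(int n) {
        int[] slots = new int[n];
        java.util.Arrays.fill(slots, -1);
        return slots;
    }

    public static void main(String[] args) {
        int[] replicas = newReplicaSlots(3);
        replicas[0] = 42; // replica lives in storage-table entry 42
        // On a large heap, int indexes win; with compressed oops they only tie.
        System.out.println(bytesPerEntry(true, false) < bytesPerEntry(false, false));
    }
}
```

With compressed oops enabled, `bytesPerEntry` returns 4 either way, which is why the approach regresses slightly on small heaps: the extra bookkeeping arrays buy nothing there.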

My biggest concern is that I don't see how this data structure could be parallelized.  And
parallelization is an important goal of the block manager scalability work in HDFS-7836. 
In fact, this data structure seems to tie us more and more into the single-threaded world,
since we now have more tightly coupled data structures to keep consistent.

Another concern is complexity and (lack of) safety.  I feel like this approach has all the
complexity of off-heap (manual serialization, bit packing, the possibility of reading the
wrong data if there is a bug) without the positive aspects of off-heap (the ability to grow
and shrink memory consumption seamlessly via malloc, a "safe mode" where all allocations are
checked, the possibility of serializing block data to disk later on).  The off-heap code we
wrote does not use JNI, and has a mode which checks all allocations and accesses.  I don't
see anything similar here (although maybe it could be added?).  The complexity will remain, though.
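As a rough illustration of the kind of "safe mode" meant here (a hypothetical sketch, not the off-heap code referenced above): two 32-bit fields manually packed into one long slot of a flat array, with an optional checked mode that validates every access, so a bad index throws instead of silently decoding another block's data.

```java
// Hypothetical sketch of the manual bit-packing the comment describes: two
// 32-bit fields packed into one long slot of a flat array. In "checked" mode
// every access validates the slot, so a bug surfaces as an exception rather
// than as silently wrong block data.
public class PackedSlab {
    private final long[] slots;
    private final boolean checked;

    public PackedSlab(int n, boolean checked) {
        this.slots = new long[n];
        this.checked = checked;
    }

    private void check(int slot) {
        if (checked && (slot < 0 || slot >= slots.length)) {
            throw new IndexOutOfBoundsException("bad slot " + slot);
        }
    }

    /** Pack two ints into one long slot (manual serialization). */
    public void put(int slot, int hi, int lo) {
        check(slot);
        slots[slot] = ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    public int hi(int slot) { check(slot); return (int) (slots[slot] >>> 32); }
    public int lo(int slot) { check(slot); return (int) slots[slot]; }
}
```

The point of the comparison: the packing complexity is identical on-heap and off-heap, but the patch as described has no equivalent of the checked mode.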

This is a bit of a tangent, but I still think that off-heap is a better solution than large
arrays of primitives.  For one thing, if we move the inodes map and the blocks map off-heap,
we can potentially get the JVM heap below 32 GB and start getting the benefits of pointer
compression again.  Those benefits are not just lower memory usage but also speed, since the
bytecode gets smaller.  For another thing, we basically have to write our own malloc
implementation if we want to go down the "on-heap arrays" road.  All of the malloc
implementations I've seen were either quite complex (jemalloc, tcmalloc), inflexible (various
arena allocators that give out fixed-size chunks of memory, kind of like the one in this
patch), or crude and inefficient (a homemade malloc with a few free lists of blocks of size 2^N).
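To make the last option concrete, a "crude" homemade malloc over a flat array might look roughly like this (a hypothetical sketch, not code from any of the implementations named): requests are rounded up to the next power of two, and freed chunks go onto a free list per size class, wasting everything between classes.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of the "crude and inefficient" option: a homemade
// allocator over one flat int[] heap with a free list per power-of-two size
// class. Requests round up to the next power of two; freed chunks are pushed
// back onto the matching list. The rounding waste is the inefficiency the
// comment refers to.
public class PowerOfTwoAllocator {
    private final int[] heap;
    private int top = 0; // bump pointer for never-yet-allocated chunks
    @SuppressWarnings("unchecked")
    private final ArrayDeque<Integer>[] freeLists = new ArrayDeque[31];

    public PowerOfTwoAllocator(int words) {
        heap = new int[words];
        for (int i = 0; i < freeLists.length; i++) freeLists[i] = new ArrayDeque<>();
    }

    /** ceil(log2(words)) for words >= 2; rounds a request up to its size class. */
    private static int sizeClass(int words) {
        return 32 - Integer.numberOfLeadingZeros(Math.max(1, words - 1));
    }

    /** Returns the heap offset of a chunk of at least 'words' ints. */
    public int alloc(int words) {
        int cls = sizeClass(words);
        Integer reused = freeLists[cls].poll();
        if (reused != null) return reused; // recycle a freed chunk of this class
        int size = 1 << cls;
        if (top + size > heap.length) throw new OutOfMemoryError("slab full");
        int off = top;
        top += size;
        return off;
    }

    public void free(int offset, int words) {
        freeLists[sizeClass(words)].push(offset); // back onto its size-class list
    }
}
```

Even this toy version shows the problem: a 5-word request burns 8 words, and freed memory can only ever be reused at the same size class.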

I think the basic approach might be good, but we need to talk more about how it fits into
other improvements.  I also think this belongs in a branch, possibly HDFS-7836.

> Namenode memory optimization - Block replicas list 
> ---------------------------------------------------
>                 Key: HDFS-6658
>                 URL: https://issues.apache.org/jira/browse/HDFS-6658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: Amir Langer
>            Assignee: Daryn Sharp
>         Attachments: BlockListOptimizationComparison.xlsx, BlocksMap redesign.pdf, HDFS-6658.patch,
HDFS-6658.patch, HDFS-6658.patch, Namenode Memory Optimizations - Block replicas list.docx,
New primative indexes.jpg, Old triplets.jpg
> Part of the memory consumed by every BlockInfo object in the Namenode is a linked list
of block references for every DatanodeStorageInfo (called "triplets"). 
> We propose to change the way we store the list in memory. 
> Using primitive integer indexes instead of object references will reduce the memory needed
for every block replica (when compressed oops is disabled), and in our new design the list
overhead will be per DatanodeStorageInfo rather than per block replica.
> See the attached design doc for details and evaluation results.
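The per-storage list described in the issue can be sketched as follows (a simplified, hypothetical illustration of the idea, not the design doc's actual layout): one head index per DatanodeStorageInfo, with the replica list threaded through a parallel int[] of "next" links instead of per-block object references.

```java
// Hypothetical, simplified sketch of the issue's idea: each
// DatanodeStorageInfo holds one head index, and its replica list is threaded
// through a primitive int[] of "next" links rather than per-block object
// references ("triplets"). Fixed list overhead lives per storage, not per
// block replica.
public class StorageReplicaList {
    static final int NIL = -1;
    private final int[] next; // next[blockSlot] -> following slot on this storage
    private int head = NIL;   // the one per-storage field

    public StorageReplicaList(int maxBlocks) {
        next = new int[maxBlocks];
        java.util.Arrays.fill(next, NIL);
    }

    /** Push a block's slot onto this storage's replica list. */
    public void addReplica(int blockSlot) {
        next[blockSlot] = head;
        head = blockSlot;
    }

    /** Walk the index-threaded list, as a traversal would. */
    public int count() {
        int n = 0;
        for (int b = head; b != NIL; b = next[b]) n++;
        return n;
    }
}
```

Each list entry costs one 4-byte int no matter the heap size, which is the replica-side saving the description claims when compressed oops is disabled.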

This message was sent by Atlassian JIRA
