hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7244) Reduce Namenode memory using Flyweight pattern
Date Mon, 20 Oct 2014 18:44:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177280#comment-14177280
] 

Colin Patrick McCabe commented on HDFS-7244:
--------------------------------------------

It's exciting to see progress on this, [~langera]!

There are a few questions we need to figure out here.  One is fallback... if {{ByteBuffer#allocDirect}}
is not available on the JVM, what do we do?  In my earlier patch, I simply used {{ByteBuffer#alloc}}.
 I still like this approach, but it does mean we can't chase raw pointers when implementing
off-heap data structures.  I was trying to address this by using \{ 32-bit slab ID, 32-bit
slab offset \} tuples instead.  This does require that we do a lookup in a {{map<Long,
Slab>}} whenever we chase a "pointer", though.

Another approach to fallback is to use raw pointers if they're available, and \{ slabID, offset\}
tuples if they're not.  This is faster for the common case of true off-heaping.  The complication
here is that theoretically one {{allocDirect}} calls could fail while another succeeds.  If
we did this, we'd probably want to create a configuration key like {{hadoop.use.off.heap}},
and throw a hard failure whenever this was {{true}} but {{allocDirect}} failed.

What data structures are you planning on using to look up block data in the NN?  I was considering
an off-heap hash map implementation.

If you look at the requirements for our BlocksMap, we need:
* fast lookup of \{ 64-bit blockID, string bpId \} to yield all DNs where this block is replicated
* ability to iterate over all blocks which a DN holds

#1 is not too difficult, but #2 could be tricky.  The obvious solution is just to have a hash
map from \{ blockID, bpID \} to a node structure which is a member of a few implicit linked
lists.  This does mean the node structure has variable size, which could be challenging to
implement (It's basically the {{malloc}} problem).  There isn't any upper limit on the number
of DNs a block can be on.

A better way might be to have a hash map from \{ blockID, bpID, replicaIndex \} so that we
avoid implicit linked lists.  So to find the first replica for BlockID 123 in bpID "foo",
you look up (123, foo, 0)... the second, (123, foo, 1), and so forth.

This also raises a few questions.
* should we create a lookup table for bpids?  We clearly don't want to store the string everywhere,
and we can't use Java string interning when doing off-heap.  A 16-bit or 32-bit lookup table
from string bpid -> bpid index would certainly slim this down.
* similar for DNs... how do we identify them?  The storage ID is too long to be practical.
 The simplest way would be a 64-bit ID where we didn't reuse any indices.  If we have 32-bit
or less DN IDs we'll have to figure out some garbage collection strategy, which could be tricky.

Do you think we'll need a branch for this?  I don't have a feeling yet for how incremental
it is.  Clearly adding the Slab code can be done in trunk without destabilizing anything else.
 I'm not as clear on how difficult the other subtasks are going to be to do in an "incremental"
way.

Do you have some code using the Slab code yet?  It might be hard to know exactly what API
we want for Slab until we see how it works in action.  Of course we can always modify it later,
but posting a combined patch would give me a better feel for it.

> Reduce Namenode memory using Flyweight pattern
> ----------------------------------------------
>
>                 Key: HDFS-7244
>                 URL: https://issues.apache.org/jira/browse/HDFS-7244
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Amir Langer
>
> Using the flyweight pattern can dramatically reduce memory usage in the Namenode. The
pattern also abstracts the actual storage type and allows the decision of whether it is off-heap
or not and what is the serialisation mechanism to be configured per deployment. 
> The idea is to move all BlockInfo data (as a first step) to this storage using the Flyweight
pattern. The cost to doing it will be in higher latency when accessing/modifying a block.
The idea is that this will be offset with a reduction in memory and in the case of off-heap,
a dramatic reduction in memory (effectively, memory used for BlockInfo would reduce to a very
small constant value).
> This reduction will also have an huge impact on the latency as GC pauses will be reduced
considerably and may even end up with better latency results than the original code.
> I wrote a stand-alone project as a proof of concept, to show the pattern, the data structure
we can use and what will be the performance costs of this approach.
> see [Slab|https://github.com/langera/slab]
> and [Slab performance results|https://github.com/langera/slab/wiki/Performance-Results].
> Slab abstracts the storage, gives several storage implementations and implements the
flyweight pattern for the application (Namenode in our case).
> The stages to incorporate Slab into the Namenode is outlined in the sub-tasks JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message