hadoop-hdfs-issues mailing list archives

From "Amir Langer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7244) Reduce Namenode memory using Flyweight pattern
Date Tue, 21 Oct 2014 07:13:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178063#comment-14178063 ]

Amir Langer commented on HDFS-7244:

Thanks for the comments [~cmccabe], all valid points. A few replies below:

h4. Fallback
I wasn't planning on introducing a fallback. Instead, I wanted to make the storage type configurable
(in a similar way to your suggestion). If someone misconfigures and defines a direct byte
buffer storage while using a JVM which does not support it, they'll get an error and the
Namenode will not start.
I believe that's good behaviour in this case, especially if they get a meaningful error
message which points them to the storage type configuration and asks them to change it.
The configuration I have for a storage is off-heap (true/false) and use of Unsafe (true/false).
(Unsafe is much faster than a ByteBuffer, but some people may be reluctant to use it.)
My [Slab|https://github.com/langera/slab] implementation uses a single long for the address
regardless of the storage type.
In fact, if someone uses a really exotic JVM and wants to use a home-grown storage implementation,
they can implement the [interface|https://github.com/langera/slab/blob/master/src/main/java/com/yahoo/slab/SlabStorage.java]
and define their own storage (off-heap or not).
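To make the pluggable-storage idea concrete, here is a minimal sketch of a heap-backed storage behind a SlabStorage-style interface. The method names ({{getInt}}/{{setInt}}/{{capacity}}) and the class name are illustrative assumptions, not the actual interface linked above:

```java
import java.nio.ByteBuffer;

// Illustrative sketch: a heap byte[] behind a SlabStorage-like interface.
// Method names here are assumptions for illustration only; the real
// interface is the one linked in the Slab repository.
public class HeapArrayStorage {
    private final byte[] data;

    HeapArrayStorage(int capacity) {
        data = new byte[capacity];
    }

    // absolute read/write of an int at a byte offset within the storage
    int getInt(int offset) {
        return ByteBuffer.wrap(data).getInt(offset);
    }

    void setInt(int offset, int value) {
        ByteBuffer.wrap(data).putInt(offset, value);
    }

    long capacity() {
        return data.length;
    }

    public static void main(String[] args) {
        HeapArrayStorage storage = new HeapArrayStorage(64);
        storage.setInt(8, 42);
        System.out.println(storage.getInt(8)); // 42
    }
}
```

An off-heap variant would keep the same shape but back it with {{ByteBuffer.allocateDirect}} or Unsafe instead of a heap array, which is the configurability point made above.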

h4. BlocksMap data structures, ids etc.
My intention was to use integer or long ids for every entity in order to have a predictable
size for a block, given its replication factor.
In HDFS-6660 I already added int ids to every DatanodeStorageInfo. 
The plan, as in HDFS-6658, is to have a list of blocks per DNStorageInfo.
The main change is that the list will be of block ids. I've already shown that
the only need for the linked list we have today (implemented by the "triplets") is to mark
visited blocks when processing a block report, and that could be done differently using a BitSet
(see HDFS-6661). This reduces the replication information in every block to an int id (of
the DNStorageInfo) and an int index (of that block in the list of blocks at the DNStorageInfo
- to allow efficient removal).
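The BitSet idea above can be sketched as follows. This is a toy illustration of the HDFS-6661 approach, not the actual Namenode code; the class and method names are made up for the example:

```java
import java.util.BitSet;

// Sketch of the HDFS-6661 idea: instead of a linked list threaded through
// the "triplets", mark blocks visited during a block report with a BitSet
// indexed by each block's position in the per-storage block list.
// Names here are illustrative, not the actual HDFS code.
public class BlockReportSketch {

    public static BitSet processReport(int[] reportedIndices, int totalBlocks) {
        BitSet visited = new BitSet(totalBlocks);
        for (int idx : reportedIndices) {
            visited.set(idx); // the block at this list index was reported
        }
        return visited;
    }

    public static void main(String[] args) {
        // a storage holds 8 blocks; the report mentions blocks at indices 0, 2, 5
        BitSet visited = processReport(new int[] {0, 2, 5}, 8);
        // any index still clear after the report marks a block the report missed
        System.out.println(visited.nextClearBit(0)); // 1
    }
}
```

The BitSet costs one bit per block for the duration of a report, versus the two object references per replica that the triplets-based linked list carries permanently.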

Keeping the list as a property of every DNStorageInfo should give us the needed iteration
requirement you mention.

My idea for the fast lookup requirement is to look up using {replication factor, blockId}.

The idea is to have a slab per replication factor.
This is because of a limitation of the slab: the size (in terms of space requirements) of the stored
object must be fixed and known in advance. (This limitation allows the slab to efficiently
manage the storage and implement compaction.)

A BlockInfo size is determined only by its replication factor.

So lookup will be: locate the slab by replication factor (that could be an index in an array
of slabs) and call {{get(id)}} on that slab.
The downside of this lookup idea is that a change to the replication factor will require us to copy
the data from one slab to another - a much higher cost than today.
We believe that such changes are infrequent and that, overall, implementing it this way
will be justified, although we realise this is a key point the community will need to agree
on if we move forward with this idea.
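A self-contained sketch of that lookup scheme, including the costly replication-change path, might look like this. {{MiniSlab}} is a map-backed stand-in so the example runs on its own (a real slab stores fixed-size records, possibly off-heap); all names are illustrative, not the Slab API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "slab per replication factor" lookup. MiniSlab is a stand-in;
// a real slab stores fixed-size records and supports compaction.
public class SlabLookupSketch {

    // stand-in for a slab of fixed-size block records keyed by long ids
    static final class MiniSlab {
        private final Map<Long, long[]> records = new HashMap<>();
        void put(long blockId, long[] record) { records.put(blockId, record); }
        long[] get(long blockId)              { return records.get(blockId); }
        long[] remove(long blockId)           { return records.remove(blockId); }
    }

    private final MiniSlab[] slabByReplication;

    SlabLookupSketch(int maxReplication) {
        slabByReplication = new MiniSlab[maxReplication + 1];
        for (int r = 0; r <= maxReplication; r++) {
            slabByReplication[r] = new MiniSlab();
        }
    }

    // lookup key is {replication factor, blockId}: index into the slab
    // array by replication factor, then get(id) on that slab
    long[] lookup(int replication, long blockId) {
        return slabByReplication[replication].get(blockId);
    }

    void add(int replication, long blockId, long[] record) {
        slabByReplication[replication].put(blockId, record);
    }

    // the costly case: changing replication copies the record between slabs
    void changeReplication(long blockId, int oldReplication, int newReplication) {
        long[] record = slabByReplication[oldReplication].remove(blockId);
        slabByReplication[newReplication].put(blockId, record);
    }

    public static void main(String[] args) {
        SlabLookupSketch map = new SlabLookupSketch(10);
        map.add(3, 100L, new long[] {100L, 0L, 0L});
        map.changeReplication(100L, 3, 5);
        System.out.println(map.lookup(5, 100L)[0]); // 100
    }
}
```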

I still haven't got my head round the use of blockPoolId in this lookup. You mention the lookup
refers to the blockPoolId, but I can't see it in the BlocksMap implementation on trunk. I assume we
can always use your suggestion of a table for String bpIds, but maybe there's a better idea
out there.

h4. Slab usage example
Is [here|https://github.com/langera/slab/wiki/Example-Code].
It is also used in the [Perf tests|https://github.com/langera/slab/tree/master/src/test/java/com/yahoo/slab/perf]
code I've got in the project.
Think of the {{Bean}} class there as the API of BlockInfo. We need to implement a flyweight
for it and a factory to construct that flyweight.
Both the flyweight and the actual POJO implement the API of BlockInfo (Bean in the perf test
example) and can then be used as such.

The Slab API is very similar to a basic List API, but with long keys rather than int.
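The flyweight access pattern can be sketched compactly: records live as fixed-size entries in one buffer, and a single reusable flyweight is repositioned over them by long key instead of allocating one POJO per block. The names here ({{FlyweightSketch}}, {{moveTo}}, {{RECORD_SIZE}}) are illustrative assumptions, simplified from the linked wiki example rather than the actual Slab API:

```java
import java.nio.ByteBuffer;

// Compact sketch of the flyweight idea: one reusable cursor over a buffer
// of fixed-size records, so no per-record object exists on the heap.
public class FlyweightSketch {
    static final int RECORD_SIZE = 8; // one long per record in this toy
    private final ByteBuffer slab;

    FlyweightSketch(int maxRecords) {
        slab = ByteBuffer.allocate(maxRecords * RECORD_SIZE);
    }

    void put(long key, long value) {
        slab.putLong((int) key * RECORD_SIZE, value);
    }

    // the "flyweight": a movable view over the backing storage
    final class Flyweight {
        private long key;

        Flyweight moveTo(long k) {
            this.key = k;
            return this;
        }

        long value() {
            return slab.getLong((int) key * RECORD_SIZE);
        }
    }

    public static void main(String[] args) {
        FlyweightSketch sketch = new FlyweightSketch(16);
        sketch.put(3, 1234L);
        FlyweightSketch.Flyweight fw = sketch.new Flyweight();
        System.out.println(fw.moveTo(3).value()); // 1234
    }
}
```

Swapping {{ByteBuffer.allocate}} for {{allocateDirect}} (or Unsafe) is what moves the data off-heap without changing the callers, which is the point of abstracting the storage behind the flyweight.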


> Reduce Namenode memory using Flyweight pattern
> ----------------------------------------------
>                 Key: HDFS-7244
>                 URL: https://issues.apache.org/jira/browse/HDFS-7244
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Amir Langer
> Using the flyweight pattern can dramatically reduce memory usage in the Namenode. The
pattern also abstracts the actual storage type and allows the decision of whether it is off-heap
or not and what is the serialisation mechanism to be configured per deployment. 
> The idea is to move all BlockInfo data (as a first step) to this storage using the Flyweight
pattern. The cost to doing it will be in higher latency when accessing/modifying a block.
The idea is that this will be offset with a reduction in memory and in the case of off-heap,
a dramatic reduction in memory (effectively, memory used for BlockInfo would reduce to a very
small constant value).
> This reduction will also have a huge impact on latency, as GC pauses will be reduced
considerably, and may even end up with better latency results than the original code.
> I wrote a stand-alone project as a proof of concept, to show the pattern, the data structure
we can use and what will be the performance costs of this approach.
> see [Slab|https://github.com/langera/slab]
> and [Slab performance results|https://github.com/langera/slab/wiki/Performance-Results].
> Slab abstracts the storage, gives several storage implementations and implements the
flyweight pattern for the application (Namenode in our case).
> The stages to incorporate Slab into the Namenode are outlined in the sub-task JIRAs.

This message was sent by Atlassian JIRA
