Message-ID: <21964799.1188341010989.JavaMail.jira@brutus>
Date: Tue, 28 Aug 2007 15:43:30 -0700 (PDT)
From: "Raghu Angadi (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1687) Name-node memory size estimates and optimization proposal.
In-Reply-To: <2978721.1186436639247.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523377 ]

Raghu Angadi commented on HADOOP-1687:
--------------------------------------

Since we want to compact BlockInfo, we could also merge the DatanodeDescriptor array and the list-references array. This will save 24 bytes per block. The array would then be Object[3*replication]. Of course, all the hairy details of managing this array would be private to this class.

> Name-node memory size estimates and optimization proposal.
> ----------------------------------------------------------
>
>                 Key: HADOOP-1687
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1687
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.13.1
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>             Fix For: 0.15.0
>
>
> I've done some estimates on how much space our data structures take on the name-node per block, file and directory.
> Brief overview of the data structures:
> The directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks,
> if it corresponds to a file, or to a TreeMap of children INodes, if it is a directory.
> [Note: these estimates were made before Dhruba replaced the children TreeMap with an ArrayList.]
> Each block also participates in at least 2 more data structures.
> BlocksMap contains a HashMap of all blocks, mapping a Block to a BlockInfo.
> DatanodeDescriptor contains a TreeMap of all blocks belonging to this data-node.
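The merged-array idea from the comment above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the committed implementation: the triplet layout (datanode reference plus previous/next list references per replica) and the accessor names are hypothetical, chosen to illustrate how one Object[3*replication] array can replace a DatanodeDescriptor[] plus separate list-reference arrays.

```java
// Sketch of a compacted BlockInfo (hypothetical layout, not the actual patch).
// For replica i:
//   triplets[3*i]   = DatanodeDescriptor holding replica i
//   triplets[3*i+1] = previous BlockInfo in that datanode's block list
//   triplets[3*i+2] = next BlockInfo in that datanode's block list
class DatanodeDescriptor {
    final String name;
    DatanodeDescriptor(String name) { this.name = name; }
}

class BlockInfo {
    long blkid;
    long len;
    // One array instead of DatanodeDescriptor[] + two list-reference arrays,
    // saving the headers and references of the extra arrays per block.
    final Object[] triplets;

    BlockInfo(int replication) {
        triplets = new Object[3 * replication];
    }

    DatanodeDescriptor getDatanode(int i) {
        return (DatanodeDescriptor) triplets[3 * i];
    }

    // All management of the packed layout stays private to this class,
    // as the comment above suggests.
    void setReplica(int i, DatanodeDescriptor node, BlockInfo prev, BlockInfo next) {
        triplets[3 * i]     = node;
        triplets[3 * i + 1] = prev;
        triplets[3 * i + 2] = next;
    }
}
```

With replication 3 the array holds 9 slots; callers never index the array directly, only through the accessors, so the packing can change without touching the rest of the name-node.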
> A block may or may not also be contained in other data structures, like
> {code}
> UnderReplicatedBlocks
> PendingReplicationBlocks
> recentInvalidateSets
> excessReplicateMap
> {code}
> Presence of a block in any of these structures is temporary, and therefore I do not count them in my estimates.
> The estimates can be viewed as lower bounds.
> These are the classes that we are looking at here:
> {code}
> class INode {
>   String name;
>   INode parent;
>   TreeMap children;
>   Block blocks[];
>   short blockReplication;
>   long modificationTime;
> }
> class Block {
>   long blkid;
>   long len;
> }
> class BlockInfo {
>   FSDirectory.INode inode;
>   DatanodeDescriptor[] nodes;
>   Block block;
> }
> {code}
> The calculations are made for a 64-bit Java VM, assuming:
> Reference size = 8 bytes
> Object header size = 16 bytes
> Array header size = 24 bytes
> Commonly used objects:
> TreeMap.Entry = 64 bytes. It has 5 reference fields.
> HashMap.Entry = 48 bytes. It has 3 reference fields.
> String header = 64 bytes.
> The size of a file includes:
> # Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and the file name is empty. (152 bytes)
> # A directory entry in the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
> # The file name length times 2, because String represents each name character by 2 bytes.
> # A reference to the outer FSDirectory class. (8 bytes)
> The total: 224 + 2 * fileName.length
> The size of a directory includes:
> # Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and the file name is empty. (192 bytes)
> # A directory entry in the parent INode that points to this directory, which is a TreeMap.Entry. (64 bytes)
> # The file name length times 2.
> # A reference to the outer FSDirectory class. (8 bytes)
> The total: 264 + 2 * fileName.length
> The size of a block includes:
> # Size of Block. (32 bytes)
> # Size of BlockInfo.
> (64 + 8*replication bytes)
> # A reference to the block from INode.blocks. (8 bytes)
> # A HashMap.Entry referencing the block from BlocksMap. (48 bytes)
> # References to the block from all the DatanodeDescriptors it belongs to.
> This is a TreeMap.Entry size times the block replication. (64 * replication)
> The total: 152 + 72 * replication
> Typical object sizes:
> Taking into account that a typical file name is 10-15 bytes and our default replication is 3, we can say the typical sizes are:
> File size: 250
> Directory size: 290
> Block size: 368
> ||Object||size estimate (bytes)||typical size (bytes)||
> |File|224 + 2 * fileName.length|250|
> |Directory|264 + 2 * fileName.length|290|
> |Block|152 + 72 * replication|368|
> One of our clusters has
> # Files: 10 600 000
> # Dirs: 310 000
> # Blocks: 13 300 000
> Total size (estimate): 7.63 GB
> Memory used on the name-node (actual, reported by jconsole after gc): 9 GB
> This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
> leases, sortedLeases, and the multiple block replication maps occupy ~1.4 GB, which seems pretty high
> and needs to be investigated as well.
> Based on the above estimates, blocks should be the main focus of the name-node memory reduction effort.
> The space used by a block is 50% larger than that of a file, and there are more blocks than files or directories.
> Some ideas on how we can reduce the name-node size without substantially changing the data structures:
> # INode.children should be an ArrayList instead of a TreeMap. Already done in HADOOP-1565. (-48 bytes)
> # Factor out the INode class into a separate class. (-8 bytes)
> # Create a base INode class and derive file inode and directory inode classes from the base.
> Directory inodes do not need to contain the blocks and replication fields. (-16 bytes)
> File inodes do not need to contain the children field. (-8 bytes)
> # The String name should be replaced by a mere byte[].
> (-(40 + fileName.length) ~ -50 bytes)
> # Eliminate the Block object.
> We should move the Block fields into BlockInfo and completely get rid of the Block object. (-16 bytes)
> # The Block object is referenced at least 5 times in our structures for each physical block.
> The number of references should be reduced to just 2. (-24 bytes)
> # Remove the name field from INode. A file or directory name is stored in the corresponding directory
> entry and does not need to be duplicated in the INode. (-8 bytes)
> # Eliminate the INode.parent field. INodes are accessed through the directory tree, and the parent can
> be remembered in a local variable while browsing the tree. There is no need to persistently store
> the parent reference for each object. (-8 bytes)
> # We need to optimize the data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of the
> blocks contained in the node, so we have an overhead of one TreeMap.Entry per block replica.
> I expect we can reorganize datanodeMap in a way that stores only 1 or 2 references per replica
> instead of an entire TreeMap.Entry. (-48 * replication)
> Note: In general, TreeMaps turned out to be very expensive; we should avoid using them if possible,
> or implement a custom map structure that avoids allocating an object for each map entry.
> This is what we will have after all the optimizations:
> ||Object||size estimate (bytes)||typical size (bytes)||current typical size (bytes)||
> |File|112 + fileName.length|125|250|
> |Directory|144 + fileName.length|155|290|
> |Block|112 + 24 * replication|184|368|

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
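The per-object formulas and the example cluster total in the issue description can be cross-checked with a small calculator built from the stated constants. This is only a sketch: the class and method names are hypothetical, and the formulas are copied verbatim from the tables above (a typical file name of 13 bytes reproduces the "typical" column for files and directories).

```java
// Cross-check of the size formulas stated in HADOOP-1687 (64-bit JVM
// constants: reference = 8, object header = 16, array header = 24,
// TreeMap.Entry = 64, HashMap.Entry = 48 bytes). Class/method names are
// hypothetical; the formulas come from the issue's tables.
class NameNodeSizeEstimates {
    // Current layout.
    static int fileSize(int nameLen)      { return 224 + 2 * nameLen; }
    static int dirSize(int nameLen)       { return 264 + 2 * nameLen; }
    static int blockSize(int replication) { return 152 + 72 * replication; }

    // After the proposed optimizations.
    static int optFileSize(int nameLen)      { return 112 + nameLen; }
    static int optDirSize(int nameLen)       { return 144 + nameLen; }
    static int optBlockSize(int replication) { return 112 + 24 * replication; }

    // Lower-bound total for a cluster, in bytes, with the current layout.
    static long clusterTotal(long files, long dirs, long blocks,
                             int nameLen, int replication) {
        return files  * fileSize(nameLen)
             + dirs   * dirSize(nameLen)
             + blocks * blockSize(replication);
    }
}
```

For the cluster quoted above (10 600 000 files, 310 000 dirs, 13 300 000 blocks, replication 3), clusterTotal with a 13-byte name gives 7 634 300 000 bytes, i.e. the ~7.63 GB estimate, leaving ~1.4 GB of the observed 9 GB to the other structures.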