Message-ID: <21964799.1188341010989.JavaMail.jira@brutus>
Date: Tue, 28 Aug 2007 15:43:30 -0700 (PDT)
From: "Raghu Angadi (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1687) Name-node memory size estimates and optimization proposal.
In-Reply-To: <2978721.1186436639247.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523377 ]

Raghu Angadi commented on HADOOP-1687:
--------------------------------------

Since we want to compact BlockInfo, we could also merge the DatanodeDescriptor array and the list-references array. This will save 24 bytes per block. The array would then be Object[3*replication]. Of course, all the hairy details of managing this array would be private to this class.

> Name-node memory size estimates and optimization proposal.
> ----------------------------------------------------------
>
>                 Key: HADOOP-1687
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1687
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.13.1
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>             Fix For: 0.15.0
>
>
> I've done some estimates on how much space our data structures take on the name-node per block, file and directory.
> Brief overview of the data structures:
> The directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks,
> if it corresponds to a file, or to a TreeMap of children INodes, if it is a directory.
> [Note: these estimates were made before Dhruba replaced the children TreeMap with an ArrayList.]
> Each block also participates in at least 2 more data structures.
> BlocksMap contains a HashMap of all blocks, mapping a Block to a BlockInfo.
> DatanodeDescriptor contains a TreeMap of all blocks belonging to this data-node.
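The merged-array idea from the comment above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the committed implementation: the triplet layout (datanode reference plus previous/next list references per replica) and the accessor names are hypothetical, chosen to illustrate how one Object[3*replication] array can replace a DatanodeDescriptor[] plus separate list-reference arrays.

```java
// Sketch of a compacted BlockInfo (hypothetical layout, not the actual patch).
// For replica i:
//   triplets[3*i]   = DatanodeDescriptor holding replica i
//   triplets[3*i+1] = previous BlockInfo in that datanode's block list
//   triplets[3*i+2] = next BlockInfo in that datanode's block list
class DatanodeDescriptor {
    final String name;
    DatanodeDescriptor(String name) { this.name = name; }
}

class BlockInfo {
    long blkid;
    long len;
    // One array instead of DatanodeDescriptor[] + two list-reference arrays,
    // saving the headers and references of the extra arrays per block.
    final Object[] triplets;

    BlockInfo(int replication) {
        triplets = new Object[3 * replication];
    }

    DatanodeDescriptor getDatanode(int i) {
        return (DatanodeDescriptor) triplets[3 * i];
    }

    // All management of the packed layout stays private to this class,
    // as the comment above suggests.
    void setReplica(int i, DatanodeDescriptor node, BlockInfo prev, BlockInfo next) {
        triplets[3 * i]     = node;
        triplets[3 * i + 1] = prev;
        triplets[3 * i + 2] = next;
    }
}
```

With replication 3 the array holds 9 slots; callers never index the array directly, only through the accessors, so the packing can change without touching the rest of the name-node.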
> A block may or may not also be contained in other data structures, like
> {code}
> UnderReplicatedBlocks
> PendingReplicationBlocks
> recentInvalidateSets
> excessReplicateMap
> {code}
> Presence of a block in any of these structures is temporary, and therefore I do not count them in my estimates.
> The estimates can be viewed as lower bounds.
> These are the classes that we are looking at here:
> {code}
> class INode {
>   String name;
>   INode parent;
>   TreeMap children;
>   Block blocks[];
>   short blockReplication;
>   long modificationTime;
> }
> class Block {
>   long blkid;
>   long len;
> }
> class BlockInfo {
>   FSDirectory.INode inode;
>   DatanodeDescriptor[] nodes;
>   Block block;
> }
> {code}
> The calculations are made for a 64-bit Java VM, assuming:
> Reference size = 8 bytes
> Object header size = 16 bytes
> Array header size = 24 bytes
> Commonly used objects:
> TreeMap.Entry = 64 bytes. It has 5 reference fields.
> HashMap.Entry = 48 bytes. It has 3 reference fields.
> String header = 64 bytes.
> The size of a file includes:
> # Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and the file name is empty. (152 bytes)
> # A directory entry in the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
> # The file name length times 2, because String represents each name character by 2 bytes.
> # A reference to the outer FSDirectory class. (8 bytes)
> The total: 224 + 2 * fileName.length
> The size of a directory includes:
> # Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and the file name is empty. (192 bytes)
> # A directory entry in the parent INode that points to this directory, which is a TreeMap.Entry. (64 bytes)
> # The file name length times 2.
> # A reference to the outer FSDirectory class. (8 bytes)
> The total: 264 + 2 * fileName.length
> The size of a block includes:
> # Size of Block. (32 bytes)
> # Size of BlockInfo.
> (64 + 8*replication bytes)
> # A reference to the block from INode.blocks. (8 bytes)
> # A HashMap.Entry referencing the block from BlocksMap. (48 bytes)
> # References to the block from all the DatanodeDescriptors it belongs to.
> This is a TreeMap.Entry size times the block replication. (64 * replication)
> The total: 152 + 72 * replication
> Typical object sizes:
> Taking into account that a typical file name is 10-15 bytes and our default replication is 3, we can say the typical sizes are:
> File size: 250
> Directory size: 290
> Block size: 368
> ||Object||size estimate (bytes)||typical size (bytes)||
> |File|224 + 2 * fileName.length|250|
> |Directory|264 + 2 * fileName.length|290|
> |Block|152 + 72 * replication|368|
> One of our clusters has
> # Files: 10 600 000
> # Dirs: 310 000
> # Blocks: 13 300 000
> Total size (estimate): 7.63 GB
> Memory used on the name-node (actual, reported by jconsole after gc): 9 GB
> This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
> leases, sortedLeases, and the multiple block replication maps occupy ~1.4 GB, which seems pretty high
> and needs to be investigated as well.
> Based on the above estimates, blocks should be the main focus of the name-node memory reduction effort.
> The space used by a block is 50% larger than that of a file, and there are more blocks than files or directories.
> Some ideas on how we can reduce the name-node size without substantially changing the data structures:
> # INode.children should be an ArrayList instead of a TreeMap. Already done in HADOOP-1565. (-48 bytes)
> # Factor out the INode class into a separate class. (-8 bytes)
> # Create a base INode class and derive file inode and directory inode classes from the base.
> Directory inodes do not need to contain the blocks and replication fields. (-16 bytes)
> File inodes do not need to contain the children field. (-8 bytes)
> # The String name should be replaced by a mere byte[].
> (-(40 + fileName.length) ~ -50 bytes)
> # Eliminate the Block object.
> We should move the Block fields into BlockInfo and completely get rid of the Block object. (-16 bytes)
> # The Block object is referenced at least 5 times in our structures for each physical block.
> The number of references should be reduced to just 2. (-24 bytes)
> # Remove the name field from INode. A file or directory name is stored in the corresponding directory
> entry and does not need to be duplicated in the INode. (-8 bytes)
> # Eliminate the INode.parent field. INodes are accessed through the directory tree, and the parent can
> be remembered in a local variable while browsing the tree. There is no need to persistently store
> the parent reference for each object. (-8 bytes)
> # We need to optimize the data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of the
> blocks contained in the node, so we have an overhead of one TreeMap.Entry per block replica.
> I expect we can reorganize datanodeMap in a way that stores only 1 or 2 references per replica
> instead of an entire TreeMap.Entry. (-48 * replication)
> Note: In general, TreeMaps turned out to be very expensive; we should avoid using them if possible,
> or implement a custom map structure that avoids allocating an object for each map entry.
> This is what we will have after all the optimizations:
> ||Object||size estimate (bytes)||typical size (bytes)||current typical size (bytes)||
> |File|112 + fileName.length|125|250|
> |Directory|144 + fileName.length|155|290|
> |Block|112 + 24 * replication|184|368|

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
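The per-object formulas and the example cluster total in the issue description can be cross-checked with a small calculator built from the stated constants. This is only a sketch: the class and method names are hypothetical, and the formulas are copied verbatim from the tables above (a typical file name of 13 bytes reproduces the "typical" column for files and directories).

```java
// Cross-check of the size formulas stated in HADOOP-1687 (64-bit JVM
// constants: reference = 8, object header = 16, array header = 24,
// TreeMap.Entry = 64, HashMap.Entry = 48 bytes). Class/method names are
// hypothetical; the formulas come from the issue's tables.
class NameNodeSizeEstimates {
    // Current layout.
    static int fileSize(int nameLen)      { return 224 + 2 * nameLen; }
    static int dirSize(int nameLen)       { return 264 + 2 * nameLen; }
    static int blockSize(int replication) { return 152 + 72 * replication; }

    // After the proposed optimizations.
    static int optFileSize(int nameLen)      { return 112 + nameLen; }
    static int optDirSize(int nameLen)       { return 144 + nameLen; }
    static int optBlockSize(int replication) { return 112 + 24 * replication; }

    // Lower-bound total for a cluster, in bytes, with the current layout.
    static long clusterTotal(long files, long dirs, long blocks,
                             int nameLen, int replication) {
        return files  * fileSize(nameLen)
             + dirs   * dirSize(nameLen)
             + blocks * blockSize(replication);
    }
}
```

For the cluster quoted above (10 600 000 files, 310 000 dirs, 13 300 000 blocks, replication 3), clusterTotal with a 13-byte name gives 7 634 300 000 bytes, i.e. the ~7.63 GB estimate, leaving ~1.4 GB of the observed 9 GB to the other structures.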