Message-ID: <4933212.14241273551932006.JavaMail.jira@thor>
Date: Tue, 11 May 2010 00:25:32 -0400 (EDT)
From: "Suresh Srinivas (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-1110) Namenode heap optimization - reuse objects for commonly used file names
In-Reply-To: <26851282.14551272307412975.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HDFS-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866052#action_12866052 ]

Suresh Srinivas commented on
HDFS-1110:
---------------------------------------

bq. agree with Dhruba that we need to optimize only for the top ten (or so) file names, which will give us 5% saving in the meta

As I commented earlier, we need a dictionary of at least 500K entries to see gains, not just the top ten file names; see the breakdown from my previous analysis. Regex matching should not be a big issue: the lookup always goes to the dictionary hashmap first (most of the frequently used file names will hit there), and the regex is consulted only for names not found in the dictionary. This happens only while creating a file.

After thinking about this solution, here is an alternate solution I am leaning towards. Let me know what you guys think:
# During startup, maintain two maps. The first is a transient map that counts the number of times each file name has been used, along with the corresponding byte[]. The second is the dictionary map, which maps a file name to its shared byte[].
# While consuming the fsimage and edits log:
#* If the name is found in the dictionary map, use the byte[] corresponding to it.
#* If the name is found in the transient map, increment the number of times the name has been used. Once the name has been used more than the threshold (10 times), delete it from the transient map and promote it to the dictionary.
#* If the name is not found in the transient map, add it to the transient map with its use count set to 1.
#* At the end of consuming the edits log, delete the transient map.

Advantages:
# No configuration files and no regex. Simplified administration.

Disadvantages:
# The dictionary is initialized only during startup. Hence it does not react to and optimize for file names that become popular after startup.
# Impacts startup time due to two hashmap lookups (though this should be a small fraction of disk I/O time during startup).

> Namenode heap optimization - reuse objects for commonly used file names
> -----------------------------------------------------------------------
>
>                 Key: HDFS-1110
>                 URL: https://issues.apache.org/jira/browse/HDFS-1110
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>             Fix For: 0.22.0
>
>         Attachments: hdfs-1110.2.patch, hdfs-1110.patch
>
>
> There are a lot of common file names used in HDFS, mainly created by mapreduce, such as file names starting with "part". Reusing the byte[] corresponding to these recurring file names will save significant heap space used for storing the file names in millions of INodeFile objects.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
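
[Editor's note] A minimal Java sketch of the two-map promotion scheme proposed in the comment above. The class and method names, and the promotion threshold of 10 uses, are illustrative assumptions, not the actual API of the attached patches:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the startup-time name dictionary described in the comment.
 * Names here are hypothetical, not HDFS's actual implementation.
 */
class NameDictionary {
    // Assumed promotion threshold; the comment suggests roughly 10 uses.
    private static final int PROMOTION_THRESHOLD = 10;

    // Transient map: file name -> times seen so far (dropped after startup).
    private Map<String, Integer> useCounts = new HashMap<>();
    // Dictionary map: file name -> the single shared byte[] for that name.
    private final Map<String, byte[]> dictionary = new HashMap<>();
    private boolean initialized = false;

    /** Returns a shared byte[] for common names, else the caller's copy. */
    byte[] put(byte[] name) {
        String key = new String(name, StandardCharsets.UTF_8);
        byte[] shared = dictionary.get(key);
        if (shared != null) {
            return shared;            // dictionary hit: reuse the shared copy
        }
        if (initialized) {
            return name;              // no new promotions after startup
        }
        int count = useCounts.merge(key, 1, Integer::sum);
        if (count >= PROMOTION_THRESHOLD) {
            useCounts.remove(key);    // promote: later puts share this array
            dictionary.put(key, name);
        }
        return name;
    }

    /** Called once the fsimage and edits log have been consumed. */
    void markInitialized() {
        useCounts = null;             // delete the transient counting map
        initialized = true;
    }
}
```

Because promotion happens while the fsimage and edits log are being replayed, frequently recurring names like "part-00000" end up as one shared byte[] referenced by many INodeFile objects, while rare names keep their own copies and cost nothing beyond the transient count during startup.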