Message-ID: <4933212.14241273551932006.JavaMail.jira@thor>
Date: Tue, 11 May 2010 00:25:32 -0400 (EDT)
From: "Suresh Srinivas (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-1110) Namenode heap optimization - reuse objects for commonly used file names
In-Reply-To: <26851282.14551272307412975.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HDFS-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866052#action_12866052 ]

Suresh Srinivas commented on
HDFS-1110:
---------------------------------------

bq. agree with Dhruba that we need to optimize only for the top ten (or so) file names, which will give us 5% saving in the meta

As I commented earlier, we need a dictionary of at least 500K entries to see gains, not just the top ten file names; see the breakdown from my previous analysis. Regex matching should not be a big issue: the lookup always goes to the dictionary hashmap first (most of the frequently used file names will hit there), and the regex is consulted only for names not found in the dictionary. This happens only while creating a file.

After thinking about this solution, here is an alternate solution I am leaning towards. Let me know what you guys think:
# During startup, maintain two maps. The first is a transient map that counts the number of times each file name has been used, along with the corresponding byte[]. The second is the dictionary map, which maps a file name to its shared byte[].
# While consuming the fsimage and edits log:
#* If the name is found in the dictionary map, use the byte[] corresponding to it.
#* If the name is found in the transient map, increment the number of times the name has been used. Once the name has been used more than the threshold (10 times), delete it from the transient map and promote it to the dictionary.
#* If the name is not found in the transient map, add it to the transient map with its use count set to 1.
#* At the end of consuming the edits log, delete the transient map.

Advantages:
# No configuration files and no regex. Simplified administration.

Disadvantages:
# The dictionary is initialized only during startup. Hence it does not react to and optimize for file names that become popular after startup.
# Impacts startup time due to two hashmap lookups (though this should be a small fraction of disk I/O time during startup).

> Namenode heap optimization - reuse objects for commonly used file names
> -----------------------------------------------------------------------
>
>                 Key: HDFS-1110
>                 URL: https://issues.apache.org/jira/browse/HDFS-1110
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>             Fix For: 0.22.0
>
>         Attachments: hdfs-1110.2.patch, hdfs-1110.patch
>
>
> There are a lot of common file names used in HDFS, mainly created by mapreduce, such as file names starting with "part". Reusing the byte[] corresponding to these recurring file names will save significant heap space used for storing the file names in millions of INodeFile objects.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
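
[Editor's note] A minimal Java sketch of the two-map promotion scheme proposed in the comment above. The class and method names, and the promotion threshold of 10 uses, are illustrative assumptions, not the actual API of the attached patches:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the startup-time name dictionary described in the comment.
 * Names here are hypothetical, not HDFS's actual implementation.
 */
class NameDictionary {
    // Assumed promotion threshold; the comment suggests roughly 10 uses.
    private static final int PROMOTION_THRESHOLD = 10;

    // Transient map: file name -> times seen so far (dropped after startup).
    private Map<String, Integer> useCounts = new HashMap<>();
    // Dictionary map: file name -> the single shared byte[] for that name.
    private final Map<String, byte[]> dictionary = new HashMap<>();
    private boolean initialized = false;

    /** Returns a shared byte[] for common names, else the caller's copy. */
    byte[] put(byte[] name) {
        String key = new String(name, StandardCharsets.UTF_8);
        byte[] shared = dictionary.get(key);
        if (shared != null) {
            return shared;            // dictionary hit: reuse the shared copy
        }
        if (initialized) {
            return name;              // no new promotions after startup
        }
        int count = useCounts.merge(key, 1, Integer::sum);
        if (count >= PROMOTION_THRESHOLD) {
            useCounts.remove(key);    // promote: later puts share this array
            dictionary.put(key, name);
        }
        return name;
    }

    /** Called once the fsimage and edits log have been consumed. */
    void markInitialized() {
        useCounts = null;             // delete the transient counting map
        initialized = true;
    }
}
```

Because promotion happens while the fsimage and edits log are being replayed, frequently recurring names like "part-00000" end up as one shared byte[] referenced by many INodeFile objects, while rare names keep their own copies and cost nothing beyond the transient count during startup.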