hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1110) Namenode heap optimization - reuse objects for commonly used file names
Date Tue, 11 May 2010 04:25:32 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866052#action_12866052 ]

Suresh Srinivas commented on HDFS-1110:
---------------------------------------

bq. agree with Dhruba that we need to optimize only for the top ten (or so) file names, which
will give us 5% saving in the meta

As I commented earlier, we need a dictionary of at least 500K entries to see gains, not
just the top ten file names; see the breakdown in my previous analysis. Regex matching should
not be a big issue: the lookup always goes to the dictionary hashmap first (most of the frequently
used file names will hit it), and the regex is consulted only for names not found in the dictionary.
This happens only while creating a file.
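The lookup order described above can be sketched as follows. This is an illustrative sketch, not HDFS code: the class and method names are hypothetical, and the "part-" pattern stands in for whatever regex an administrator would configure.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch: frequently used file names hit the dictionary map
// directly; the regex is consulted only on a miss, and only at
// file-creation time, to decide whether the name is worth caching.
class NameDictionary {
    private final Map<String, byte[]> dictionary = new ConcurrentHashMap<>();

    // Illustrative pattern for mapreduce-style output names; a real
    // deployment would load this from configuration.
    private final Pattern commonName = Pattern.compile("part-\\d+");

    byte[] intern(String name) {
        byte[] cached = dictionary.get(name);   // fast path: dictionary hit
        if (cached != null) {
            return cached;
        }
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
        if (commonName.matcher(name).matches()) {
            // Miss, but the regex says this name recurs; cache its byte[].
            dictionary.put(name, bytes);
        }
        return bytes;
    }
}
```

Repeated creations of a matching name then share one byte[], while one-off names pay only a single failed map lookup plus the regex match.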

After thinking about this, here is an alternative solution I am leaning towards. Let
me know what you think:
# During startup, maintain two maps: a transient map that counts the number of times each
file name has been used and holds the corresponding byte[], and the dictionary map from
file name to byte[].
# While consuming the fsimage and edits log:
#* If the name is found in the dictionary map, use the byte[] corresponding to it.
#* If the name is found in the transient map, increment its use count. If the name has been
used more than a threshold (say, 10 times), delete it from the transient map and promote
it to the dictionary.
#* If the name is not found in the transient map, add it to the transient map with a use
count of 1.
# At the end of consuming the edits log, discard the transient map.
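The steps above can be sketched as a small interner. Again, the names and the threshold of 10 are illustrative, not actual HDFS code; a real implementation inside the namenode would hang off the image/edits loading path.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of the startup-time promotion scheme: a transient counting map
// feeds a permanent dictionary, and the transient map is discarded once
// the fsimage and edits log have been fully consumed.
class StartupNameInterner {
    static final int PROMOTION_THRESHOLD = 10;  // illustrative value

    private final Map<String, Integer> transientCounts = new HashMap<>();
    private final Map<String, byte[]> dictionary = new HashMap<>();

    byte[] intern(String name) {
        byte[] cached = dictionary.get(name);
        if (cached != null) {
            return cached;                       // dictionary hit: reuse byte[]
        }
        // Transient map: bump the use count for this name.
        int count = transientCounts.merge(name, 1, Integer::sum);
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
        if (count > PROMOTION_THRESHOLD) {
            // Used more than the threshold: promote to the dictionary.
            transientCounts.remove(name);
            dictionary.put(name, bytes);
        }
        return bytes;
    }

    // Called after the fsimage and edits log have been consumed.
    void finishStartup() {
        transientCounts.clear();
    }

    int dictionarySize() {
        return dictionary.size();
    }
}
```

After `finishStartup()`, only the dictionary survives, which is why the scheme needs no configuration but also cannot adapt to names that become popular later.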

Advantages:
# No configuration files and no regexes; simplified administration.

Disadvantages:
# The dictionary is initialized only during startup, so it does not react to or optimize for
file names that become popular after startup.
# Startup time is impacted by the two hashmap lookups (though this should be a small fraction
of the disk I/O time during startup).


> Namenode heap optimization - reuse objects for commonly used file names
> -----------------------------------------------------------------------
>
>                 Key: HDFS-1110
>                 URL: https://issues.apache.org/jira/browse/HDFS-1110
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>             Fix For: 0.22.0
>
>         Attachments: hdfs-1110.2.patch, hdfs-1110.patch
>
>
> There are a lot of common file names used in HDFS, mainly created by mapreduce, such
as file names starting with "part". Reusing byte[] corresponding to these recurring file names
will save significant heap space used for storing the file names in millions of INodeFile
objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

