hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1110) Namenode heap optimization - reuse objects for commonly used file names
Date Sat, 05 Jun 2010 00:18:01 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875805#action_12875805 ]

Konstantin Shvachko commented on HDFS-1110:

# variable {{cache}} should read {{nameCache}}
# comment for it should be transformed to JavaDoc comments.
# {{FSDirectory.cache}} should be initialized in the constructor rather than during declaration.
And 10 should be declared as a constant.
# I would consider using NameCache<byte[]> instead of NameCache<ByteArray>.
You get fewer objects and conversions, unless of course I missed something here.
# Introduce {{FSDirectory.cache(INode)}} method, which calls NameCache.put().
# In NameCache some comments need clarification
#- "This class has two phases"
Probably something else has 2 phases.
#- "This class must be synchronized externally"
#- Member inline comments should be transformed into javadoc.
# NameCache.cache should be initialized in the constructor rather than during declaration.
# {{UseCount}} should probably be a private inner (rather than static) class, 
and should use the same parameter K with which NameCache<K> is parametrized.
private class UseCount {
    int count;  // Number of times a name occurs
    final K value;  // Internal value for the name

    UseCount(final K value) {
      this.value = value;
    }
}
# {{UseCount.count}} should be initialized in the constructor. It is better to have increment()
and get() methods rather than accessing count directly from the outside.
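
The suggestions in the list above can be put together in a minimal sketch. This is not the code from the attached patches: the class name {{NameCacheSketch}}, the {{transientMap}} field and the exact promotion logic are assumptions made for illustration. It shows a private inner {{UseCount}} parametrized by the same K, a counter initialized in the constructor and accessed only through increment() and get(), and maps initialized in the constructor rather than at declaration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names and promotion logic are assumptions, not the patch. */
class NameCacheSketch<K> {

  /** Tracks how often a not-yet-promoted name has been seen. */
  private class UseCount {
    private int count;       // Number of times a name occurs
    private final K value;   // Internal value for the name

    UseCount(final K value) {
      this.count = 1;        // Initialized in the constructor, as the review asks
      this.value = value;
    }

    void increment() { count++; }

    int get() { return count; }
  }

  private final int useThreshold;
  private final Map<K, UseCount> transientMap;  // Names seen fewer than useThreshold times
  private final Map<K, K> cache;                // Promoted names, mapped to the shared instance

  NameCacheSketch(final int useThreshold) {
    this.useThreshold = useThreshold;
    this.transientMap = new HashMap<K, UseCount>();
    this.cache = new HashMap<K, K>();
  }

  /**
   * Records one use of a name.
   * @return the shared instance if the name is (now) promoted, otherwise null
   */
  K put(final K name) {
    K internal = cache.get(name);
    if (internal != null) {
      return internal;
    }
    UseCount useCount = transientMap.get(name);
    if (useCount == null) {
      useCount = new UseCount(name);
      transientMap.put(name, useCount);
    } else {
      useCount.increment();
    }
    if (useCount.get() >= useThreshold) {
      // Promote: later lookups return this one instance.
      transientMap.remove(name);
      cache.put(useCount.value, useCount.value);
      return useCount.value;
    }
    return null;
  }

  /** Number of promoted names. */
  int size() {
    return cache.size();
  }
}
```

The sketch relies on K having value-based equals() and hashCode(), as String does. A raw byte[] key hashes by identity in a HashMap, so using NameCache<byte[]> with a hash-based map would need a wrapper or a different lookup structure.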

I like the idea of using the useThreshold to determine names that should be promoted to the
My main concern is that the threshold is 10. This means there will be a lot of names in the
And all these names are in a HashTable, which has a huge overhead, as we know from another
We still save space, but for names that occur only 10 times the savings are probably negligible.

I would imagine that only 5% or 10% of the most frequently used names get promoted.
It is fine with me to use this simple promoting scheme as a starting point, with the intention
to optimize it later. But I would increase the useThreshold to 1000 or so.
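
To make the effect of the threshold concrete, here is a small sketch that computes the share of distinct names that would be promoted for a given useThreshold. The class {{PromotionEstimate}} and the use counts in main() are invented for illustration and are not measurements from any cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper; not part of the patch. */
class PromotionEstimate {

  /** Fraction of distinct names whose use count reaches the threshold. */
  static double promotedFraction(Map<String, Integer> useCounts, int useThreshold) {
    if (useCounts.isEmpty()) {
      return 0.0;
    }
    int promoted = 0;
    for (int count : useCounts.values()) {
      if (count >= useThreshold) {
        promoted++;
      }
    }
    return (double) promoted / useCounts.size();
  }

  public static void main(String[] args) {
    // Made-up counts, chosen only to show how the threshold changes the result.
    Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    counts.put("part-00000", 50000);
    counts.put("_logs", 1200);
    counts.put("data.csv", 12);
    counts.put("report-2010-06.txt", 1);

    System.out.println(promotedFraction(counts, 10));    // 3 of 4 names promoted
    System.out.println(promotedFraction(counts, 1000));  // 2 of 4 names promoted
  }
}
```

Running the same calculation over real name counts from an fsimage would show which threshold gets close to the 5% to 10% range mentioned above.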

Should we make it configurable? Could be useful for testing.
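
One possible shape for a configurable threshold is sketched below. It uses java.util.Properties so that it runs on its own; in the NameNode the value would presumably come from the Hadoop Configuration object instead. The key name {{example.namecache.use.threshold}} is a placeholder invented here, and the default of 10 is the value currently hard-coded in the patch.

```java
import java.util.Properties;

/** Illustrative sketch; the key name is a placeholder, not a real HDFS setting. */
class NameCacheConfig {
  static final String USE_THRESHOLD_KEY = "example.namecache.use.threshold";
  static final int USE_THRESHOLD_DEFAULT = 10;

  /** Reads the promotion threshold, falling back to the default when unset. */
  static int useThreshold(Properties conf) {
    String raw = conf.getProperty(USE_THRESHOLD_KEY);
    if (raw == null) {
      return USE_THRESHOLD_DEFAULT;
    }
    int value = Integer.parseInt(raw.trim());
    if (value < 1) {
      throw new IllegalArgumentException(
          USE_THRESHOLD_KEY + " must be at least 1, got " + value);
    }
    return value;
  }
}
```

A test could then set the key to a small value such as 2 to exercise promotion without creating thousands of files.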

> Namenode heap optimization - reuse objects for commonly used file names
> -----------------------------------------------------------------------
>                 Key: HDFS-1110
>                 URL: https://issues.apache.org/jira/browse/HDFS-1110
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>             Fix For: 0.22.0
>         Attachments: hdfs-1110.2.patch, hdfs-1110.3.patch, hdfs-1110.4.patch, hdfs-1110.patch
> There are a lot of common file names used in HDFS, mainly created by mapreduce, such
> as file names starting with "part". Reusing byte[] corresponding to these recurring file names
> will save significant heap space used for storing the file names in millions of INodeFile
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
