hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1110) Namenode heap optimization - reuse objects for commonly used file names
Date Mon, 26 Apr 2010 18:45:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861042#action_12861042 ]

Suresh Srinivas commented on HDFS-1110:

- Commonly used file names are defined by a regex in a config file. The config file is preconfigured
with the regexes part-00.* and part-?-00.* for file names created by MapReduce. These cover part-00000
to part-00999, part-m-00000 to part-m-00999, and part-r-00000 to part-r-00999.
- When creating an INodeFile whose name matches the regex, add a file-name-to-byte[] entry to the
dictionary the first time the name is seen. For subsequent INodeFile creations with that name,
reuse the existing byte[].
- On clusters with other common file names, the cluster admin can add those names to the config
file.
- A maximum dictionary size will be set to prevent a poor choice of regex (for example,
.*) from overusing the heap.
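
A minimal sketch of the dictionary described in the bullets above, not the actual HDFS-1110 patch. The class name NameDictionary, the method intern, and the maxEntries parameter are hypothetical. The pattern used in the usage note is also an assumption: in Java regex syntax "part-?-00.*" makes the hyphen optional and would not match part-m-00000, so the sketch assumes "part-.-00.*" is what is intended.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: shares one byte[] per commonly used file name. */
public class NameDictionary {
    private final Pattern pattern;      // regex that defines "common" names
    private final int maxEntries;       // cap so a poor regex such as .* cannot exhaust the heap
    private final Map<String, byte[]> cache = new HashMap<>();

    public NameDictionary(String regex, int maxEntries) {
        this.pattern = Pattern.compile(regex);
        this.maxEntries = maxEntries;
    }

    /**
     * Returns the shared byte[] for a name already in the dictionary.
     * Otherwise encodes the name, and stores the result only if the name
     * matches the regex and the dictionary is below its size cap.
     * Callers must treat a returned array as read-only, since it may be shared.
     */
    public synchronized byte[] intern(String name) {
        byte[] shared = cache.get(name);
        if (shared != null) {
            return shared;
        }
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
        if (cache.size() < maxEntries && pattern.matcher(name).matches()) {
            cache.put(name, bytes);
        }
        return bytes;
    }

    /** Number of names currently held in the dictionary. */
    public synchronized int size() {
        return cache.size();
    }
}
```

With new NameDictionary("part-00.*|part-.-00.*", 1000), two calls to intern("part-r-00005") return the same array instance, while a non-matching name such as "data.txt" gets a fresh array on every call and is never stored.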

Alternative approach:
- During startup, while loading the fsimage, count the number of times each file name occurs
(this uses a lot of heap) and set up the dictionary with the top N recurring names.

The alternative approach has the advantage that a regex file defining the names to add to the
dictionary is not required. However, it does not work when the namenode starts fresh or when
recurring names are added after startup.
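
For comparison, the counting step of the alternative approach could look roughly like the sketch below. The class TopNames and the method topN are hypothetical, and the input is a plain sequence of names rather than a real fsimage. The per-name counter map is where the extra heap cost mentioned above comes from.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: pick the N most frequent file names from a scan. */
public class TopNames {
    public static List<String> topN(Iterable<String> names, int n) {
        // One counter per distinct name; this map is the heap-hungry part.
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        // Sort by descending count, breaking ties by name for a stable result.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The returned names would then seed the dictionary once at startup, which is why names that become common later are missed.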

I am planning to go with approach 1.

> Namenode heap optimization - reuse objects for commonly used file names
> -----------------------------------------------------------------------
>                 Key: HDFS-1110
>                 URL: https://issues.apache.org/jira/browse/HDFS-1110
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>             Fix For: 0.22.0
> There are a lot of common file names used in HDFS, mainly created by mapreduce, such
as file names starting with "part". Reusing byte[] corresponding to these recurring file names
will save significant heap space used for storing the file names in millions of INodeFile

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
