hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-1110) Namenode heap optimization - reuse objects for commonly used file names
Date Fri, 30 Apr 2010 01:42:55 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Srinivas updated HDFS-1110:
----------------------------------

    Attachment: hdfs-1110.2.patch

bq. What are the names of these 24 files? Do they fall under the proposed default pattern?
How big is the noise if we use the default pattern?
Of the 24, 22 are part-* files.

bq. we need to optimize only for the top ten (or so) file names, which will give us 5% saving
in the meta-data memory footprint
I do not think the top 10 will save 5% of the meta-data memory footprint. See the posted results below.

I had a bug in my previous calculation that made the savings seem too good to be true. With
47 million files optimized to use the dictionary, a saving of 10 bytes per file gives 470MB, not
4.7GB :-) Also, I did not account for the byte[] overhead of 24 bytes.
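
The corrected arithmetic can be checked with a short sketch. The class and method names are hypothetical, and decimal megabytes (10^6 bytes) are assumed. The second figure rests on an assumption not stated in this message: that each deduplicated file also avoids the 24-byte byte[] overhead.

```java
public class SavingsEstimate {
    // Total bytes saved when each of fileCount files saves bytesPerFile bytes.
    // multiplyExact throws instead of silently overflowing.
    static long savedBytes(long fileCount, long bytesPerFile) {
        return Math.multiplyExact(fileCount, bytesPerFile);
    }

    public static void main(String[] args) {
        long files = 47_000_000L; // files optimized to use the dictionary

        // 10 bytes saved per file, as in the corrected calculation.
        long nameOnly = savedBytes(files, 10);
        System.out.println(nameOnly / 1_000_000 + " MB"); // prints "470 MB"

        // Assumption (not from the message): the 24-byte byte[] overhead
        // is also saved for every deduplicated file.
        long withOverhead = savedBytes(files, 10 + 24);
        System.out.println(withOverhead / 1_000_000 + " MB"); // prints "1598 MB"
    }
}
```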

Anyway, the new patch includes a tool, NamespaceDedupe. You can run it on an fsimage to see
the frequency of occurrence of names and the savings in heap size. Dhruba, you can run this on
images from your production cluster to see how the savings compare with what I have posted below.

23 names are used by 3343781 between 100000 and 360461 times. Saved space 114962311
468 names are used by 12944154 between 10000 and 100000 times. Saved space 448255164
4335 names are used by 10522601 between 1000 and 10000 times. Saved space 391364352
40031 names are used by 10654372 between 100 and 1000 times. Saved space 382273386
403974 names are used by 10722689 between 10 and 100 times. Saved space 354416484
Total saved space 1691271697
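
A histogram of this shape could be produced roughly as follows. This is a minimal sketch and not the NamespaceDedupe tool from the patch: the class and method names are hypothetical, names are read from an in-memory list rather than an fsimage, and the bucket boundaries (lower bound inclusive, upper bound exclusive) are an assumption. It does not attempt to compute saved space.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class NameFrequencyBuckets {
    // Count how many times each file name occurs.
    static Map<String, Integer> countNames(List<String> names) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    // Group names into decade buckets [10,100), [100,1000), ...
    // Key: bucket lower bound. Value: {distinct names, total uses}.
    static SortedMap<Integer, long[]> bucketByDecade(Map<String, Integer> counts) {
        SortedMap<Integer, long[]> buckets = new TreeMap<>();
        for (int uses : counts.values()) {
            if (uses < 10) {
                continue; // names used fewer than 10 times are not reported
            }
            int lower = 10;
            while ((long) lower * 10 <= uses) {
                lower *= 10;
            }
            long[] bucket = buckets.computeIfAbsent(lower, k -> new long[2]);
            bucket[0]++;       // one more distinct name in this bucket
            bucket[1] += uses; // files using that name
        }
        return buckets;
    }

    public static void main(String[] args) {
        String[] sample = new String[115];
        Arrays.fill(sample, 0, 12, "part-00000");   // used 12 times
        Arrays.fill(sample, 12, 112, "part-00001"); // used 100 times
        Arrays.fill(sample, 112, 115, "_logs");     // used 3 times, skipped

        bucketByDecade(countNames(Arrays.asList(sample))).forEach((lower, b) ->
            System.out.println(b[0] + " names are used by " + b[1]
                + " between " + lower + " and " + (long) lower * 10 + " times."));
    }
}
```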


> Namenode heap optimization - reuse objects for commonly used file names
> -----------------------------------------------------------------------
>
>                 Key: HDFS-1110
>                 URL: https://issues.apache.org/jira/browse/HDFS-1110
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>             Fix For: 0.22.0
>
>         Attachments: hdfs-1110.2.patch, hdfs-1110.patch
>
>
> There are a lot of common file names used in HDFS, mainly created by mapreduce, such
> as file names starting with "part". Reusing byte[] corresponding to these recurring file names
> will save significant heap space used for storing the file names in millions of INodeFile
> objects.
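
The idea in the description can be illustrated with a minimal sketch of a name dictionary that hands out one shared byte[] per distinct name. This is not the code in the attached patch: NameDictionary, Key, and intern are hypothetical names, and the sketch ignores concurrency and the question of which names are worth caching.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class NameDictionary {
    // byte[] uses identity equals/hashCode, so wrap it to compare by content.
    private static final class Key {
        final byte[] bytes;

        Key(byte[] bytes) {
            this.bytes = bytes;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    private final Map<Key, byte[]> canonical = new HashMap<>();

    // Return the shared array for this name, registering it on first use.
    // Callers must treat the returned array as read-only.
    byte[] intern(byte[] name) {
        byte[] existing = canonical.putIfAbsent(new Key(name), name);
        return existing != null ? existing : name;
    }

    int size() {
        return canonical.size();
    }

    public static void main(String[] args) {
        NameDictionary dict = new NameDictionary();
        byte[] first = dict.intern("part-00000".getBytes(StandardCharsets.UTF_8));
        byte[] second = dict.intern("part-00000".getBytes(StandardCharsets.UTF_8));
        System.out.println(first == second); // prints "true": one array is shared
    }
}
```

Because the array is shared between all users of a name, this only works if nothing mutates a name in place.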

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

