hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2003) Separate FSEditLog reading logic from editLog memory state building logic
Date Fri, 03 Jun 2011 23:35:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044129#comment-13044129 ]

Todd Lipcon commented on HDFS-2003:

h3. Test setup
I ran performance tests with an fsimage/edits pair I had from a real-life cluster. The fsimage
is about 2GB and has 12.5M files, and the edit log is exactly 2GB (I truncated it with dd
to that length). I ran the NN with the following JVM options: -Xms14g -Xmx14g -XX:+UseCompressedOops.

h3. With Parallel (default) GC:
I loaded the edit log 3 times each with the patch and without the patch from a local SATA

Without the patch, the log loaded in 84 seconds (consistent across the three runs). With the
patch, it loaded in 87 seconds, also consistent across the three runs.

h3. With CMS GC:
I then added the JVM option: -XX:+UseConcMarkSweepGC, since that's more likely the GC in use
on most large clusters.

With the patch: Loaded in 86 seconds and incurred 213 young generation collections while loading
the edit log, which added up to a total of 2.208 seconds in young gen GC.
Without the patch: 84 seconds, 211 young gen GCs, adding up to 2.174 seconds.
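
The comment does not say how the collection counts and times above were gathered (GC logging is one possibility). As a rough sketch of another way to get comparable numbers from inside the JVM, the standard {{java.lang.management}} beans can be sampled before and after a piece of work. The class name {{GcDelta}} and the stand-in workload are hypothetical and are not part of the patch. Collector names depend on the JVM and GC flags (for example, the young generation collector is typically reported as "PS Scavenge" under the parallel collector and "ParNew" under CMS), so the sketch reports every collector by name rather than guessing which one is the young generation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class GcDelta {
    /** Cumulative {count, timeMillis} per collector since JVM start. */
    static Map<String, long[]> snapshot() {
        Map<String, long[]> result = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined; treat that as 0.
            long count = Math.max(0, gc.getCollectionCount());
            long millis = Math.max(0, gc.getCollectionTime());
            result.put(gc.getName(), new long[] {count, millis});
        }
        return result;
    }

    /** Runs the work and returns the per-collector increase in {count, timeMillis}. */
    static Map<String, long[]> measure(Runnable work) {
        Map<String, long[]> before = snapshot();
        work.run();
        Map<String, long[]> after = snapshot();
        Map<String, long[]> delta = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> e : after.entrySet()) {
            long[] b = before.getOrDefault(e.getKey(), new long[] {0, 0});
            long[] a = e.getValue();
            delta.put(e.getKey(), new long[] {a[0] - b[0], a[1] - b[1]});
        }
        return delta;
    }

    public static void main(String[] args) {
        Map<String, long[]> delta = measure(() -> {
            // Stand-in workload (hypothetical): many short-lived allocations,
            // loosely imitating per-op objects created during edit log replay.
            long sink = 0;
            for (int i = 0; i < 5_000_000; i++) {
                byte[] b = new byte[64];
                sink += b.length;
            }
            System.out.println("allocated bytes: " + sink);
        });
        for (Map.Entry<String, long[]> e : delta.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue()[0]
                    + " collections, " + e.getValue()[1] + " ms");
        }
    }
}
```

In a real measurement the {{Runnable}} would wrap the edit log load itself, and the run would use the same heap and GC flags as the test setup.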

h3. Summary

The patch seems to have a very marginal impact on the amount of time spent in GC, which makes
sense since the objects are very short-lived and young-generation GC time is proportional
to live object size, not garbage size. The patch seems to have about a 3-4% negative impact
on overall wall clock time of loading the log.
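
As a quick check of these figures, the arithmetic can be reproduced from the timings reported above. This is only a sketch; the class and method names are hypothetical. With the reported numbers, the slowdown comes out at roughly 3.6% under the parallel collector and roughly 2.4% under CMS, and young generation GC accounts for about 2.6% of the load time with or without the patch.

```java
import java.util.Locale;

public class LoadOverhead {
    /** Relative slowdown of the patched run versus the baseline, in percent. */
    static double slowdownPercent(double baselineSec, double patchedSec) {
        return (patchedSec - baselineSec) / baselineSec * 100.0;
    }

    /** Share of the total load time spent in young generation GC, in percent. */
    static double gcSharePercent(double gcSec, double totalSec) {
        return gcSec / totalSec * 100.0;
    }

    public static void main(String[] args) {
        // Timings copied from the comment: 84s/87s (parallel GC), 84s/86s (CMS).
        System.out.println(String.format(Locale.ROOT,
                "Parallel GC slowdown: %.2f%%", slowdownPercent(84, 87))); // 3.57%
        System.out.println(String.format(Locale.ROOT,
                "CMS slowdown: %.2f%%", slowdownPercent(84, 86)));         // 2.38%
        // Young gen GC totals under CMS: 2.208s of 86s and 2.174s of 84s.
        System.out.println(String.format(Locale.ROOT,
                "GC share with patch: %.2f%%", gcSharePercent(2.208, 86)));    // 2.57%
        System.out.println(String.format(Locale.ROOT,
                "GC share without patch: %.2f%%", gcSharePercent(2.174, 84))); // 2.59%
    }
}
```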

Do you guys think this is acceptable? In most of the clusters I see, edit logs tend to be
much smaller than this, and startup time is dominated by loading the image and collecting
block reports, not edits replay. So, I tend to think the improved code cleanliness of this
patch is worth the perf hit.

> Separate FSEditLog reading logic from editLog memory state building logic
> -------------------------------------------------------------------------
>                 Key: HDFS-2003
>                 URL: https://issues.apache.org/jira/browse/HDFS-2003
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Ivan Kelly
>            Assignee: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>         Attachments: HDFS-2003.diff, HDFS-2003.diff, HDFS-2003.diff
> Currently FSEditLogLoader has code for reading from an InputStream interleaved with code
which updates the FSNameSystem and FSDirectory. This makes it difficult to read an edit log
without having a whole load of other objects initialised, which is problematic if you want
to do things like count how many transactions are in a file etc.
> This patch separates the reading of the stream and the building of the memory state.
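
The separation described in the issue can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class {{EditLogSketch}}, the {{Op}} record, and the one-byte-opcode-plus-txid layout are hypothetical and are neither the real edit log format nor the API introduced by the patch. The point is only that once decoding is isolated from state building, a caller can count transactions by passing a no-op consumer, with no namesystem or directory objects initialised.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

public class EditLogSketch {
    /** Hypothetical op record; real edit log ops carry many more fields. */
    static final class Op {
        final byte opcode;
        final long txid;
        Op(byte opcode, long txid) { this.opcode = opcode; this.txid = txid; }
    }

    /**
     * Reading side: decodes ops and hands each one to the sink. It knows nothing
     * about in-memory namespace state. Returns the number of ops decoded.
     */
    static long readOps(DataInputStream in, Consumer<Op> sink) throws IOException {
        long count = 0;
        while (true) {
            byte opcode;
            try {
                opcode = in.readByte();
            } catch (EOFException e) {
                return count; // clean end of stream between ops
            }
            long txid = in.readLong(); // EOF here means a truncated op and propagates
            sink.accept(new Op(opcode, txid));
            count++;
        }
    }

    /** Counts ops without building any state: the sink simply ignores them. */
    static long countOps(byte[] log) {
        try {
            return readOps(new DataInputStream(new ByteArrayInputStream(log)), op -> { });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a toy log of n ops in the made-up format used by this sketch. */
    static byte[] toyLog(int n) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (int i = 0; i < n; i++) {
                out.writeByte(1);     // made-up opcode
                out.writeLong(i + 1); // made-up transaction id
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("transactions: " + countOps(toyLog(3))); // transactions: 3
    }
}
```

A state-building loader would call the same {{readOps}} with a consumer that applies each op to the namespace, so both uses share a single decoding path.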

