hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6709) Implement off-heap data structures for NameNode and other HDFS memory optimization
Date Mon, 28 Jul 2014 04:49:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075894#comment-14075894 ]

Colin Patrick McCabe commented on HDFS-6709:

bq. I'm just asking leading questions to make sure this approach is sound. Y! stands to lose
a lot if this doesn't actually scale

The questions are good... hopefully the answers are too!  I'm just trying to make my answers
as complete as I can.

bq. To clarify the RTTI, I thought you meant more than just a per-instance reference to the
class would be saved - although saving a reference is indeed great

Yeah.  It will shrink objects by 4 or 8 bytes each.  It's not immaterial!  Savings like these
are why I think the off-heap approach will shrink memory consumption.
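
To put a rough number on that, here is a back-of-the-envelope sketch.  The object count is a
made-up assumption for illustration, not a measurement from any cluster, and HeaderSavings
and bytesSaved are hypothetical names that do not appear in the patch.

```java
public class HeaderSavings {
    // Bytes saved if every object drops one class reference of the given width.
    static long bytesSaved(long objectCount, int bytesPerReference) {
        return objectCount * bytesPerReference;
    }

    public static void main(String[] args) {
        // Hypothetical number of INode + BlockInfo objects (assumption, not a measurement).
        long objects = 100_000_000L;
        long mib = 1024L * 1024L;
        System.out.println("4-byte references: " + bytesSaved(objects, 4) / mib + " MiB");
        System.out.println("8-byte references: " + bytesSaved(objects, 8) / mib + " MiB");
    }
}
```

Under that assumed count, the saving would be somewhere between about 381 MiB and 762 MiB,
depending on whether the per-instance reference is 4 or 8 bytes wide.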

bq. Regarding atomicity/CAS, it's relevant because using misalignment (over-optimization?)
prevents adding concurrency to data structures that aren't but should allow concurrency. I

Isn't this a minor implementation detail, though?  We don't currently use atomic ops on these
data structures.  If we go ahead with a layout that uses unaligned access, and someone later
decides to make things atomic, we can always switch to an aligned layout.
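
As a minimal sketch of what "switching layouts" could look like, here is the same record
stored packed (unaligned long) and padded (aligned long) in a direct ByteBuffer.  The record
shape (a flags byte plus an mtime), the offsets, and the class name RecordLayouts are all
assumptions made up for illustration; this is not the layout used in the patch.

```java
import java.nio.ByteBuffer;

public class RecordLayouts {
    // Packed layout: 1-byte flags immediately followed by an 8-byte mtime,
    // so the long lives at offset 1 and is not 8-byte aligned.
    static final int PACKED_MTIME_OFFSET = 1;
    static final int PACKED_SIZE = 9;

    // Aligned layout: flags padded out to 8 bytes so the long lives at offset 8.
    static final int ALIGNED_MTIME_OFFSET = 8;
    static final int ALIGNED_SIZE = 16;

    // Write one record starting at byte position 'base' using the given mtime offset.
    static void write(ByteBuffer buf, int base, int mtimeOffset, byte flags, long mtime) {
        buf.put(base, flags);
        buf.putLong(base + mtimeOffset, mtime);
    }

    // Read the mtime field back from the record starting at 'base'.
    static long readMtime(ByteBuffer buf, int base, int mtimeOffset) {
        return buf.getLong(base + mtimeOffset);
    }

    public static void main(String[] args) {
        ByteBuffer packed = ByteBuffer.allocateDirect(PACKED_SIZE * 4);
        ByteBuffer aligned = ByteBuffer.allocateDirect(ALIGNED_SIZE * 4);
        long mtime = 1406522981000L; // arbitrary example timestamp

        // Store the record in slot 1 of each buffer.
        write(packed, PACKED_SIZE, PACKED_MTIME_OFFSET, (byte) 1, mtime);
        write(aligned, ALIGNED_SIZE, ALIGNED_MTIME_OFFSET, (byte) 1, mtime);

        System.out.println(readMtime(packed, PACKED_SIZE, PACKED_MTIME_OFFSET));
        System.out.println(readMtime(aligned, ALIGNED_SIZE, ALIGNED_MTIME_OFFSET));
    }
}
```

The only things that differ between the two layouts are the offset and size constants, which
is why I'd call it an implementation detail.  The packed form costs 9 bytes per record
instead of 16, and moving to the aligned form later would only mean changing those constants.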

bq. I know about generational collection but I'm admittedly not an expert. Which young gen
GC method does not pause? ParNew+CMS definitively pauses... Here are some quickly gathered
12-day observations from a moderately loaded, multi-thousand node, non-production cluster:

I'm not a GC expert either.  But from what I've read, "does not pause" is a pretty high bar
to clear.  I think even Azul's GC pauses on occasion for sub-millisecond intervals.  For CMS
and G1, everything I've read talks about tuning the young-gen collection in terms of target
pause times.

bq. We have production clusters over 2.5X larger that sustained over 3X ops/sec. This non-prod
cluster is generating ~625MB of garbage/sec. How do you predict dynamic instantiation of INode
and BlockInfo objects will help? They generally won't be promoted to old gen which should
reduce the infrequent CMS collection times. BUT, will it dramatically increase the frequency
of young collection and/or lead to premature tenuring?

If you look at the code, we create temporary objects all over the place.

For example, look at setTimesInt:

  private void setTimesInt(String src, long mtime, long atime)
    throws IOException, UnresolvedLinkException {
    HdfsFileStatus resultingStat = null;
    FSPermissionChecker pc = getPermissionChecker();
    byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
    try {
      checkNameNodeSafeMode("Cannot set times " + src);
      src = FSDirectory.resolvePath(src, pathComponents, dir);

      // Write access is required to set access and modification times
      if (isPermissionEnabled) {
        checkPathAccess(pc, src, FsAction.WRITE);
      }
      final INodesInPath iip = dir.getINodesInPath4Write(src);
      final INode inode = iip.getLastINode();
      // ... (rest of the method omitted from this excerpt)

You can see we create:
HdfsFileStatus (with at least 5 sub-objects; one of those, FsPermission, has 3 sub-objects
of its own)
FSPermissionChecker (which has at least 5 sub-objects inside it)
a new src string
INodesInPath (at least 2 sub-objects of its own)

That's at least 21 temporary objects just in this code snippet, and I'm sure I missed a lot
of things.  I'm not including any of the functions that called or were called by this function,
or any of the RPC or protobuf machinations.  The average path depth is maybe between 5 and
8... would having 5 to 8 extra temporary objects to represent INodes we traversed substantially
increase the GC load?  I would say no.

Maybe you think I've chosen an easy example.  Hmm... the operation that I can think of that
touches the most inodes is recursive delete.  But we've known about the problems with this
for a while... that's why JIRAs like HDFS-2938 addressed the problem.  Arguably, an off-heap
implementation is actually better here since we avoid creating a lot of trash in the tenured
generation.  And trash in the tenured generation leads to heap fragmentation (at least in
CMS), and the dreaded full GC.

> Implement off-heap data structures for NameNode and other HDFS memory optimization
> ----------------------------------------------------------------------------------
>                 Key: HDFS-6709
>                 URL: https://issues.apache.org/jira/browse/HDFS-6709
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-6709.001.patch
> We should investigate implementing off-heap data structures for NameNode and other HDFS
> memory optimization.  These data structures could reduce latency by avoiding the long GC times
> that occur with large Java heaps.  We could also avoid per-object memory overheads and control
> memory layout a little bit better.  This also would allow us to use the JVM's "compressed
> oops" optimization even with really large namespaces, if we could get the Java heap below
> 32 GB for those cases.  This would provide another performance and memory efficiency boost.

This message was sent by Atlassian JIRA
