hadoop-hdfs-issues mailing list archives

From "Chris Trezzo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
Date Wed, 02 Dec 2015 17:17:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036168#comment-15036168 ]

Chris Trezzo commented on HDFS-8791:

If I have my Twitter hat on, like [~jrottinghuis] said, we have already gained the benefits
of this patch internally because we have back-ported it to our 2.6.2 branch. From that perspective,
I would be happy if this patch simply made it into the next branch-2 minor release.

On the other hand, if I have my community hat on, I am wondering how many hadoop users would
want this patch and, if that group is large enough, what is the best way to get the patch
to them on a stable release.

1. How many people would want this patch?: I think this will affect all hadoop clusters running
ext4 that have seen over 16 million blocks written across the entire cluster over its lifespan.
As a reminder, datanode startup time, and potentially the I/O performance of user-level containers,
will start to degrade before this point (as the directory structure grows, the impact becomes
greater). I would say that most large hadoop users fall into this category. My guess is that
a non-trivial number of production hadoop clusters at medium-size users would fall into this
category as well. [~andrew.wang] I am sure you would have a better sense for how many production
clusters this would affect.
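One way to see where the 16-million figure comes from: in the 256x256 layout, the leaf directory
is derived from bits 8-23 of the block ID, and block IDs are allocated sequentially, so after
roughly 2^24 (about 16.7M) blocks every one of the 64K leaf directories has been created. A
minimal sketch of that mapping (modeled on the datanode's id-to-directory logic, simplified
and not the actual source):

```java
// Sketch of the 256x256 layout's block-ID-to-directory mapping.
// Simplified illustration, not the actual Hadoop source.
public class BlockDirSketch {
    // Bits 16-23 pick the first-level subdir, bits 8-15 the second level.
    static int level1(long blockId) { return (int) ((blockId >> 16) & 0xFF); }
    static int level2(long blockId) { return (int) ((blockId >> 8) & 0xFF); }

    static String leafDir(long blockId) {
        return "subdir" + level1(blockId) + "/subdir" + level2(blockId);
    }

    public static void main(String[] args) {
        // Block IDs are handed out sequentially, so bits 8-23 cycle through
        // all 256*256 = 65,536 values once ~2^24 blocks have been written.
        System.out.println(leafDir(0x123456L)); // subdir18/subdir52
        System.out.println(1L << 24);           // 16777216 -- the ~16M threshold
    }
}
```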

2. How do we get this patch out to users on a stable release?: I definitely understand the
desire to avoid a layout change as part of a maintenance release, but I also think it would
be nice to have a stable release that users could deploy with this patch. Here is one potential approach:
* Since 2.8 is cut but not released, rename the 2.8 branch to 2.9 and continue with the release
schedule it is currently on.
* Cut a new 2.8 branch off of 2.7.3 and apply this patch to this "new" 2.8.
* Going forward:
** People who are averse to making the layout change can continue doing maintenance releases
on the 2.7 line. My guess is that this is a small group and that the 2.7 branch will essentially go dormant.
** Maintenance releases can continue on the new 2.8 branch as they would have for the 2.7
branch. People that were on 2.7 should be able to easily move to 2.8 because it is essentially
a maintenance release plus the new layout.
* I would say that there is no need to back-port the layout change to the 2.6 branch if we
have a stable 2.8 that users can upgrade to.

With this scenario we get a stable release with the new layout (i.e. the new 2.8 branch) and
we avoid making a layout change in a maintenance release. Thoughts?

> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0, 2.8.0, 2.7.1
>            Reporter: Nathan Roberts
>            Assignee: Chris Trezzo
>            Priority: Blocker
>         Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 32x32DatanodeLayoutTesting-v2.pdf,
HDFS-8791-trunk-v1.patch, HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v2.patch,
> We are seeing cases where the new directory layout causes the datanode to make the disks
seek for 10s of minutes. This can be when the datanode is running du, and it can also be when
it is performing a checkDirs(). Both of these operations currently scan all directories in
the block pool, and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories
where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
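To put a rough number on that (assuming the common 4 KiB ext4 block size; this is an assumed
figure, not measured from a real system), the leaf directory blocks alone amount to about 256
MiB of metadata that must stay resident in the buffer cache to avoid seeks:

```java
// Back-of-envelope sizing of the directory metadata described above.
// Assumes 4 KiB ext4 directory blocks (a typical default); purely illustrative.
public class DirBlockFootprint {
    public static void main(String[] args) {
        int leafDirs = 256 * 256;                 // 65,536 second-level directories
        long blockSize = 4096L;                   // assumed ext4 block size, bytes
        long bytes = leafDirs * blockSize;        // directory blocks for leaves only
        System.out.println(leafDirs);             // 65536
        System.out.println(bytes / (1024 * 1024)); // 256 (MiB of directory blocks)
    }
}
```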
> inodes and dentries will be cached by Linux, and one can configure how likely the system
is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to
cache the directory blocks, and I'm not aware of any way to tell Linux to favor buffer-cache
pages (even if there were, I'm not sure I would want it in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, which basically
means the 64K directory blocks are probably randomly spread across the entire disk. A du-type
scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding
seeks, meaning the seeks will be random and far.
> In a system I was using to diagnose this, I had 60K blocks. A du when things are hot takes
less than 1 second; when things are cold, about 20 minutes.
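The cold-scan number is consistent with simple seek arithmetic. Assuming roughly 12 ms per
random access on a spinning disk (an assumed figure, not a measurement), reading all 64K leaf
directory blocks one at a time lands in the same ballpark as the observed ~20 minutes:

```java
// Rough seek-time arithmetic for a fully cold du over the 256x256 layout.
// The 12 ms average random-access time is an assumption for a 7200 RPM disk.
public class ColdScanEstimate {
    public static void main(String[] args) {
        int leafDirs = 256 * 256;       // 65,536 directory blocks to read
        double seekMs = 12.0;           // assumed random-access time per block
        double totalMinutes = leafDirs * seekMs / 1000.0 / 60.0;
        System.out.printf("%.1f minutes%n", totalMinutes); // ~13 minutes
    }
}
```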
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer cache out,
causing the next DU to hit this situation. We are seeing cases where a large job can cause
a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout would be a few
hundred directory blocks. Even when completely cold, these would only take a few hundred
seeks, which would mean single-digit seconds.
> - With only a few hundred directories, the odds of the directory blocks getting modified
are quite high, which keeps those blocks hot and much less likely to be evicted.

This message was sent by Atlassian JIRA
