Date: Tue, 21 Jul 2015 22:06:05 +0000 (UTC)
From: "Nathan Roberts (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

    [ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635908#comment-14635908 ]

Nathan Roberts commented on HDFS-8791:
--------------------------------------

bq. I'm having trouble understanding these kernel settings. http://www.gluster.org/community/documentation/index.php/Linux_Kernel_Tuning says that "When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes." So that would seem to indicate that vfs_cache_pressure does have control over dentries (i.e. the "directory blocks" which contain the list of child inodes). What settings have you used for vfs_cache_pressure so far?

I'm not a linux filesystem expert, but here's where I think the confusion is:
- inodes are cached in the ext4_inode slab
- dentries are cached in the dentry slab
- directory blocks are cached in the buffer cache
- lookups (e.g. stat /subdir1/subdir2/blk_00000) can be satisfied by the dentry+inode caches
- readdir cannot be satisfied by the dentry cache; it needs to read the directory blocks from disk, hence the buffer cache (see the sketch below)

I can somewhat protect the inode and dentry caches by setting vfs_cache_pressure to 1 (setting it to 0 can be very bad because negative dentries can fill up your entire memory, I think). I tried setting vfs_cache_pressure to 0, and it didn't seem to help the case we are seeing.

I used blktrace to capture what was happening while a node was doing this, then dumped the raw data at the offsets captured by blktrace. The data showed that the seeks were all the result of reading directory blocks, not inodes.
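To make the lookup-vs-readdir distinction above concrete, here is a minimal sketch of the two access patterns in plain Java (nothing HDFS-specific; the directory and block names are made up for illustration). The first method is a stat-style existence check the kernel can answer from the dentry/inode caches; the second is a du/checkDirs-style tree walk that has to readdir every leaf directory and therefore needs the directory blocks, served from the buffer cache when warm and costing one random seek per directory when cold.

{code:java}
// Minimal sketch, not HDFS code: contrasts the two access patterns above.
import java.io.File;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class AccessPatternSketch {

  // Lookup by full path (a stat): every path component can be resolved from
  // the dentry/inode caches, so no directory block has to be read.
  static boolean blockExists(File finalizedDir, String subdir1, String subdir2,
                             String blockName) {
    File blk = new File(new File(new File(finalizedDir, subdir1), subdir2), blockName);
    return blk.exists(); // one stat(2); cheap while dentries/inodes stay cached
  }

  // du/checkDirs-style scan: walking the tree issues readdir on every leaf
  // directory; those directory blocks come from the buffer cache when warm,
  // or cost one random seek per directory once the cache has gone cold.
  static long totalBytes(File finalizedDir) throws IOException {
    final long[] bytes = {0};
    Files.walkFileTree(finalizedDir.toPath(), new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        bytes[0] += attrs.size();
        return FileVisitResult.CONTINUE;
      }
    });
    return bytes[0];
  }
}
{code}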
bq. I think if we're going to change the on-disk layout format again, we should change the way we name meta files. Currently, we encode the genstamp in the file name, like blk_1073741915_1091.meta. This means that to look up the meta file for block 1073741915, we have to iterate through every file in the subdirectory until we find it. Instead, we could simply name the meta file as blk_107374191.meta and put the genstamp number in the meta file header. This would allow us to move to a scheme which had a very large number of blocks in each directory (perhaps a simple 1-level hashing scheme) and the dentries would always be "hot". ext4 and other modern Linux filesystems deal very effectively with large directories-- it's only ext2 and ext3 without certain options enabled that had problems.

I'm a little confused about iterating to find the meta file. Don't we already keep track of the genstamp we discovered during startup? If so, it seems like a simple stat is sufficient (see the sketch at the end of this message).

I haven't tried xfs, but that would also be a REALLY heavy hammer in our case ;)


> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Nathan Roberts
>            Priority: Critical
>
> We are seeing cases where the new directory layout basically causes the datanode's disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool, and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks.
> inodes and dentries will be cached by linux, and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks, and I'm not aware of any way to tell linux to favor buffer cache pages (even if there were, I'm not sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, which basically means the 64K directory blocks are probably randomly spread across the entire disk. A du-type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far.
> In a system I was using to diagnose this, I had 60K blocks. A du when things are hot takes less than 1 second; when things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks runs on the node. This pushes almost all of the buffer cache out, causing the next du to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks, which would mean single-digit seconds.
> - With only a few hundred directories, the odds of a directory block getting modified are quite high, which keeps those blocks hot and much less likely to be evicted.
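For reference, here is a rough sketch of the mapping the description above boils down to, plus the single-stat meta file check mentioned earlier. The mask values, subdir naming, and method bodies are illustrative rather than copied from the HDFS source (the real mapping lives in the datanode's DatanodeUtil#idToBlockDir); the point is only to show why there are 64K leaf directories and why a known genstamp makes a directory listing unnecessary.

{code:java}
// Illustrative sketch only; masks, names, and paths are simplified and are
// not copied from the HDFS source.
import java.io.File;

public class LayoutSketch {

  // Block-ID-based layout described above: two levels of 256 subdirs each,
  // i.e. 64K leaf directories under finalized/.
  static File idToBlockDir(File finalizedDir, long blockId) {
    int d1 = (int) ((blockId >> 16) & 0xFF);
    int d2 = (int) ((blockId >> 8) & 0xFF);
    return new File(finalizedDir, "subdir" + d1 + File.separator + "subdir" + d2);
  }

  // If the genstamp discovered at startup is tracked in memory, the meta file
  // can be checked with a single stat; no readdir of the leaf directory is
  // needed even though the genstamp is part of the file name.
  static boolean metaFileExists(File finalizedDir, long blockId, long genstamp) {
    File dir = idToBlockDir(finalizedDir, blockId);
    return new File(dir, "blk_" + blockId + "_" + genstamp + ".meta").exists();
  }
}
{code}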