Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <17595476.1175108305371.JavaMail.jira@brutus>
Date: Wed, 28 Mar 2007 11:58:25 -0700 (PDT)
From: "Hairong Kuang (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1170) Very high CPU usage on data nodes
 because of FSDataset.checkDataDir() on every connect
In-Reply-To: <20559471.1175057192137.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484961 ] 

Hairong Kuang commented on HADOOP-1170:
---------------------------------------

I agree that calling checkDirs on every I/O operation is too costly. A background thread that periodically performs the sanity check would be a better approach.

The patch should also clean up the code that does the error handling.
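
To make the background-thread idea concrete, here is a minimal sketch. It is not the attached 1170.patch and does not use any Hadoop classes: the class name PeriodicDirChecker, its methods, and the check period are all hypothetical, and the directory walk is only a rough stand-in for what DiskChecker/checkDirTree do. The point is that the expensive tree walk runs on a timer, while the connection-handling path only reads a cached flag.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: periodic directory sanity check off the I/O path. */
final class PeriodicDirChecker {
    private final File root;
    private final ScheduledExecutorService scheduler;
    // Result of the most recent check; read cheaply by request handlers.
    private volatile boolean healthy = true;

    PeriodicDirChecker(File root) {
        this.root = root;
        // Daemon thread so the checker never keeps the JVM alive on its own.
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dir-checker");
            t.setDaemon(true);
            return t;
        });
    }

    /** Starts the periodic check; the first run happens immediately. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(this::runCheckOnce, 0, period, unit);
    }

    /** Walks the tree once and caches the result. */
    void runCheckOnce() {
        try {
            healthy = checkTree(root);
        } catch (RuntimeException e) {
            // An uncaught exception would cancel all future scheduled runs,
            // so treat it as a failed check instead.
            healthy = false;
        }
    }

    /** Cheap call for the per-connection path: no file system access. */
    boolean isHealthy() {
        return healthy;
    }

    void stop() {
        scheduler.shutdownNow();
    }

    // Recursively verifies that every directory is readable and writable.
    private static boolean checkTree(File dir) {
        if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
            return false;
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return false; // I/O error while listing the directory
        }
        for (File child : children) {
            if (child.isDirectory() && !checkTree(child)) {
                return false;
            }
        }
        return true;
    }
}
```

With something like this, the server loop would call start(...) once at startup and consult isHealthy() per connection. The trade-off is that a disk failure is noticed up to one period late, so the error handling on the actual read/write path still has to cope with I/O exceptions.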

> Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1170
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1170
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.11.2
>            Reporter: Igor Bolotin
>         Attachments: 1170.patch
>
>
> While investigating performance issues in our Hadoop DFS/MapReduce cluster, I saw very high CPU usage by DataNode processes.
> The stack trace showed the following on most of the data nodes:
> "org.apache.hadoop.dfs.DataNode$DataXceiveServer@528acf6e" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
>         at java.io.UnixFileSystem.checkAccess(Native Method)
>         at java.io.File.canRead(File.java:660)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
>         at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
>         at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
>         - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>         at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
>         at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
>         at java.lang.Thread.run(Thread.java:595)
> I understand that it would take a while to check the entire data directory, as we have some 180,000 blocks/files in there. But what really bothers me is that, from what I see in the code, this check is executed for every client connection to the DataNode, which also means for every task executed in the cluster. Once I commented out the check and restarted the datanodes, performance went up and CPU usage went down to a reasonable level.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.