Subject: Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
Date: Tue, 27 Mar 2007 19:39:00 -0700
From: "Igor Bolotin"
Reply-To: hadoop-dev@lucene.apache.org

While investigating performance issues in our Hadoop DFS/MapReduce cluster, I saw very high CPU usage by the DataNode processes. A stack trace showed the following on most of the data nodes:

"org.apache.hadoop.dfs.DataNode$DataXceiveServer@528acf6e" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
    at java.io.UnixFileSystem.checkAccess(Native Method)
    at java.io.File.canRead(File.java:660)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
    at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
    - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
    at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
    at java.lang.Thread.run(Thread.java:595)

I understand that it takes a while to check the entire data directory, as we have some 180,000 blocks/files in there. What really bothers me is that, from the code, this check appears to be executed for every client connection to the DataNode, which also means for every task executed in the cluster. Once I commented out the check and restarted the datanodes, performance went up and CPU usage dropped to a reasonable level.

Now the question is: am I missing something here, or should this check really be removed?

Best regards,
Igor Bolotin
www.collarity.com
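
P.S. If removing the check outright turns out to be too aggressive, one middle ground might be to rate-limit it so the directory walk runs at most once per interval rather than once per connection. The following is only a minimal sketch of that idea, not Hadoop code: the class ThrottledCheck, its maybeRun method, and the interval parameter are all hypothetical names introduced here for illustration, and the check is modelled as a plain Runnable even though the real disk check may throw a checked exception.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Hadoop): runs an expensive check
 * at most once per minIntervalMillis, however often it is requested.
 */
final class ThrottledCheck {
    private static final long NEVER = Long.MIN_VALUE;

    private final Runnable check;
    private final long minIntervalMillis;
    // Timestamp of the last run, or NEVER if the check has not run yet.
    private final AtomicLong lastRun = new AtomicLong(NEVER);

    ThrottledCheck(Runnable check, long minIntervalMillis) {
        this.check = check;
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Runs the check if it has never run or if at least minIntervalMillis
     * have passed since the last run. Returns true if the check ran.
     */
    boolean maybeRun(long nowMillis) {
        long last = lastRun.get();
        // Test the sentinel first so the subtraction cannot overflow.
        boolean due = last == NEVER || nowMillis - last >= minIntervalMillis;
        // compareAndSet lets only one of several concurrent callers win.
        if (due && lastRun.compareAndSet(last, nowMillis)) {
            check.run();
            return true;
        }
        return false;
    }
}
```

In the accept loop, the direct call to the check would then become something like throttle.maybeRun(System.currentTimeMillis()), with throttle wrapping the existing directory check. Whether a time-based limit is the right trade-off, as opposed to checking only after an I/O error, is a separate question for the list.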