Subject: Re: Neverending DataNode.clienttrace logging after starting Hbase
From: Stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Wed, 3 Mar 2010 22:30:41 -0800

This looks like the product of the addition below:

------------------------------------------------------------------------
r687868 | cdouglas | 2008-08-21 14:27:31 -0700 (Thu, 21 Aug 2008) | 3 lines

HADOOP-3062. Add metrics to DataNode and TaskTracker to record network
traffic for HDFS reads/writes and MR shuffling.

Is your problem that the sheer volume of these traces is overwhelming?
If so, we should make a patch for 0.20 Hadoop so the above tracing can
be turned off when it is not wanted.  (As a stopgap, there is a rough
log4j sketch below your quoted message.)

St.Ack

On Wed, Mar 3, 2010 at 2:22 PM, Rod Cope wrote:
> We have a problem on our Hadoop cluster where a random subset of our
> datanode logs are filling up with GB's of clienttrace messages after
> starting HBase.
> Without running any MR jobs or using HBase/HDFS, we've let it go for
> 15+ hours without seeing any change in logging behavior.  The datanode
> logs look like this:
>
> ...
> 2010-03-03 09:50:16,738 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47445, bytes: 23710, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_3353532406500894348_402571, duration: 93000
> 2010-03-03 09:50:16,739 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47446, bytes: 226458, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_1719430407401031560_402554, duration: 256000
> 2010-03-03 09:50:16,743 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47449, bytes: 5073, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4376123158716665125_402579, duration: 106000
> 2010-03-03 09:50:16,743 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47450, bytes: 7974, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4354660273604489760_402576, duration: 118000
> 2010-03-03 09:50:16,744 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47451, bytes: 23710, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_3353532406500894348_402571, duration: 88000
> 2010-03-03 09:50:16,745 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47452, bytes: 226458, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_1719430407401031560_402554, duration: 253000
> 2010-03-03 09:50:16,749 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47455, bytes: 5073, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4376123158716665125_402579, duration: 81000
> 2010-03-03 09:50:16,749 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47456, bytes: 7974, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4354660273604489760_402576, duration: 76000
> 2010-03-03 09:50:16,750 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47457, bytes: 23710, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_3353532406500894348_402571, duration: 91000
> 2010-03-03 09:50:16,750 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47458, bytes: 226458, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_1719430407401031560_402554, duration: 237000
> ...
>
> The logs (and top and tcptrack) show bursts of activity for perhaps 15
> seconds at a time, then it stops for 30 seconds or so and starts all
> over again.  Frequently, there will be a line or two like this after
> each burst:
> 2010-03-03 09:52:35,255 INFO
> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> succeeded for blk_8884930029593424313_289042
>
> When we restart the cluster, the problem usually moves to a different
> random subset of nodes.  This has happened at least 5 times and there's
> no way we've been able to predict which machine(s) will have this weird
> logging behavior.
>
> I've noticed that the cliID in the logs is identical across the
> machines exhibiting this behavior.  Is there a way to tell which client
> that is?  As I said, we don't have any MR jobs running.
>
> We're on the latest CDH test release (February 9, 0.20.1+169.56), which
> is pretty close to Hadoop 0.20.2.  The data in HDFS is almost entirely
> HBase (0.20.3), but there are a few other blocks of data.  The problems
> just started in the last few days.  We've recently gone through some
> very heavy data loading into HBase, but hadn't seen this problem in the
> past when we've done the same thing.
>
> The data nodes and regionservers stay up without any errors and I can
> run counts from the shell and Hadoop jobs that read every row of my
> largest table without issues, but the burst of logging activity is
> accompanied by a spike of CPU, network, and disk that impacts
> performance and I'm afraid it will cause timeouts or other problems.
> We're also getting datanode log files that run into several GB's a day.
>
> Any ideas?
>
> Thanks,
> Rod
>
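
Here is a rough sketch of the kind of log4j override I mean -- untested,
and the exact location of log4j.properties depends on your install, but
the logger name is taken straight from the trace lines you pasted:

    # In the Hadoop conf directory's log4j.properties on each datanode.
    # Raising this one logger above INFO silences the per-read/per-write
    # clienttrace lines without touching the rest of the DataNode logging.
    log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN

The datanodes would need a restart to pick up the change.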