Subject: Re: Neverending DataNode.clienttrace logging after starting Hbase
From: Stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Wed, 3 Mar 2010 22:30:41 -0800

This looks like the product of the addition below:

------------------------------------------------------------------------
r687868 | cdouglas | 2008-08-21 14:27:31 -0700 (Thu, 21 Aug 2008) | 3 lines

HADOOP-3062. Add metrics to DataNode and TaskTracker to record network
traffic for HDFS reads/writes and MR shuffling.

Is your problem that the sheer volume of these traces is overwhelming?
If so, we should make a patch for 0.20 Hadoop so the above tracing can
be turned off when it is not wanted.  (As a stopgap, there is a rough
log4j sketch below your quoted message.)

St.Ack

On Wed, Mar 3, 2010 at 2:22 PM, Rod Cope wrote:
> We have a problem on our Hadoop cluster where a random subset of our
> datanode logs are filling up with GB's of clienttrace messages after
> starting HBase.
> Without running any MR jobs or using HBase/HDFS, we've let it go for
> 15+ hours without seeing any change in logging behavior.  The datanode
> logs look like this:
>
> ...
> 2010-03-03 09:50:16,738 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47445, bytes: 23710, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_3353532406500894348_402571, duration: 93000
> 2010-03-03 09:50:16,739 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47446, bytes: 226458, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_1719430407401031560_402554, duration: 256000
> 2010-03-03 09:50:16,743 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47449, bytes: 5073, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4376123158716665125_402579, duration: 106000
> 2010-03-03 09:50:16,743 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47450, bytes: 7974, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4354660273604489760_402576, duration: 118000
> 2010-03-03 09:50:16,744 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47451, bytes: 23710, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_3353532406500894348_402571, duration: 88000
> 2010-03-03 09:50:16,745 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47452, bytes: 226458, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_1719430407401031560_402554, duration: 253000
> 2010-03-03 09:50:16,749 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47455, bytes: 5073, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4376123158716665125_402579, duration: 81000
> 2010-03-03 09:50:16,749 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47456, bytes: 7974, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_-4354660273604489760_402576, duration: 76000
> 2010-03-03 09:50:16,750 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47457, bytes: 23710, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_3353532406500894348_402571, duration: 91000
> 2010-03-03 09:50:16,750 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.60.106:50010, dest: /192.168.60.106:47458, bytes: 226458, op:
> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
> blk_1719430407401031560_402554, duration: 237000
> ...
>
> The logs (and top and tcptrack) show bursts of activity for perhaps 15
> seconds at a time, then it stops for 30 seconds or so and starts all
> over again.  Frequently, there will be a line or two like this after
> each burst:
> 2010-03-03 09:52:35,255 INFO
> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> succeeded for blk_8884930029593424313_289042
>
> When we restart the cluster, the problem usually moves to a different
> random subset of nodes.  This has happened at least 5 times and there's
> no way we've been able to predict which machine(s) will have this weird
> logging behavior.
>
> I've noticed that the cliID in the logs is identical across the
> machines exhibiting this behavior.  Is there a way to tell which client
> that is?  As I said, we don't have any MR jobs running.
>
> We're on the latest CDH test release (February 9, 0.20.1+169.56), which
> is pretty close to Hadoop 0.20.2.  The data in HDFS is almost entirely
> HBase (0.20.3), but there are a few other blocks of data.  The problems
> just started in the last few days.  We've recently gone through some
> very heavy data loading into HBase, but hadn't seen this problem in the
> past when we've done the same thing.
>
> The data nodes and regionservers stay up without any errors and I can
> run counts from the shell and Hadoop jobs that read every row of my
> largest table without issues, but the burst of logging activity is
> accompanied by a spike of CPU, network, and disk that impacts
> performance and I'm afraid it will cause timeouts or other problems.
> We're also getting datanode log files that run into several GB's a day.
>
> Any ideas?
>
> Thanks,
> Rod
>
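
Here is a rough sketch of the kind of log4j override I mean -- untested,
and the exact location of log4j.properties depends on your install, but
the logger name is taken straight from the trace lines you pasted:

    # In the Hadoop conf directory's log4j.properties on each datanode.
    # Raising this one logger above INFO silences the per-read/per-write
    # clienttrace lines without touching the rest of the DataNode logging.
    log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN

The datanodes would need a restart to pick up the change.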