hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Neverending DataNode.clienttrace logging after starting Hbase
Date Thu, 04 Mar 2010 17:33:01 GMT
On Thu, Mar 4, 2010 at 6:39 AM, Rod Cope <rod.cope@openlogic.com> wrote:
> St.Ack,
>
> I could turn off the logging through log4j.properties,


As part of our upcoming move to Hadoop 0.20.2, we should include a note on
how to do this, and recommend that folks indeed do it, since it's so
chatty.
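
For example, adding something along these lines to the datanode's
log4j.properties should quiet it (just a sketch, but the logger name is the
one that shows up in the trace entries quoted below):

  # Quiet the per-read/per-write clienttrace lines added by HADOOP-3062.
  # Raising the level to WARN suppresses the INFO-level trace entries.
  log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN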

> but I'm more
> concerned about the fact that the logging never stops or slows down even
> when the system has been idle for 20+ hours.
>

Even on a quiescent system, a few threads keep running checks inside hbase.
For example, in the master, a thread runs every minute checking the -ROOT-
and .META. tables.  At a minimum that's a scan of each of those tables, and
those scans will show up as clienttrace reads.

Could you try correlating the request counts in the UI with the logging in the datanode?

> Is there a reason why HDFS would be so active when no nodes have been added,
> removed, or changed and "hadoop fsck /" has been showing a perfect report
> since the cluster was started?  Does the fact that restarting the cluster
> causes the logging behavior to move to a different random subset of nodes
> provide any clues?
>

Probably because a different regionserver is now hosting .META., and its
closest replica is now on a different datanode?  Could that be what's going
on?

> Also, this behavior only started recently after doing some heavy HBase
> loads.  And if I stop HBase, this HDFS activity also stops - yet there's no
> strange activity or any exceptions in the regionserver logs.
>

I'd think you'd have seen these logs as soon as you installed an HDFS
release that included HADOOP-3062.


> Are there any other logs I should look in or config I should check?  I'd
> also be happy to shut down the cluster, copy the /hbase directory and
> restart, or similar if you think that might help.
>

Try correlating the UI requests with the datanode logging as suggested
above.  I'd start there.
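
Something like this little script (a rough, untested sketch; it assumes the
clienttrace format quoted below, with each entry on a single line in the
actual log file) would bucket the trace lines per minute, op and cliID so
the totals can be lined up against the request figures in the UI:

  #!/usr/bin/env python
  # Hypothetical helper: count DataNode clienttrace lines per minute, op and
  # cliID so the totals can be compared against the HBase UI request counts.
  import re
  import sys
  from collections import defaultdict

  # Matches entries like the clienttrace lines quoted further down in this mail.
  LINE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*DataNode\.clienttrace:'
                    r'.*op: (\w+), cliID: ([^,]+),')

  counts = defaultdict(int)       # (minute, op, cliID) -> number of trace lines
  with open(sys.argv[1]) as log:  # path to a datanode log file
      for line in log:
          m = LINE.search(line)
          if m:
              counts[m.groups()] += 1

  for (minute, op, cli_id), n in sorted(counts.items()):
      print("%s  %-10s %-25s %d" % (minute, op, cli_id, n))

Even eyeballing those per-minute totals against the UI's request counter
should show whether the two move together.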

Yours,
St.Ack

> Thanks,
> Rod
>
>
> On Wednesday, March 3, 2010 at 11:30 PM, "Stack" <stack@duboce.net>
> wrote:
>
>> This looks like the product of the addition below:
>>
>> ------------------------------------------------------------------------
>> r687868 | cdouglas | 2008-08-21 14:27:31 -0700 (Thu, 21 Aug 2008) | 3 lines
>>
>> HADOOP-3062. Add metrics to DataNode and TaskTracker to record network
>> traffic for HDFS reads/writes and MR shuffling.
>>
>>
>> Is your problem that the quantity of the traces is overwhelming?  If so,
>> we should make a patch for 0.20 Hadoop to turn off the above tracing when
>> it's not wanted.
>>
>> St.Ack
>>
>>
>>
>> On Wed, Mar 3, 2010 at 2:22 PM, Rod Cope <rod.cope@openlogic.com> wrote:
>>> We have a problem on our Hadoop cluster where a random subset of our
>>> datanode logs are filling up with GBs of clienttrace messages after
>>> starting HBase.  The logs are fine, fsck shows a perfect report, and all is
>>> well until HBase is started.  Without running any MR jobs or using
>>> HBase/HDFS, we've let it go for 15+ hours without seeing any change in
>>> logging behavior.  The datanode logs look like this:
>>>
>>> ...
>>> 2010-03-03 09:50:16,738 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47445, bytes: 23710, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_3353532406500894348_402571, duration: 93000
>>> 2010-03-03 09:50:16,739 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47446, bytes: 226458, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_1719430407401031560_402554, duration: 256000
>>> 2010-03-03 09:50:16,743 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47449, bytes: 5073, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_-4376123158716665125_402579, duration: 106000
>>> 2010-03-03 09:50:16,743 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47450, bytes: 7974, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_-4354660273604489760_402576, duration: 118000
>>> 2010-03-03 09:50:16,744 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47451, bytes: 23710, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_3353532406500894348_402571, duration: 88000
>>> 2010-03-03 09:50:16,745 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47452, bytes: 226458, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_1719430407401031560_402554, duration: 253000
>>> 2010-03-03 09:50:16,749 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47455, bytes: 5073, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_-4376123158716665125_402579, duration: 81000
>>> 2010-03-03 09:50:16,749 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47456, bytes: 7974, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_-4354660273604489760_402576, duration: 76000
>>> 2010-03-03 09:50:16,750 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47457, bytes: 23710, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_3353532406500894348_402571, duration: 91000
>>> 2010-03-03 09:50:16,750 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /192.168.60.106:50010, dest: /192.168.60.106:47458, bytes: 226458, op:
>>> HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
>>> DS-35570150-192.168.60.106-50010-1267000204488, blockid:
>>> blk_1719430407401031560_402554, duration: 237000
>>> ...
>>>
>>> The logs (and top and tcptrack) show bursts of activity for perhaps 15
>>> seconds at a time, then it stops for 30 seconds or so and starts all over
>>> again.  Frequently, there will be a line or two like this after each burst:
>>> 2010-03-03 09:52:35,255 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
>>> succeeded for blk_8884930029593424313_289042
>>>
>>> When we restart the cluster, the problem usually moves to a different random
>>> subset of nodes.  This has happened at least 5 times and there's no way
>>> we've been able to predict which machine(s) will have this weird logging
>>> behavior.
>>>
>>> I've noticed that the cliID in the logs is identical across the machines
>>> exhibiting this behavior.  Is there a way to tell which client that is?  As
>>> I said, we don't have any MR jobs running.
>>>
>>> We're on the latest CDH test release (February 9, 0.20.1+169.56), which is
>>> pretty close to Hadoop 0.20.2.  The data in HDFS is almost entirely HBase
>>> (0.20.3), but there are a few other blocks of data.  The problems just
>>> started in the last few days.  We've recently gone through some very heavy
>>> data loading into HBase, but hadn't seen this problem in the past when we've
>>> done the same thing.
>>>
>>> The data nodes and regionservers stay up without any errors and I can run
>>> counts from the shell and Hadoop jobs that read every row of my largest
>>> table without issues, but the burst of logging activity is accompanied by a
>>> spike of CPU, network, and disk that impacts performance and I'm afraid it
>>> will cause timeouts or other problems.  We're also getting datanode log
>>> files that run into several GBs a day.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>> Rod
>>>
