hbase-user mailing list archives

From: Stack <st...@duboce.net>
Subject: Re: hundreds of reads of a metafile
Date: Thu, 08 Apr 2010 03:52:32 GMT
On Wed, Apr 7, 2010 at 7:49 PM, Raghu Angadi <rangadi@apache.org> wrote:
> We are working with a small HBase cluster (5 nodes) with fairly beefy nodes.
> While looking at why all the regionservers died at one time, we noticed that
> these servers read some files 100s of times a second. This may not be the
> cause of the error... but do you think this is odd?
>
Check the end of the regionserver log.  It should say why the RegionServer
went away.  The usual reason is a long GC pause, one that is longer than the
zk session timeout.
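
For reference, a minimal hbase-site.xml sketch of the setting involved -- the
value shown is only illustrative, not a recommendation:

---
<!-- If a RegionServer stalls (e.g. in a long GC pause) for longer than this,
     its ZooKeeper session expires and the server is declared dead. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
---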

> HBase version : 0.20.1. The cluster was handling mainly write traffic.

Can you run a more recent hbase, Raghu?  Lots of fixes since 0.20.1.

> Note that in the datanode log, there are a lot of reads of these files.
>
> One of RS logs:
>  ---
> 2010-04-06 21:51:33,923 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
> campaign,4522\x234\x23201003,1268865840941
> 2010-04-06 21:51:34,211 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> region campaign,4522\x234\x23201003,1268865840941/407724784 available;
> sequence id is 1607026498
> 2010-04-06 21:51:43,327 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_8972126557191254374_1090962 from any node:
>  java.io.IOException: No live nodes contain current block
> 2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-5586169098563059270_1078171 from any node:
>  java.io.IOException: No live nodes contain current block
> 2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-7610953303919156937_1089667 from any node:
>  java.io.IOException: No live nodes contain current block
> [...]
> ----

Are you accessing from mapreduce?  If so, does your hadoop have HDFS-127?

Then there are the usual suspects: the xceivers count -- up it to 2k or
so -- and the ulimit, which should be much greater than the default 1024.
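
For concreteness, a sketch of what those two knobs usually mean on a cluster
of this vintage -- the values are illustrative, and the (deliberately
misspelled) property name is the one 0.20-era Hadoop reads:

---
<!-- hdfs-site.xml on every datanode: raise the xceiver thread ceiling -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>

# /etc/security/limits.conf, assuming the hadoop/hbase daemons run as "hadoop"
hadoop  -  nofile  32768
---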

>
> A portion of a grep for one of the blocks mentioned above in the datanode log:
> ----
> 39725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020,
> blockid: blk_8972126557191254374_1090962, duration: 97000
> 2010-04-06 21:51:43,307 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.0.72:50010, dest: /10.10.0.72:43699, bytes: 6976, op: HDFS_READ,
> cliID: DFSClient_-1439725703, offset: 0, srvID:
> DS-977430382-10.10.0.72-50010-1266601998020, blockid:
> blk_8972126557191254374_1090962, duration: 76000
> 2010-04-06 21:51:43,310 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.0.72:50010, dest: /10.10.0.72:45123, bytes: 6976, op: HDFS_READ,
> cliID: DFSClient_-1439725703, offset: 0, srvID:
> DS-977430382-10.10.0.72-50010-1266601998020, blockid:
> blk_8972126557191254374_1090962, duration: 93000
> 2010-04-06 21:51:43,314 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.0.72:50010, dest: /10.10.0.72:41891, bytes: 6976, op: HDFS_READ,
> cliID: DFSClient_-1439725703, offset: 0, srvID:
> DS-977430382-10.10.0.72-50010-1266601998020, blockid:
> blk_8972126557191254374_1090962, duration: 267000
> 2010-04-06 21:51:43,318 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.0.72:50010, dest: /10.10.0.72:46412, bytes: 6976, op: HDFS_READ,
> cliID: DFSClient_-1439725703, offset: 0, srvID:
> DS-977430382-10.10.0.72-50010-1266601998020, blockid:
> blk_8972126557191254374_1090962, duration: 91000
> 2010-04-06 21:51:46,330 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.0.72:50010, dest: /10.10.0.72:40657, bytes: 6976, op: HDFS_READ,
> cliID: DFSClient_-1439725703, offset: 0, srvID:
> DS-977430382-10.10.0.72-50010-1266601998020, blockid:
> blk_8972126557191254374_1090962, duration: 85000
> ------
>
> There are thousands of repeated reads of many small files like this.
>
> --- From the NN log, this block was created
> for /hbase/.META./1028785192/info/1728561479703335912
> 2010-04-06 21:51:20,906 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock: /hbase/.META./1028785192/info/1728561479703335912.
> blk_8972126557191254374_1090962
> ----
>
> Btw, we had the replication factor set to 1 for this file by mistake.
>

So, if there is any error getting a block, there is no recourse.  Were there
concurrent processes sucking i/o from HDFS running at the same time?
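
If it helps, replication can be raised after the fact with the stock HDFS
shell; the path and factor below are just an example:

---
hadoop fs -setrep -R 3 /hbase
---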

When writing, clients need to figure out where to write.  They do this by
doing a lookup in .META. and then caching the info.  If clients are
short-lived, you get lots of .META. hits.
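
To make the short-lived-client point concrete, here is a rough sketch against
the 0.20-era Java client.  The column family, qualifier, and row keys are made
up (only the "campaign" table name comes from your logs); it illustrates the
caching behaviour, not your actual writer:

---
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaLookupSketch {
  public static void main(String[] args) throws Exception {
    // Short-lived clients: each new HBaseConfiguration gets its own
    // connection with an empty region-location cache, so every write
    // goes back to .META. to find the region for its row.
    for (int i = 0; i < 1000; i++) {
      HTable shortLived = new HTable(new HBaseConfiguration(), "campaign");
      Put p = new Put(Bytes.toBytes("row-" + i));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      shortLived.put(p);
    }

    // Long-lived client: one configuration and one HTable, so region
    // locations read from .META. are cached and reused across writes.
    HTable longLived = new HTable(new HBaseConfiguration(), "campaign");
    for (int i = 0; i < 1000; i++) {
      Put p = new Put(Bytes.toBytes("row-" + i));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      longLived.put(p);
    }
  }
}
---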

And as Ryan says, what do the caching stats look like for the .META.
region?  (See which server it was hosted on and check its logs -- we dump
cache metrics every minute or so.)

St.Ack
