From: Stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Fri, 9 Apr 2010 11:41:26 -0700 (PDT)
Subject: Re: hundreds of reads of a metafile

On Thu, Apr 8, 2010 at 2:26 PM, Raghu Angadi wrote:
> Regionserver log on node 72 is at: http://bit.ly/cd9acU (160K gzipped). To
> give a scale of reads, the local datanode had 3.5K reads spread across about
> 20 blocks. Pretty much all the reads were from the same DFSClient ID. Will
> watch if this happens again.
>

Thanks for digging in, Raghu.

Looking in the logs, there is a lot of churn -- other regionservers shedding
regions, restarting? -- and the write load would probably do better if given
more RAM. You keep hitting the ceiling where we put down a gate blocking
writes until the flushes finish.

What time interval are we talking about for the 3.5k reads across 20 blocks?
Were the 20 blocks under the ${HBASE_ROOTDIR}/.META. directory? This
regionserver was carrying the .META. region, it looks like, so it's going to
be popular. I'd think the cache should be running interference on the i/o,
but maybe it's not doing a good job of it. The write load/churn might be
blowing the cache. Yeah, log at DEBUG and we'll get a better idea. You're
doing a big upload? The cache is a config where you set how much of the heap
to allocate. Default is 0.2 IIRC.
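For example, you set it in hbase-site.xml (the 0.4 below is just an
illustration -- the value is the fraction of the regionserver heap given over
to the block cache):

```xml
<!-- hbase-site.xml: fraction of the regionserver heap used for the
     HFile block cache. Default is 0.2, i.e. 20% of heap. -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
```

With your 6GB heap, 0.4 would put roughly 2.4GB under the cache.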
St.Ack

> We are increasing memory from 3GB to 6GB. Any pointers about how to set the
> size of the block cache will be helpful. Will enable DEBUG for LruBlockCache.
>
> Raghu.
>
> On Thu, Apr 8, 2010 at 12:46 AM, Stack wrote:
>
>> On Thu, Apr 8, 2010 at 12:20 AM, Raghu Angadi wrote:
>> > Thanks Stack. Will move to 20.3 or 20 trunk very soon. More responses
>> > inline below.
>> >
>> Do. A 0.20.4 should be around in the next week or so, which will be better
>> still, FYI.
>>
>> > On Wed, Apr 7, 2010 at 8:52 PM, Stack wrote:
>> >
>> >> On Wed, Apr 7, 2010 at 7:49 PM, Raghu Angadi wrote:
>> >> > We are working with a small HBase cluster (5 nodes) with fairly beefy
>> >> > nodes. While looking at why all the regionservers died at one time, we
>> >> > noticed that these servers read some files 100s of times a second. This
>> >> > may not be the cause of the error... but do you think this is odd?
>> >> >
>> >> Check the end of the regionserver log. It should say why the RegionServer
>> >> went away. The usual reason is a long GC pause, one that is longer than
>> >> the zk session timeout.
>> >
>> > This seems to be the case... There were CMS GC failures (promotion failed,
>> > Full GC etc). There were 4-5 pauses of about 4-10 seconds over a minute or
>> > so. Is that enough to kill a ZK session? We are increasing the memory and
>> > will go through the tuning tips on the wiki.
>> >
>> The ZK session in your 0.20.1 is probably 40 seconds IIRC, but yeah, this
>> is common enough until a bit of tuning is done. If you update to
>> 0.20.3 at least, the zk session is 60 seconds, but you should try and
>> avoid the promotion failures if you can.
>>
>> > There are various other errors in the log over a couple of hours of the
>> > RS run. Will post a link to the full log.
>> >
>> > --- failure on RS-72 ---
>> > 2010-04-06 22:11:07,668 WARN org.apache.zookeeper.ClientCnxn: Exception
>> > closing session 0x127d58da4e70002 to sun.nio.ch.SelectionKeyImpl@426295eb
>> > java.io.IOException: TIMED OUT
>> >   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> > 2010-04-06 22:11:07,668 WARN org.apache.zookeeper.ClientCnxn: Exception
>> > closing session 0x27d58da6de0088 to sun.nio.ch.SelectionKeyImpl@283f4633
>> > java.io.IOException: TIMED OUT
>> >   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> > 2010-04-06 22:11:07,672 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
>> > handler 27 on 60020, call put([B@20a192c7,
>> > [Lorg.apache.hadoop.hbase.client.Put;@4fab578d) from 10.10.0.72:60211:
>> > error: java.io.IOException: Server not running, aborting
>> > java.io.IOException: Server not running, aborting
>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2345)
>> > ------------
>>
>> These look like the zk session timeout sequence.
>>
>> > --- failure on RS-73 after a few minutes ---
>> >
>> > 2010-04-06 22:21:41,867 INFO
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>> > -4957903368956265878 lease expired
>> > 2010-04-06 22:21:47,806 WARN org.apache.zookeeper.ClientCnxn: Exception
>> > closing session 0x127d58da4e7002a to sun.nio.ch.SelectionKeyImpl@15ef1241
>> > java.io.IOException: TIMED OUT
>> >   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> > 2010-04-06 22:21:47,806 ERROR
>> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > java.lang.OutOfMemoryError: Java heap space
>> >   at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
>> >   at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
>> > -------
>> >
>>
>> This is the zk session timeout and an OOME. The GC couldn't succeed.
>>
>> How much memory are you giving these puppies?
>> CMS is kinda sloppy, so you
>> need to give it a bit more space to work in.
>>
>> > [...]
>> >
>> >> > 2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> >> > obtain block blk_-7610953303919156937_1089667 from any node:
>> >> > java.io.IOException: No live nodes contain current block
>> >> > [...]
>> >> > ----
>> >>
>> >> Are you accessing from mapreduce? If so, does your hadoop have hdfs-127?
>> >>
>> >> Then there are the usual suspects. Xceivers count -- up it to 2k or
>> >> so -- and ulimit should be much greater than the default 1024.
>> >
>> > Yes. Most of the traffic now is puts from reducers. I think HDFS is a
>> > recent Cloudera release. I will check. Most likely it won't have hdfs-127.
>> >
>> My guess is that it does.. but yeah, check (You should remember that
>> one -- smile)
>>
>> > Yup.. we hit the Xceivers limit very early. The limit is 2k and the fd
>> > limit is also high.
>> >
>> > [...]
>> >
>> >> > There are thousands of repeated reads of many small files like this.
>> >> >
>> >> > --- From NN log, this block was created
>> >> > for /hbase/.META./1028785192/info/1728561479703335912
>> >> > 2010-04-06 21:51:20,906 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
>> >> > NameSystem.allocateBlock: /hbase/.META./1028785192/info/1728561479703335912.
>> >> > blk_8972126557191254374_1090962
>> >> > ----
>> >> >
>> >> > Btw, we had single replication set for this file by mistake.
>> >> >
>> >> So, if there is any error getting a block, there is no recourse. Were
>> >> there concurrent processes sucking i/o from HDFS running at the same time?
>> >
>> >> Writing, clients need to figure out where to write. They'll do this by
>> >> doing a lookup in .META. They'll then cache the info. If clients are
>> >> short-lived, then lots of .META. hits.
>> >
>> > The client here is the HBase client (in our case the reducers)?
>> >
>> Your reducers run for a while?
>> So clients have a chance to exploit the
>> cache of meta data info?
>>
>> >
>> >> And as Ryan says, what do the caching stats look like for the .META.
>> >> region? (See the server it was hosted on and check its logs -- we dump
>> >> cache metrics every minute or so).
>> >>
>> >
>> > What should I look for in the log for these stats? I didn't notice stats. I
>> > will send a link to the full log.
>> >
>> It looks like this in the regionserver log: 2010-04-08 00:43:51,822 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=6.6769104MB (7001248), Free=807.12305MB (846329888), Max=813.8MB
>> (853331136), Counts: Blocks=0, Access=202589, Hit=0, Miss=202589,
>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
>> Evicted/Run=NaN
>>
>> You need to enable DEBUG (I'd point you at the FAQ for how to do this,
>> but you know how).
>>
>> St.Ack
>>
>> > Raghu.
>> >
>> >
>> >> St.Ack
>> >>
>> >
>> >
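To enable the DEBUG logging mentioned above, a line like the following in
conf/log4j.properties should do it (logger name taken from the class in the
Cache Stats line; widen the scope, e.g. to org.apache.hadoop.hbase, if you
want DEBUG from all of HBase):

```
# conf/log4j.properties: turn on DEBUG for the block cache so the
# "Cache Stats" lines get logged every minute or so
log4j.logger.org.apache.hadoop.hbase.io.hfile.LruBlockCache=DEBUG
```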