hbase-user mailing list archives

From "Geoff Hendrey" <ghend...@decarta.com>
Subject RE: scanner deadlock?
Date Tue, 13 Sep 2011 16:49:21 GMT
Thanks Stack - 

Answers to all your questions below. My current working theory is that
too many sockets are in CLOSE_WAIT state (leading to
ClosedChannelException?). We're going to try to adjust some OS
parameters.
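Before touching any sysctls, a first check I'm planning (a sketch, Linux-only; the pgrep pattern is a guess at how to find the regionserver pid) is whether the regionserver process is nearing its open-file limit, since CLOSE_WAIT sockets keep holding a file descriptor until the process itself closes them:

```shell
# Find the regionserver pid (hypothetical match pattern).
RS_PID=$(pgrep -f HRegionServer | head -1)

# Soft limit on open files vs. descriptors actually in use.
echo "fd limit:   $(awk '/Max open files/ {print $4}' /proc/$RS_PID/limits)"
echo "fds in use: $(ls /proc/$RS_PID/fd | wc -l)"
```

If usage is close to the limit, OS-level TCP tuning alone won't reclaim the CLOSE_WAIT sockets; the process has to close them.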

" I'm asking if regionservers are bottlenecking on a single network
resource; a particular datanode, dns?"

Gotcha. I'm gathering some tools now to collect and analyze netstat
output.
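For the netstat side, a minimal sketch of what I plan to run on each regionserver (column positions assume Linux `netstat -tan` output, where field 5 is the remote address and field 6 the state):

```shell
# Count TCP sockets by state; a spike in CLOSE_WAIT is what we're hunting.
netstat -tan | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn

# Break the CLOSE_WAIT sockets down by remote host, to see whether one
# datanode is the common peer.
netstat -tan | awk '$6 == "CLOSE_WAIT" {print $5}' \
    | cut -d: -f1 | sort | uniq -c | sort -rn
```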

" the regionserver is going slow getting data out of
hdfs.  Whats iowait like at the time of slowness?  Has it changed from
when all was running nicely?"

iowait is high (roughly 20% above CPU), but it is not increasing. I'll
try to quantify that better.
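To quantify it, one option is to sample iowait directly from /proc/stat over a fixed window (a sketch; the fields of the first `cpu` line are the standard Linux order: user, nice, system, idle, iowait, and the total here ignores irq/softirq for simplicity):

```shell
# Compute iowait as a percentage of (user+nice+system+idle+iowait)
# time over a 5-second window.
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 5
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
waitd=$(( w2 - w1 ))
echo "iowait over sample: $(( 100 * waitd / total ))%"
```

Running that in a loop during the job should show whether iowait climbs as the slowdown sets in or stays flat.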

" You talk to hbase in the reducer?   Reducers don't start writing hbase
until job is 66% complete IIRC.    Perhaps its slowing as soon as it
starts writing hbase?  Is that so?"

My statement about "running fine" applies to after the reducer has
completed its sort. We have metrics produced by the reducer that log the
results of scans and Puts, so we know that scans and Puts proceed
without issue for hours.

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
Stack
Sent: Tuesday, September 13, 2011 9:32 AM
To: user@hbase.apache.org
Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
Subject: Re: scanner deadlock?

On Tue, Sep 13, 2011 at 8:20 AM, Geoff Hendrey <ghendrey@decarta.com>
wrote:
> ...but we don't have a slow region server.

I'm asking if regionservers are bottlenecking on a single network
resource; a particular datanode, dns?

> Things hum along just fine.
> Suddenly, at roughly the same time, all the region servers begin giving
> ScannerTimeoutException and ClosedChannelException.

So, odd that it's running fine and then the cluster slows.

From the stacktraces you showed me -- and you might want to check
again and do a few stack traces to see that we are stuck trying to get
data from hdfs -- the regionserver is going slow getting data out of
hdfs.  What's iowait like at the time of slowness?  Has it changed from
when all was running nicely?


> All the region
> servers are loaded in a pretty identical way by this MR job I am
> running. And they all begin showing the same error, at the same time,
> after performing perfectly for ~40% of the MR job.
>

You talk to hbase in the reducer?   Reducers don't start writing hbase
until job is 66% complete IIRC.    Perhaps it's slowing as soon as it
starts writing hbase?  Is that so?

> We have an ops team that monitors all these systems with Nagios. They've
> reviewed dmesg, and many other low-level details which are over my head.
> In the past they've adjusted MTUs, and unbounded the network cards (we
> saw some network stack lockups in the past, etc.). I'm going to meet
> with them again, and see if we can set up some more specific
> monitoring around this job, which we can basically view as a test
> harness.
>

OK.  Hopefully these lads can help.


> That said, is there any condition that should cause HBase to get a
> ClosedChannelException, and *not* tell the zookeeper that it is
> effectively dead?


Well, it sounds like the regionserver is not dead.  It's just crawling,
so it's still 'alive'.

St.Ack
