Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of rohitkelkar@gmail.com
 designates 209.85.223.177 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALCVZZtjThMcveWGyuCKpGKT1fvvnn5EjfuM_A7p227ZqETA+A@mail.gmail.com>
References: 
 <CALCVZZtvXmKtVd-E050ZxSsrYdu7u0RgnaiDOtTsv4CSAUNsLQ@mail.gmail.com>
	<CALCVZZs9+fESBhLdyBH2jAf7OQMETXHp5bW9Pv1W50Fbab8yAg@mail.gmail.com>
	<CAPQV63WA1ok9EPDZNixsTs9TeNB_H7Y3B+3NoH37s+6L=sc_cw@mail.gmail.com>
	<CALCVZZtEc0ow7jrwBQ_o2uoJHUHmZ9FaZ25Fc8=os6ptas6O9A@mail.gmail.com>
	<CAPQV63UkL_Mt7w+EB0fwCQwHS3aCzbjS+7fqaJk5Bft9yHTU=w@mail.gmail.com>
	<CALCVZZtjThMcveWGyuCKpGKT1fvvnn5EjfuM_A7p227ZqETA+A@mail.gmail.com>
Date: Thu, 27 Feb 2014 11:05:39 -0600
Message-ID: 
 <CALCVZZt42h0i=omio6nNZwsi47v=Dz4NsLiUhnjtxnjPPQcLHg@mail.gmail.com>
Subject: Re: region server dead and datanode block movement error
From: Rohit Kelkar <rohitkelkar@gmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=047d7bfea0b6a9176904f366548e

--047d7bfea0b6a9176904f366548e
Content-Type: text/plain; charset=ISO-8859-1

Oh yes, and I forgot to add the ZK process:
ZK = 5GB

Total = 45GB
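
For anyone rechecking the numbers, the per-node budget from this thread can be tallied with a short script. This is only a sketch: the dictionary keys and the `summarize` helper are made up for illustration, and it assumes each figure is a configured maximum rather than measured usage.

```python
# Per-node memory budget in GB, using the figures quoted in this thread.
NODE_RAM_GB = 48

budget_gb = {
    "DN": 4,                  # datanode
    "RS": 6,                  # region server
    "TT": 4,                  # task tracker
    "mappers": 4 * 4,         # 4 map slots at 4GB each
    "reducers": 2 * 4,        # 2 reduce slots at 4GB each
    "other_java": 4 * 0.5,    # 4 unrelated java processes at 512MB each
    "ZK": 5,                  # zookeeper
}


def summarize(budget, node_ram_gb):
    """Return (total committed GB, GB left over on the node)."""
    total = sum(budget.values())
    return total, node_ram_gb - total


total_gb, headroom_gb = summarize(budget_gb, NODE_RAM_GB)
print(f"committed = {total_gb:g}GB, headroom = {headroom_gb:g}GB")
```

With these figures it reports 45GB committed and 3GB of headroom on a 48GB node. If the figures are JVM heap sizes (an assumption; the thread does not say), real resident memory would be somewhat higher than the sum, since each JVM also uses memory outside the heap.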


On Thu, Feb 27, 2014 at 11:01 AM, Rohit Kelkar <rohitkelkar@gmail.com> wrote:

> Hi Jean-Marc,
>
> Each node has 48GB RAM.
> To isolate and debug the RS failure issue, we have switched off all other
> tools. The only processes running are:
> - DN = 4GB
> - RS = 6GB
> - TT = 4GB
> - num mappers available on the node = 4 * 4GB = 16GB
> - num reducers available on the node = 2 * 4GB = 8GB
> - 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
>
> Total = 40GB
>
>
> On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.0.0.96:46618","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
>> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>
>> This is clearly your issue.
>>
>> As for swap: setting swappiness does not guarantee that Linux will never
>> swap. It will try not to, but if it really has to, it will.
>>
>> How many GB on your server? How many for the DN, for the RS, etc.? Any TT on
>> them? Any other tools? If TT, how many slots? How many GB per slot?
>>
>> JM
>>
>>
>> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <rohitkelkar@gmail.com>:
>>
>> > Hi Jean-Marc,
>> >
>> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with events
>> > before 13:41:00. In the log I see a few responseTooSlow warnings at
>> > 13:34:00 and 13:36:00, then no activity until 13:41:00.
>> > At 13:41:00 there is a Sleeper warning - WARN
>> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
>> > 10000000ms, this is likely due to a long garbage collecting pause and it's
>> > usually bad, see ...
>> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
>> > out, have not heard from server in 260409ms for sessionid
>> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
>> >
>> > Looking at some of the reasons you mentioned -
>> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
>> > went down, the GC times were all under 1 sec. Nothing that would account
>> > for the 260409 ms indicated above in the RS log.
>> > 2. The RS node has swappiness set to 0.
>> > 3. So I think I should investigate the possibility of network issues. Any
>> > pointers on where I could start?
>> >
>> > - R
>> >
>> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org> wrote:
>> >
>> > > Hi Rohit,
>> > >
>> > > Usually a YouAreDeadException means your RegionServer is too slow. It
>> > > gets kicked out by Master+ZK, then tries to rejoin and is informed that
>> > > it has been kicked out.
>> > >
>> > > Reasons:
>> > > - Long garbage collection;
>> > > - Swapping;
>> > > - Network issues (get disconnected, then re-connected);
>> > > - etc.
>> > >
>> > > What do you have before 2014-02-21 13:41:00,308 in the logs?
>> > >
>> > >
>> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <rohitkelkar@gmail.com>:
>> > >
>> > > > Hi, has anybody been facing similar issues?
>> > > >
>> > > > - R
>> > > >
>> > > >
>> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <rohitkelkar@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in production
>> > > > > (yes we have plans to upgrade hadoop). It's a 5 node cluster plus a 6th
>> > > > > node running just the name node and hmaster.
>> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
>> > > > > http://pastebin.com/44aFyYZV
>> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception for block
>> > > > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00,
>> > > > > followed by YouAreDeadException at the same time.
>> > > > > I grep'ed this block in the Datanode (see log here
>> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
>> > > > > receiveBlock for block blk_-6695300470410774365_837638
>> > > > > java.nio.channels.ClosedByInterruptException.
>> > > > > I have also attached the namenode logs around the block here
>> > > > > http://pastebin.com/9NE9J8s1
>> > > > >
>> > > > > Across several RS failure instances I see the following pattern - the
>> > > > > region server YouAreDeadException is always preceded by the
>> > > > > EOFException and the datanode ClosedByInterruptException.
>> > > > >
>> > > > > Is the error in the movement of the block causing the region server to
>> > > > > report a YouAreDeadException? And of course, how do I solve this?
>> > > > >
>> > > > > - R
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

--047d7bfea0b6a9176904f366548e--