hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Segel <michael_se...@hotmail.com>
Subject Re: Occasional regionserver crashes following socket errors writing to HDFS
Date Thu, 24 May 2012 12:13:22 GMT
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A8
<property>
    <name>zookeeper.session.timeout</name>
    <value>1200000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>
  </property>
The default is 60 seconds which you reduced to 20.  (Assuming this is the right parameter)

As you said you were doing a major compaction at the time. 


On May 24, 2012, at 6:15 AM, Eran Kutner wrote:

> Thanks Stack for noticing the ZooKeeper timeout, don't know how could I
> have missed that.
> 
> After analyzing this for a while it is definitely unrelated to GC. In fact
> during the last 4 days no GC operation took more than 2 seconds, and those
> that got close were all concurrent mark sweeps, so they should not be
> stopping other threads.
> 
> These are the interesting log lines:
> 2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 23706ms for sessionid
> 0x1372aa57bee0308, closing socket connection and attempting reconnect
> 2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 24638ms for sessionid
> 0x3372bf3891304bf, closing socket connection and attempting reconnect
> 2012-05-22 01:25:12,047 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server hadoop1-zk1/10.1.104.201:2181
> 2012-05-22 01:25:12,048 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to hadoop1-zk1/10.1.104.201:2181, initiating session
> 2012-05-22 01:25:12,080 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x3372bf3891304bf has expired,
> closing socket connection
> 2012-05-22 01:25:12,081 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hadoop1-s05.farm-ny.gigya.com,60020,1336990798475,
> load=(requests=4015, regions=708, usedHeap=2342, maxHeap=7983):
> regionserver:60020-0x3372bf3891304bf regionserver:60020-0x3372bf3891304bf
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> 
> This is what the zookeeper logs show at the same time:
> 2012-05-22 01:24:46,014 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x1372aa57bef6611, likely client
> has closed socket
> 2012-05-22 01:24:46,014 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
> client /10.1.104.4:57598 which had sessionid 0x1372aa57bef6611
> 2012-05-22 01:25:08,010 - ERROR [CommitProcessor:1:NIOServerCnxn@445] -
> Unexpected Exception:
> 2012-05-22 01:25:08,016 - INFO  [CommitProcessor:1:NIOServerCnxn@1435] -
> Closed socket connection for client /10.1.104.5:33945 which had sessionid
> 0x1372aa57bee0308
> 2012-05-22 01:25:12,046 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
> connection from /10.1.104.5:43070
> 2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
> session 0x3372bf3891304bf at /10.1.104.5:43070
> 2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 231702230809642175
> 2012-05-22 01:25:12,077 - INFO
> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1573] - Invalid session
> 0x3372bf3891304bf for client /10.1.104.5:43070, probably expired
> 2012-05-22 01:25:12,078 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
> client /10.1.104.5:43070 which had sessionid 0x3372bf3891304bf
> 
> I have zookeeper.session.timeout set to 20 seconds because I wanted quick
> recovery in case of a failure.
> 
> Any idea why it would not respond in 20 seconds? Seems like quite a lot of
> time.
> Don't know if it's related or not but major compaction was happening while
> this error occurred.
> 
> Thanks.
> 
> -eran
> 
> 
> 
> On Fri, May 11, 2012 at 2:36 PM, Michael Segel <michael_segel@hotmail.com>wrote:
> 
>> So I see you're looking at Eran's problem.... ;-)
>> 
>> Since you say he's fairly capable, I'm assuming when he said he had GC and
>> MSLABS set up, he did it right, so a GC pause wouldn't cause the error.
>> 
>> Bad node? possible.  It could easily be a networking/hardware issue which
>> are pain in the ass problems to track down and solve.
>> 
>> With respect to the dfs.bandwidthPerSec... yes its an HDFS setting.  As
>> you point out, its an indirect issue. However that doesn't mean it wouldn't
>> have an impact on performance.
>> 
>> OP states that this occurs under heavy writes. What happens to the writes
>> when a table is splitting?
>> 
>> 
>> On May 11, 2012, at 12:12 AM, Stack wrote:
>> 
>>> On Thu, May 10, 2012 at 6:26 AM, Michael Segel
>>> <michael_segel@hotmail.com> wrote:.
>>>> 4) google dfs.balance.bandwidthPerSec  I believe its also used by HBase
>> when they need to move regions.
>>> 
>>> Nah.  This is an hdfs setting.  HBase don't use it directly.
>>> 
>>>> Speaking of which what happens when HBase decides to move a region?
>> Does it make a copy on the new RS and then after its there, point to the
>> new RS and then remove the old region?
>>>> 
>>> 
>>> When one RS closes the region and another opens it, there is no copy
>>> to be done since the region data is in the HDFS they both share.
>>> 
>>> St.Ack
>>> 
>> 
>> 


Mime
View raw message