hbase-user mailing list archives

From kaveh minooie <ka...@plutoz.com>
Subject Re: help why do my regionservers shut themselves down?
Date Tue, 23 Apr 2013 04:47:40 GMT
thanks everyone for responding.

No, I don't have the GC logs; I don't even know how to get them. But
it seems that the regionserver did recover from that and then got into
trouble here:

2013-04-22 16:47:56,830 INFO 
org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by 
user:
java.io.InterruptedIOException: Aborting compaction of store f in region 
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. 
because user requested stop.

The part that I don't understand is what it means when it says
"compaction interrupted by user"!

And to answer your question, Ted: I am using 0.90.6 over Hadoop 1.1.1 (I
can't upgrade, since Gora so far only works with 0.90.x). And no,
everything was normal as far as I could tell. The map jobs were staggering
because, I assume, HBase had become unresponsive (the web interface
started showing exceptions, and that is how I figured out that the
regionserver was down). While I was restarting this one (through the
status command in the shell), I noticed that two more regionservers had gone
down (with an identical error: the second one, not the one about the GC
pause). Once I restarted the regionservers (using hbase-daemon.sh),
everything went back to normal. But this keeps happening, and as a
result I can't leave my jobs unsupervised.

thanks,

On 04/22/2013 07:35 PM, Ted Yu wrote:
> Kaveh:
> What version of HBase are you using ?
> Around 2013-04-22 16:47:56, did you observe anything else happening in your
> cluster ? See below:
>
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> compaction interrupted by user:
> java.io.InterruptedIOException: Aborting compaction of store f in region
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> because user requested stop.
>          at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>          at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>          at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>
> On Mon, Apr 22, 2013 at 6:46 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Kaveh,
>>
>> the response is maybe already displayed in the logs you sent ;)
>>
>> "This disconnect could have been caused by a network partition or a
>> long-running GC pause, either way it's recommended that you verify
>> your environment."
>>
>> Do you have GC logs? Have you tried anything to solve that?
>>
>> JM
>>
>> 2013/4/22 kaveh minooie <kaveh@plutoz.com>:
>>> Hi
>>>
>>> after a few mapreduce jobs my regionservers shut themselves down. this is
>>> the latest time that this has happened:
>>>
>>> 2013-04-22 16:47:21,843 INFO
>>>
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>> This client just lost it's session with ZooKeeper, trying to reconnect.
>>> 2013-04-22 16:47:21,843 FATAL
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>>> serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5
>>> 392, regions=196, usedHeap=1063, maxHeap=3966):
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
>>> from ZooKeeper, aborting
>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> KeeperErrorCode = Session expired
>>>          at
>>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>>>          at
>>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>>>          at
>>>
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
>>>          at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
>>> 2013-04-22 16:47:21,843 INFO
>>>
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>> Trying to reconnect to zookeeper.
>>> 2013-04-22 16:47:21,844 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>>> requests=1794, regions=196, stores=1561, storefiles=1585,
>>> storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10,
>>> flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032,
>>> blockCacheFree=169901776, blockCacheCount=7242,
>> blockCacheHitCount=910925,
>>> blockCacheMissCount=1558134, blockCacheEvictedCount=1344753,
>>> blockCacheHitRatio=36, blockCacheHitCachingRatio=40
>>> 2013-04-22 16:47:21,844 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
>>> regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
>>> from ZooKeeper, aborting
>>> 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread
>>> shut down
>>> 2013-04-22 16:47:21,900 WARN
>> org.apache.hadoop.hbase.regionserver.wal.HLog:
>>> Too many consecutive RollWriter requests, it's a sign of the total
>> number of
>>> live datanodes is lower than the tolerable replicas.
>>> 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection, connectString=zk1:2181 sessionTimeout=180000
>>> watcher=hconnection
>>> 2013-04-22 16:47:22,357 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions
>> to
>>> close
>>> 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket
>>> connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not
>> attempt
>>> to authenticate using SASL (unknown error)
>>> 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket
>>> connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181,
>> initiating
>>> session
>>> 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session
>>> establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181,
>>> sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
>>> 2013-04-22 16:47:22,400 INFO
>>>
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>> Reconnected successfully. This disconnect could have been caused by a
>>> network partition or a long-running GC pause, either way it's recommended
>>> that you verify your environment.
>>> 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread
>>> shut down
>>> 2013-04-22 16:47:56,830 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> compaction interrupted by user:
>>> java.io.InterruptedIOException: Aborting compaction of store f in region
>>>
>> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
>>> because user requested stop.
>>>          at
>>> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>>>          at
>>> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>>>          at
>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>>>          at
>>>
>> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>>>          at
>>>
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>>> 2013-04-22 16:47:56,830 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> aborted compaction on region
>>>
>> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
>>> after 5mins, 58sec
>>> 2013-04-22 16:47:56,830 INFO
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>> regionserver60020.compactor exiting
>>> 2013-04-22 16:47:56,832 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed
>>>
>> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
>>> 2013-04-22 16:47:57,363 INFO
>> org.apache.hadoop.hbase.regionserver.wal.HLog:
>>> regionserver60020.logSyncer exiting
>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020 closing leases
>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020 closed leases
>>> 2013-04-22 16:47:57,366 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
>>> exiting
>>> 2013-04-22 16:47:57,497 INFO
>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
>> starting;
>>> hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
>>> 2013-04-22 16:47:57,497 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
>> hook
>>> 2013-04-22 16:47:57,497 INFO
>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown
>> hook
>>> thread.
>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020.leaseChecker closing leases
>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
>>> regionserver60020.leaseChecker closed leases
>>> 2013-04-22 16:47:57,598 INFO
>>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
>> finished.
>>> I would appreciate it very much if someone could explain to me what just
>>> happened here.
>>>
>>> thanks,
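
One detail in the quoted log that may be worth checking: the reconnecting client asks for sessionTimeout=180000 but the server reports a negotiated timeout of 40000. A minimal sketch, in Python, of pulling those two numbers out of a log so they can be compared is below. The two log lines are copied from the excerpt above and re-joined onto single lines; the function names are made up for illustration. A ZooKeeper server by default caps sessions at 20 x tickTime, so a tickTime of 2000 ms would explain the 40000, but that this cluster runs with that setting is an assumption, and the excerpt does not show what the regionserver's own session negotiated.

```python
import re

# Two lines from the regionserver log quoted above, re-joined onto one line each.
LOG = """\
2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=zk1:2181 sessionTimeout=180000 watcher=hconnection
2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181, sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
"""

# Patterns match the wording of the two log messages shown above.
REQUESTED = re.compile(r"sessionTimeout=(\d+)")
NEGOTIATED = re.compile(r"negotiated timeout = (\d+)")


def session_timeouts(log_text):
    """Return (requested_ms, negotiated_ms); either is None if not found."""
    req = REQUESTED.search(log_text)
    neg = NEGOTIATED.search(log_text)
    return (
        int(req.group(1)) if req else None,
        int(neg.group(1)) if neg else None,
    )


def is_capped(requested_ms, negotiated_ms):
    """True when both values are known and the server granted less than requested."""
    if requested_ms is None or negotiated_ms is None:
        return False
    return negotiated_ms < requested_ms


requested, negotiated = session_timeouts(LOG)
print(requested, negotiated, is_capped(requested, negotiated))
```

If the negotiated value is lower than the requested one, any pause longer than the negotiated value (a long GC, for instance) is enough for the session to expire, regardless of what the client asked for.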

