hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: Expired Scanner Lease == RegionServer death ?
Date Mon, 30 Nov 2009 18:11:41 GMT
Awesome, Kudos on actually monitoring your environment! It makes these 
issues so much easier to track down when users have (at least) basic 
historical information about their enviroment such as host level 
information for simple stats like cpu/memory/disk information.

Checkout this ZK page for some other troubleshooting details that can 
help when issues such as those reported here are found:
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

Regards and good luck!

Patrick

Joost Ouwerkerk wrote:
> Ryan, you made me rethink my original hypothesis.  After looking ganglia
> history, it's clear that the nodes in question were starved of memory,
> swapping like crazy.  The expired scanner lease, the region shutting down,
> and as you noted, the Zookeeper session expiry, were not a causal chain, but
> all the result of the machine grinding to a halt from swapping.  The
> MapReduce tasks were allocated too much memory, and an apparent memory leak
> in the job we were running was causing the tasks to eat into the
> RegionServer's share of the machine's memory.  I've reduced the memory
> allocated to tasks in hadoop's "mapred.child.java.opts" to ensure that the
> HADOOP_HEAPSIZE + total maximum memory allocated to tasks + the
> HBASE_HEAPSIZE is not greater than the memory available on the machine.
> 
> On Thu, Nov 26, 2009 at 3:59 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> 
>> Hey,
>>
>> Looking at the log below, it looks like you have run into a Zookeeper
>> session expiry event.  If a RS loses it's ZK lease, it must give up,
>> since the regions it has open have probably been reassigned already,
>> and dual assignment Is Very Bad.  This is probably a GC related pause,
>> this line in particular is indicative:
>>
>>>  2009-11-26 01:53:09,867 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>> slept 72462ms, ten times longer than scheduled: 3000
>> In the log somewhere there should be a "session expired" zookeeper
>> message.  I'm not seeing it in the snippets you post.
>>
>> There are several potential problems here:
>> - You need to be using the CMS GC. This is very important.
>> - If you are hitting your cluster hard with a job and maxing out the
>> IOW, you might be starving the ZK quorum of the IO bandwidth it needs
>> to do operations.
>>
>> Can you tell us more about how you are running things?  What JVM, GC
>> config, node count, node spec, etc.
>>
>> -ryan
>>
>> On Thu, Nov 26, 2009 at 8:30 AM, Joost Ouwerkerk <joost@openplaces.org>
>> wrote:
>>> I'm experiencing a scenario whereby long-running MapReduce tasks appear
>> to
>>> be taking RegionServers down, wondering if
>>>  1) anyone else is experiencing this
>>>  2) this is expected behaviour, and if so,
>>>  3) what I can do about it.
>>>
>>> Scenario looks something like this (although I may be misinterpreting):
>>>
>>> 1.TableMapper task has not completed after about one minute
>>>
>>> 2. RegionServer reports:
>>>  2009-11-26 01:53:09,867 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>>> 637450415407903410 lease expired
>>>  2009-11-26 01:53:09,902 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>>> -2878823706426617596 lease expired
>>>  2009-11-26 01:53:09,902 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>>> -4783927291358135864 lease expired
>>>  2009-11-26 01:53:09,902 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>>> 5282836853577726911 lease expired
>>>
>>> Followed By:
>>>  2009-11-26 01:53:09,882 WARN org.apache.zookeeper.ClientCnxn: Exception
>>> closing session 0x252df4a91a0005 to sun.nio.ch.SelectionKeyImpl@49e808ca
>>>    java.io.IOException: TIMED OUT
>>>        at
>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>>>  2009-11-26 01:53:09,867 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>> slept 72462ms, ten times longer than scheduled: 3000
>>>  2009-11-26 01:53:09,918 ERROR
>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>    org.apache.hadoop.hbase.UnknownScannerException: Name:
>>> 5282836853577726911
>>>        at
>>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1925)
>>> 3. Task Fails. (normal behaviour)
>>>
>>> 4. RegionServer starts closing its regions:
>>>  2009-11-26 01:53:10,035 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>>> state: Disconnected, type: None, path: null
>>>  2009-11-26 01:53:10,104 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> MSG_CALL_SERVER_STARTUP
>>>  2009-11-26 01:53:10,187 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed crawled_pages, ...
>>>  2009-11-26 01:53:11,653 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed crawled_pages, ...
>>>  2009-11-26 01:53:14,184 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed crawled_pages, ...
>>>  2009-11-26 01:53:18,567 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed crawled_pages, ...
>>>  2009-11-26 01:53:19,147 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed crawled_pages, ...
>>>  2009-11-26 01:53:21,265 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed crawled_pages, ...
>>>  2009-11-26 01:53:45,112 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>>> Closed crawled_pages, ...
>>>
>>> 5. RegionServer shuts down:
>>>  2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
>>> server on 60020
>>>  2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server
>>> handler 1 on 60020: exiting
>>>  2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server
>>> handler 0 on 60020: exiting
>>>  2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server
>>> handler 2 on 60020: exiting
>>>  2009-11-26 01:53:12,443 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
>>>  2009-11-26 01:53:12,526 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
>>> IPC Server Responder
>>>  2009-11-26 01:53:13,169 INFO
>>> org.apache.hadoop.hbase.regionserver.LogFlusher:
>>> regionserver/192.168.1.111:60020.logFlusher exiting
>>>  2009-11-26 01:53:13,170 INFO
>>> org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
>>>  2009-11-26 01:53:13,170 INFO
>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
>>> regionserver/192.168.1.111:60020.cacheFlusher exiting
>>>  2009-11-26 01:53:13,170 INFO
>>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
>>> regionserver/192.168.1.111:60020.majorCompactionChecker exiting
>>>  2009-11-26 01:53:13,169 INFO
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>> regionserver/192.168.1.111:60020.compactor exiting
>>>  2009-11-26 01:53:13,449 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
>>>  2009-11-26 01:53:13,450 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at:
>>> 192.168.1.111:60020
>>>  2009-11-26 01:53:19,867 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
>>>  2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ZooKeeper: Closing
>>> session: 0x252df4a91a0005
>>>  2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ClientCnxn: Closing
>>> ClientCnxn for session: 0x252df4a91a0005
>>>  2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ClientCnxn:
>>> Disconnecting ClientCnxn for session: 0x252df4a91a0005
>>>  2009-11-26 01:53:19,868 INFO org.apache.zookeeper.ClientCnxn:
>> EventThread
>>> shut down
>>>  2009-11-26 01:53:19,868 INFO org.apache.zookeeper.ZooKeeper: Session:
>>> 0x252df4a91a0005 closed
>>>  2009-11-26 01:53:19,902 INFO org.apache.hadoop.hbase.Leases:
>>> regionserver/192.168.1.111:60020.leaseChecker closing leases
>>>  2009-11-26 01:53:19,903 INFO org.apache.hadoop.hbase.Leases:
>>> regionserver/192.168.1.111:60020.leaseChecker closed leases
>>>  2009-11-26 01:53:19,969 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/
>>> 192.168.1.111:60020 exiting
>>>  2009-11-26 01:53:19,992 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown
>>> thread.
>>>  2009-11-26 01:53:19,992 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread
>> complete
>>> Joost
>>>
> 

Mime
View raw message