hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HBASE-25) [hbase] Stuck regionserver?
Date Fri, 14 Mar 2008 03:45:24 GMT

     [ https://issues.apache.org/jira/browse/HBASE-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-25.

       Resolution: Invalid
    Fix Version/s: 0.1.0

On a cluster that was running lots of other heavy-duty processes concurrently, we were seeing
lots of regionservers going down because they could not connect to the master within the lease
interval.  At Jim Firby's suggestion, I added logging of how long we were actually sleeping when
we'd asked to sleep for only 3 seconds.  Last night during an upload I caught a message that said
we'd slept > 30 seconds, longer than the default sleep period (see HBASE-501).  I'm guessing this
phenomenon of threads oversleeping is what we have, up to now, been calling a 'hung server'.
Closing as invalid.  Can reopen if the added logging does NOT account for region servers failing
to check in with the master within the lease period.
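
For illustration, here is a minimal sketch of the kind of oversleep logging described above. It is not the code that went into HBase; the class name `SleepAudit`, the helper names, and the warning threshold are all hypothetical. It times a `Thread.sleep` call with `System.nanoTime()` and reports when the elapsed time far exceeds the requested period.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of HBase: measures how long a sleep really took.
public class SleepAudit {

    /** Milliseconds slept beyond the requested period (never negative). */
    static long oversleepMillis(long requestedMillis, long elapsedMillis) {
        return Math.max(0L, elapsedMillis - requestedMillis);
    }

    /** True when the oversleep exceeds the given tolerance (an assumed threshold). */
    static boolean overslept(long requestedMillis, long elapsedMillis, long toleranceMillis) {
        return oversleepMillis(requestedMillis, elapsedMillis) > toleranceMillis;
    }

    /** Sleeps for the requested period and returns the measured elapsed time in ms. */
    static long timedSleep(long requestedMillis) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(requestedMillis);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        long requested = 100L;   // short demo value; the report above asked for 3 seconds
        long tolerance = 1000L;  // assumed tolerance before warning
        long elapsed = timedSleep(requested);
        if (overslept(requested, elapsed, tolerance)) {
            System.err.println("WARN: asked to sleep " + requested
                + "ms but slept " + elapsed + "ms");
        }
    }
}
```

On a heavily loaded machine, a warning of this kind distinguishes a thread that was starved of CPU from one that is genuinely blocked.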

> [hbase] Stuck regionserver?
> ---------------------------
>                 Key: HBASE-25
>                 URL: https://issues.apache.org/jira/browse/HBASE-25
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: stack
>            Assignee: stack
>            Priority: Trivial
>             Fix For: 0.1.0
> Looking in the logs, a regionserver went down because it could not contact the master after
> 60 seconds.  Watching the logging, the HRS is repeatedly checking all 150 loaded regions over
> and over again w/ a pause of about 5 seconds between runs... then there is a suspicious 60+
> second gap with no logging, as though the regionserver had hung up on something:
> {code}
> 2007-12-03 13:14:54,178 DEBUG hbase.HRegionServer - flushing region postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635
> 2007-12-03 13:14:54,178 DEBUG hbase.HRegion - Not flushing cache for region postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635: snapshotMemcaches() determined that there was nothing to do
> 2007-12-03 13:14:54,205 DEBUG hbase.HRegionServer - flushing region postlog,img247/230/seanpaul4li.jpg,1196615889965
> 2007-12-03 13:14:54,205 DEBUG hbase.HRegion - Not flushing cache for region postlog,img247/230/seanpaul4li.jpg,1196615889965: snapshotMemcaches() determined that there was nothing to do
> 2007-12-03 13:16:04,305 FATAL hbase.HRegionServer - unable to report to master for 67467 milliseconds - aborting server
> 2007-12-03 13:16:04,455 INFO  hbase.Leases - regionserver/0:0:0:0:0:0:0:0:60020 closing
> 2007-12-03 13:16:04,455 INFO  hbase.Leases$LeaseMonitor - regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker
> {code}
> Master seems to be running fine scanning its ~700 regions.  Then you see this in the log,
> before the HRS shuts itself down:
> {code}
> 2007-12-03 13:14:31,416 INFO  hbase.Leases - HMaster.leaseChecker lease expired 153260899/153260899
> 2007-12-03 13:14:31,417 INFO  hbase.HMaster - XX.XX.XX.102:60020 lease expired
> {code}
> ... and we go on to process shutdown.
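
The silent gap described in the quoted report can also be found mechanically. The following is a rough sketch, assuming log lines begin with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp as in the excerpts above; the class `LogGapFinder` and its methods are hypothetical and not part of HBase. It returns the largest interval between consecutive timestamped lines and skips lines that do not start with a timestamp.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: finds the longest silence in a log.
public class LogGapFinder {
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final int TS_LENGTH = 23; // length of "2007-12-03 13:14:54,178"

    /** Parses the leading timestamp, or returns null for lines without one. */
    static LocalDateTime timestampOf(String line) {
        if (line.length() < TS_LENGTH) {
            return null;
        }
        try {
            return LocalDateTime.parse(line.substring(0, TS_LENGTH), TS);
        } catch (DateTimeParseException e) {
            return null; // wrapped continuation line or other non-timestamped text
        }
    }

    /** Largest gap between consecutive timestamped lines; zero if fewer than two. */
    static Duration largestGap(List<String> lines) {
        Duration largest = Duration.ZERO;
        LocalDateTime previous = null;
        for (String line : lines) {
            LocalDateTime current = timestampOf(line);
            if (current == null) {
                continue;
            }
            if (previous != null) {
                Duration gap = Duration.between(previous, current);
                if (gap.compareTo(largest) > 0) {
                    largest = gap;
                }
            }
            previous = current;
        }
        return largest;
    }

    public static void main(String[] args) {
        // Two lines taken from the regionserver log excerpt above.
        List<String> lines = Arrays.asList(
            "2007-12-03 13:14:54,205 DEBUG hbase.HRegionServer - flushing region postlog,img247/230/seanpaul4li.jpg,1196615889965",
            "2007-12-03 13:16:04,305 FATAL hbase.HRegionServer - unable to report to master for 67467 milliseconds - aborting server");
        System.out.println("largest gap: " + largestGap(lines).toMillis() + " ms");
    }
}
```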

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
