hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] Created: (HBASE-2617) Load balancer falls into pathological state if one server under average - slop; endless churn
Date Wed, 26 May 2010 23:27:39 GMT
Load balancer falls into pathological state if one server under average - slop; endless churn
---------------------------------------------------------------------------------------------

                 Key: HBASE-2617
                 URL: https://issues.apache.org/jira/browse/HBASE-2617
             Project: HBase
          Issue Type: Bug
            Reporter: stack
            Priority: Blocker
             Fix For: 0.20.5


I'm looking at a 0.20.4 cluster of 80 fast machines.  It runs fine for a while and then falls
over into crazy balancing churn (My view on logs is sporadic but have a log before me where
this is happening).  Things I see that seem to be of 0.20.4 particularly:

+ We don't reach an equilibrium or at least it takes so long, its as though it wasn't every
going to happen
+ Master log filled w/ open, close of one or two regions  usually the same ones over and over
(The Regions that are candidates to close are provided by the RS.  They are ordered by hash
of their name.  We return the top ten from this Set every time.  So, we always close the same
regions all the time even if we just opened it)
+ Often, we'll tell an RS to close a region.  It will do the job.  In 0.20.4 we made it so
if RS has any work at all for the master, that we return immediately rather than wait for
the reporting period to elaspse.   So, on these fast machines, it can be back near immediately
if it just opened another some other region, say.  It can get assigned the region it just
closed.  Seems to happen frequently enough.

For example, look at the below extract featuring a single regions life.  Its opened and closed
5 times in about 1/2 a second:

{code}
2010-05-25 11:01:05,488 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose
of GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,489 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set
as unassigned: GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,490 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region
GenericMetaStore,139757491,1274779304880 to a025.example.com,60020,1274744064673
2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN:
GenericMetaStore,139757491,1274779304880 from a025.example.com,60020,1274744064673; 1 of 1
2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.32.189:60020
2010-05-25 11:01:05,511 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274744064673,
server=10.209.32.189:60020
2010-05-25 11:01:05,548 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close
region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE:
GenericMetaStore,139757491,1274779304880 from a025.example.com,60020,1274744064673; 1 of 2
2010-05-25 11:01:05,552 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose
of GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set
as unassigned: GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,556 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region
GenericMetaStore,139757491,1274779304880 to a028.example.com,60020,1274747560769
2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN:
GenericMetaStore,139757491,1274779304880 from a028.example.com,60020,1274747560769; 1 of 1
2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.32.185:60020
2010-05-25 11:01:05,579 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747560769,
server=10.209.32.185:60020
2010-05-25 11:01:05,599 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close
region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,605 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE:
GenericMetaStore,139757491,1274779304880 from a028.example.com,60020,1274747560769; 1 of 2
2010-05-25 11:01:05,605 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose
of GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,606 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set
as unassigned: GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,607 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region
GenericMetaStore,139757491,1274779304880 to sjc1c104.example.com,60020,1274747062601
2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN:
GenericMetaStore,139757491,1274779304880 from sjc1c104.example.com,60020,1274747062601; 1
of 1
2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.42.181:60020
2010-05-25 11:01:05,641 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747062601,
server=10.209.42.181:60020
2010-05-25 11:01:05,723 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close
region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,729 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE:
GenericMetaStore,139757491,1274779304880 from sjc1c104.example.com,60020,1274747062601; 1
of 4
2010-05-25 11:01:05,729 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose
of GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,730 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set
as unassigned: GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,731 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region
GenericMetaStore,139757491,1274779304880 to sjc1c091.example.com,60020,1274747056415
2010-05-25 11:01:05,751 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN:
GenericMetaStore,139757491,1274779304880 from sjc1c091.example.com,60020,1274747056415; 1
of 1
2010-05-25 11:01:05,752 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.42.238:60020
2010-05-25 11:01:05,752 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747056415,
server=10.209.42.238:60020
2010-05-25 11:01:05,775 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close
region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE:
GenericMetaStore,139757491,1274779304880 from sjc1c091.example.com,60020,1274747056415; 1
of 2
2010-05-25 11:01:05,780 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose
of GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set
as unassigned: GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,808 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region
GenericMetaStore,139757491,1274779304880 to sjc1a003.example.com,60020,1274747057557
2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN:
GenericMetaStore,139757491,1274779304880 from sjc1a003.example.com,60020,1274747057557; 1
of 1
2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.32.148:60020
2010-05-25 11:01:05,829 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747057557,
server=10.209.32.148:60020
{code}

The culprit seems to be the code that wants to bring up underloaded regionservers up to average.
 That and something about the lightly loaded servers math that is off.

I'm marking this as a blocker.  I'm not sure why its not more common.  There were some issues
on this cluster regards disks filling but though such an event may have provoked the issue,
we should have evened out eventually.

Making this a blocker on 0.20.5.   Need to fix it for this user at least.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message