hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2617) Load balancer falls into pathological state if one server under average - slop; endless churn
Date Thu, 27 May 2010 17:14:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872291#action_12872291
] 

HBase Review Board commented on HBASE-2617:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/97/
-----------------------------------------------------------

Review request for hbase and Ryan Rawson.


Summary
-------

Fix for pathological balancer.  Mostly logging changes to remove confusion as to what the
balancer is up to but also removed a mistake where we'd always give a server one region ('just
in case") if regions to assign though it might be a server that is loaded.  In clusters w/
lots of nodes with a few of the nodes not in-balance, the lightly loaded nodes would never
get regions so we'd just keep reassigning and never settle.


This addresses bug hbase-2617.


Diffs
-----

  branches/0.20/src/java/org/apache/hadoop/hbase/master/RegionManager.java 948860 

Diff: http://review.hbase.org/r/97/diff


Testing
-------


Thanks,

stack




> Load balancer falls into pathological state if one server under average - slop; endless
churn
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2617
>                 URL: https://issues.apache.org/jira/browse/HBASE-2617
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.20.5
>
>         Attachments: 2617.txt
>
>
> I'm looking at a 0.20.4 cluster of 80 fast machines.  It runs fine for a while and then
falls over into crazy balancing churn (My view on logs is sporadic but have a log before me
where this is happening).  Things I see that seem to be of 0.20.4 particularly:
> + We don't reach an equilibrium or at least it takes so long, its as though it wasn't
every going to happen
> + Master log filled w/ open, close of one or two regions  usually the same ones over
and over (The Regions that are candidates to close are provided by the RS.  They are ordered
by hash of their name.  We return the top ten from this Set every time.  So, we always close
the same regions all the time even if we just opened it)
> + Often, we'll tell an RS to close a region.  It will do the job.  In 0.20.4 we made
it so if RS has any work at all for the master, that we return immediately rather than wait
for the reporting period to elaspse.   So, on these fast machines, it can be back near immediately
if it just opened another some other region, say.  It can get assigned the region it just
closed.  Seems to happen frequently enough.
> For example, look at the below extract featuring a single regions life.  Its opened and
closed 5 times in about 1/2 a second:
> {code}
> 2010-05-25 11:01:05,488 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo:
ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,489 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region
set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,490 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning
region GenericMetaStore,139757491,1274779304880 to a025.example.com,60020,1274744064673
> 2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from a025.example.com,60020,1274744064673;
1 of 1
> 2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.32.189:60020
> 2010-05-25 11:01:05,511 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274744064673,
server=10.209.32.189:60020
> 2010-05-25 11:01:05,548 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to
close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from a025.example.com,60020,1274744064673;
1 of 2
> 2010-05-25 11:01:05,552 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo:
ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region
set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,556 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning
region GenericMetaStore,139757491,1274779304880 to a028.example.com,60020,1274747560769
> 2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from a028.example.com,60020,1274747560769;
1 of 1
> 2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.32.185:60020
> 2010-05-25 11:01:05,579 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747560769,
server=10.209.32.185:60020
> 2010-05-25 11:01:05,599 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to
close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,605 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from a028.example.com,60020,1274747560769;
1 of 2
> 2010-05-25 11:01:05,605 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo:
ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,606 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region
set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,607 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning
region GenericMetaStore,139757491,1274779304880 to sjc1c104.example.com,60020,1274747062601
> 2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from sjc1c104.example.com,60020,1274747062601;
1 of 1
> 2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.42.181:60020
> 2010-05-25 11:01:05,641 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747062601,
server=10.209.42.181:60020
> 2010-05-25 11:01:05,723 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to
close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,729 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from sjc1c104.example.com,60020,1274747062601;
1 of 4
> 2010-05-25 11:01:05,729 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo:
ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,730 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region
set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,731 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning
region GenericMetaStore,139757491,1274779304880 to sjc1c091.example.com,60020,1274747056415
> 2010-05-25 11:01:05,751 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from sjc1c091.example.com,60020,1274747056415;
1 of 1
> 2010-05-25 11:01:05,752 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.42.238:60020
> 2010-05-25 11:01:05,752 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747056415,
server=10.209.42.238:60020
> 2010-05-25 11:01:05,775 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to
close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from sjc1c091.example.com,60020,1274747056415;
1 of 2
> 2010-05-25 11:01:05,780 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo:
ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region
set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,808 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning
region GenericMetaStore,139757491,1274779304880 to sjc1a003.example.com,60020,1274747057557
> 2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from sjc1a003.example.com,60020,1274747057557;
1 of 1
> 2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880
open on 10.209.32.148:60020
> 2010-05-25 11:01:05,829 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated
row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747057557,
server=10.209.32.148:60020
> {code}
> The culprit seems to be the code that wants to bring up underloaded regionservers up
to average.  That and something about the lightly loaded servers math that is off.
> I'm marking this as a blocker.  I'm not sure why its not more common.  There were some
issues on this cluster regards disks filling but though such an event may have provoked the
issue, we should have evened out eventually.
> Making this a blocker on 0.20.5.   Need to fix it for this user at least.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message