hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19598) Fix TestAssignmentManagerMetrics flaky test
Date Wed, 17 Jan 2018 06:30:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328332#comment-16328332 ]

stack commented on HBASE-19598:

The test helped, [~balazs.meszaros].

.001 Root issue is that Master was stuck in waitForMasterActive, regions were being assigned
to Master, and the metrics we were expecting were incorrect (if the killed regionserver was
hosting both a user-space region and the hbase:meta region).

Master never left waitForMasterActive because it never checked state of the clusterUp flag.
The test here was aborting regionserver and then just exiting. The minihbasecluster shutdown
sets the cluster down flag but we were never looking at it so Master thread was staying up.
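To illustrate the hang described above, here is a minimal sketch of a wait loop that also looks at a cluster-up flag on every pass. The class, field, and enum names are made up for illustration; this is not the HMaster code or the patch, only the shape of the check that was missing.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the master's wait loop; none of these names are HBase APIs.
class WaitForActiveSketch {
    /** Why the wait ended, so the caller can tell "became active" from "gave up". */
    enum Result { ACTIVE, CLUSTER_DOWN, STOPPED }

    final AtomicBoolean activeMaster = new AtomicBoolean(false);
    final AtomicBoolean clusterUp = new AtomicBoolean(true);
    final AtomicBoolean stopped = new AtomicBoolean(false);

    Result waitForMasterActive(long sleepMillis) {
        while (!activeMaster.get()) {
            if (stopped.get()) {
                return Result.STOPPED;
            }
            // The check the comment says was missing: without it, a cluster
            // shutdown never releases this thread.
            if (!clusterUp.get()) {
                return Result.CLUSTER_DOWN;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Result.STOPPED;
            }
        }
        return Result.ACTIVE;
    }

    public static void main(String[] args) throws Exception {
        WaitForActiveSketch master = new WaitForActiveSketch();
        // Simulate the minicluster shutdown flipping the flag a little later.
        Thread shutdown = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            master.clusterUp.set(false);
        });
        shutdown.start();
        System.out.println(master.waitForMasterActive(10));
        shutdown.join();
    }
}
```

In this sketch the waiting thread exits once the shutdown thread clears clusterUp, rather than sleeping forever because activeMaster never becomes true.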

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java Changed
log from ERROR to WARN and suppressed stack trace. This is the 'stop' method. It should allow
that we may be going down a little unclean. No need of spew in logs.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java The tablesOnMaster
check in waitForMasterActive looks wrong. It was making it so a 'normal' Master was getting
stuck in here. This is not the place to worry about tablesOnMaster. That is for the balancer
to be concerned with. There is a problem with Master hosting system-tables-only. After further
study, Master can carry regions like a regionserver but making it so it carries system tables
only is tricky given meta assign happens ahead of all others which means that the Master needs
to have checked-in as a regionserver super early... It needs work. Punted for now.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Mostly renaming so lists and maps of region infos have the same name as they have elsewhere in
the code base, and cleaning up confusion that may arise when we talk of servers-for-system-tables.
In the code changed here, that term means something other than the normal understanding:
it is about filtering regionservers by their version numbers so we favor regionservers with
higher version numbers. Needs to go back up into the balancer.
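A rough sketch of what "filtering regionservers by their version numbers" could look like, assuming each server reports a single comparable integer version. The class name, method name, and integer encoding are assumptions for illustration only; the real code is not shown in this message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not the AssignmentManager implementation.
class VersionFilterSketch {
    /** Returns the names of the servers that report the highest version, sorted by name. */
    static List<String> highestVersionServers(Map<String, Integer> versionByServer) {
        int max = Integer.MIN_VALUE;
        for (int version : versionByServer.values()) {
            max = Math.max(max, version);
        }
        List<String> favored = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : versionByServer.entrySet()) {
            if (entry.getValue() == max) {
                favored.add(entry.getKey());
            }
        }
        Collections.sort(favored);
        return favored;
    }
}
```

With an empty map the helper returns an empty list, so a caller would have to decide what to do when no server has checked in yet.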

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/BaseLoadBalancer.java
It was possible for the Master to be given regions if no regionservers available (as per the
failing unit test in this case).
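One way to picture the guard, as a sketch only: build the candidate list without the Master unless tables-on-master is explicitly enabled, and return an empty list when nothing else is available so the caller can defer assignment. The names below are hypothetical and the actual BaseLoadBalancer change is not reproduced in this message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the BaseLoadBalancer code.
class MasterExclusionSketch {
    /**
     * Returns the servers that may receive regions. The master is dropped
     * unless tablesOnMaster is true, even if that leaves no candidates.
     */
    static List<String> candidates(List<String> liveServers, String master, boolean tablesOnMaster) {
        List<String> result = new ArrayList<>(liveServers);
        if (!tablesOnMaster) {
            result.removeIf(server -> server.equals(master));
        }
        return result;
    }
}
```

An empty result here means "no regionserver available yet", which is different from falling back to the Master as the only live server.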

M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java Minor
reordering moving the waitForMasterActive later in the initialize and wrapping each test in
a check if we are to keep looping (which checks cluster status flag).

M hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerMetrics.java
This was an old test from the days when Master carried system tables. Updated test and fixed
metrics. Metrics count the hbase:meta along with the userspace region so upped expected numbers
(previously the hbase:meta was hosted on the master so metrics were not incremented).

M hbase-server/src/test/java/org/apache/hadoop/hbase/master/balancer/TestRegionsOnMasterOptions.java
I took a look at this test again but nope, needs a load of work still to make it pass.

M hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java Stop being so

> Fix TestAssignmentManagerMetrics flaky test
> -------------------------------------------
>                 Key: HBASE-19598
>                 URL: https://issues.apache.org/jira/browse/HBASE-19598
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0-beta-1
>            Reporter: Balazs Meszaros
>            Assignee: Balazs Meszaros
>            Priority: Major
>         Attachments: HBASE-19598.master.001.patch, TestUtil.java
> TestAssignmentManagerMetrics fails constantly. After bisecting, it seems that commit
> 010012cbcb broke it (HBASE-18946).
> The test method runs successfully, but it cannot shut the minicluster down, and hangs

This message was sent by Atlassian JIRA
