Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@apache.org
Date: Sun, 3 Aug 2014 17:42:12 +0000 (UTC)
From: "Josh Elser (JIRA)" <jira@apache.org>
To: notifications@accumulo.apache.org
Message-ID: <JIRA.12709168.1397774005764.3066.1407087732931@arcas>
In-Reply-To: <JIRA.12709168.1397774005764@arcas>
References: <JIRA.12709168.1397774005764@arcas>
Subject: [jira] [Commented] (ACCUMULO-2694) Offline tables block balancing
 for online tables
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/ACCUMULO-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084045#comment-14084045 ] 

Josh Elser commented on ACCUMULO-2694:
--------------------------------------

Last night my Jenkins build of the master branch hit a test failure that I've never seen before:

{noformat}
Error Message

our test table should exist in [!0, +r]

Stacktrace

java.lang.AssertionError: our test table should exist in [!0, +r]
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest.saneMonitorInfo(MiniAccumuloClusterImplTest.java:102)

Standard Output

2014-08-03 08:53:34,483 [impl.MiniAccumuloClusterImplTest] INFO : ensure monitor info includes some base information.
2014-08-03 08:53:34,503 [conf.SiteConfiguration] WARN : accumulo-site.xml not found on classpath
java.lang.Throwable
	at org.apache.accumulo.core.conf.SiteConfiguration.getXmlConfig(SiteConfiguration.java:77)
	at org.apache.accumulo.core.conf.SiteConfiguration.getHadoopConfiguration(SiteConfiguration.java:144)
	at org.apache.accumulo.core.conf.SiteConfiguration.getProperties(SiteConfiguration.java:121)
	at org.apache.accumulo.core.conf.AccumuloConfiguration.iterator(AccumuloConfiguration.java:112)
	at org.apache.accumulo.core.conf.ConfigSanityCheck.validate(ConfigSanityCheck.java:42)
	at org.apache.accumulo.core.conf.SiteConfiguration.getInstance(SiteConfiguration.java:62)
	at org.apache.accumulo.core.conf.SiteConfiguration.getInstance(SiteConfiguration.java:68)
	at org.apache.accumulo.server.security.SystemCredentials$SystemToken.get(SystemCredentials.java:123)
	at org.apache.accumulo.server.security.SystemCredentials$SystemToken.access$000(SystemCredentials.java:98)
	at org.apache.accumulo.server.security.SystemCredentials.<init>(SystemCredentials.java:56)
	at org.apache.accumulo.server.security.SystemCredentials.get(SystemCredentials.java:83)
	at org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImpl.getMasterMonitorInfo(MiniAccumuloClusterImpl.java:764)
	at org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest.saneMonitorInfo(MiniAccumuloClusterImplTest.java:94)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
{noformat}

Can you verify that your changes didn't cause this failure?

> Offline tables block balancing for online tables
> ------------------------------------------------
>
>                 Key: ACCUMULO-2694
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2694
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>         Environment: 1.6.0-RC2 Started CI with a 10-tablet pre-split table. 
>            Reporter: Mike Drob
>            Assignee: Sean Busbey
>            Priority: Critical
>              Labels: 16_qa_bug
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Both DefaultLoadBalancer and ChaoticLoadBalancer won't balance if there are outstanding migrations.
> In this instance, we have offline tables from previous CI runs. One of these tables had outstanding migrations:
> {noformat}
> 2014-04-17 09:47:34,716 [balancer.TabletBalancer] DEBUG: Scanning tablet server a2438.halxg.cloudera.com:10011[544d5edf1fec529] for table 8
> 2014-04-17 09:47:36,217 [balancer.TabletBalancer] DEBUG: Scanning tablet server a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] for table 8
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 4 migrations
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;4aa09c;4a809d: a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;7e603;7e4029: a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;21a044;21803d: a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,223 [master.Master] DEBUG: migration 8;59c02e;59a02b: a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] -> a2414.halxg.cloudera.com:10011[444d5f6b43ac4aa]
> {noformat}
> Later messages show these tablets being unloaded successfully. However, since the table is offline they never get loaded on the new tablet server. This means they never leave the queue, so balancing stops.
> As an added complication, this last set of migrations was added after the table was already offline. I think this is because there had been unhosted tablets which caused a bunch of contention around when balancing would finally happen.
> A few needed changes:
> # If the balancer isn't going to balance it needs a log message saying so. Ideally, this message should also include information about the outstanding migrations that are blocking it.
> # The migration cleanup thread should look for migrations involving offline tables and clear them. (I'd prefer this to having the balancer figure out whether a table is offline or online.)
> # When we offline a table, we should probably clear migrations related to that table. This isn't strictly necessary if the cleanup thread will get them eventually, but it would speed things up.
> Workarounds:
> # Migration state is only stored in Master memory, so failing over to a different master will force a recalculation, which will not include offline tables.
> # If for some reason you can't handle a failure of the current master, bringing the involved table back online (which might mean all offline tables) will allow migrations to resume. The table must remain online until there are no longer migrations involving it.
> # I *think* that if you clone the offline table and then delete the original, that will clear the outstanding migrations related to it. I did not test this, because the above two options are much better.
> The latter option will cause considerably more churn, especially if the offline table isn't actually providing utility.
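
For illustration only, here is a minimal sketch of needed changes #1 and #2 from the quoted description: drop migrations that belong to offline tables, and say so in a log message when outstanding migrations block balancing. It is not the actual fix and does not use Accumulo's classes. The class name, the method names, and the plain String/Map representation of migrations are all hypothetical. It also assumes that an extent string starts with the table id followed by ';', as the quoted log lines for table 8 suggest.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Accumulo code base.
class MigrationCleanupSketch {

  // Assumption: an extent is rendered as "<tableId>;<row>;<row>", e.g. "8;4aa09c;4a809d".
  static String tableIdOf(String extent) {
    int semi = extent.indexOf(';');
    return semi < 0 ? extent : extent.substring(0, semi);
  }

  // Removes every migration whose table is offline and returns the removed extents.
  // The map goes from extent to destination tablet server.
  static List<String> clearOfflineMigrations(Map<String, String> migrations,
      Set<String> offlineTableIds) {
    List<String> removed = new ArrayList<String>();
    Iterator<Map.Entry<String, String>> it = migrations.entrySet().iterator();
    while (it.hasNext()) {
      String extent = it.next().getKey();
      if (offlineTableIds.contains(tableIdOf(extent))) {
        removed.add(extent);
        it.remove(); // safe removal while iterating
      }
    }
    return removed;
  }

  // Text the balancer could log when it skips a pass; null when nothing is blocking.
  static String describeBlockedBalance(Map<String, String> migrations) {
    if (migrations.isEmpty()) {
      return null;
    }
    return "not balancing due to " + migrations.size() + " outstanding migrations: "
        + migrations.keySet();
  }

  public static void main(String[] args) {
    Map<String, String> migrations = new LinkedHashMap<String, String>();
    migrations.put("8;4aa09c;4a809d", "a2422.halxg.cloudera.com:10011");
    migrations.put("9;m;g", "a2414.halxg.cloudera.com:10011"); // made-up online table
    System.out.println(describeBlockedBalance(migrations));
    System.out.println("cleared: "
        + clearOfflineMigrations(migrations, Collections.singleton("8")));
    System.out.println(describeBlockedBalance(migrations));
  }
}
```

Running the cleanup on a timer would match the "cleanup thread" idea; calling the same method when a table goes offline would cover change #3.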

