accumulo-notifications mailing list archives

From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-2224) ZooSession should be more robust to transient DNS issues
Date Mon, 20 Jan 2014 17:58:23 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated ACCUMULO-2224:
----------------------------------

    Priority: Minor  (was: Major)

AFAICT, the ZooKeeper client will throw as soon as any of the hostnames specified in the
connect string fails to resolve (UnknownHostException).

The workaround for existing releases is to fix the underlying DNS problem and then restart
roles.
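
To find out which hostname is failing before restarting anything, a small check along these lines might help. This is only a sketch: ConnectStringCheck, Resolver and unresolvable are hypothetical names, not part of Accumulo or ZooKeeper, and the parsing assumes a plain "host:port,host:port/chroot" connect string.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (assumption): not part of Accumulo or ZooKeeper. */
final class ConnectStringCheck {

    /** Resolution step, injectable so the check can be exercised without real DNS. */
    interface Resolver {
        void resolve(String host) throws UnknownHostException;
    }

    /** Returns the hosts in a ZooKeeper-style connect string that fail to resolve. */
    static List<String> unresolvable(String connectString, Resolver resolver) {
        // Drop an optional chroot suffix such as "/accumulo".
        int slash = connectString.indexOf('/');
        String hosts = slash >= 0 ? connectString.substring(0, slash) : connectString;

        List<String> failed = new ArrayList<>();
        for (String entry : hosts.split(",")) {
            String hostPort = entry.trim();
            if (hostPort.isEmpty()) {
                continue;
            }
            // Strip an optional ":port". Bracketed IPv6 literals are not handled here.
            int colon = hostPort.lastIndexOf(':');
            String host = colon >= 0 ? hostPort.substring(0, colon) : hostPort;
            try {
                resolver.resolve(host);
            } catch (UnknownHostException e) {
                failed.add(host);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        String connect = args.length > 0 ? args[0] : "localhost:2181";
        // InetAddress.getAllByName is the lookup that appears in the traces below.
        System.out.println("Unresolvable: " + unresolvable(connect, InetAddress::getAllByName));
    }
}
```

Passing InetAddress::getAllByName performs real lookups. A fake Resolver can be substituted to try the parsing without touching DNS.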

Some stack traces from where this came up during testing (for those wishing to dedup errors
they might see):

tserver compaction
{noformat}
Unexpected exception in Split/MajC initiator
	java.lang.RuntimeException: java.net.UnknownHostException: zookeeper1.example.com
		at org.apache.accumulo.core.zookeeper.ZooSession.connect(ZooSession.java:94)
		at org.apache.accumulo.core.zookeeper.ZooSession.getSession(ZooSession.java:142)
		at org.apache.accumulo.core.zookeeper.ZooReader.getSession(ZooReader.java:36)
		at org.apache.accumulo.core.zookeeper.ZooReader.getZooKeeper(ZooReader.java:40)
		at org.apache.accumulo.core.zookeeper.ZooCache.getZooKeeper(ZooCache.java:56)
		at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:127)
		at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
		at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
		at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:121)
		at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:109)
		at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:47)
		at org.apache.accumulo.server.tabletserver.Tablet.findSplitRow(Tablet.java:3028)
		at org.apache.accumulo.server.tabletserver.Tablet.needsSplit(Tablet.java:3122)
		at org.apache.accumulo.server.tabletserver.TabletServer$MajorCompactor.run(TabletServer.java:2117)
		at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
		at java.lang.Thread.run(Thread.java:662)
	Caused by: java.net.UnknownHostException: zookeeper1.example.com
		at java.net.InetAddress.getAllByName0(InetAddress.java:1157)
		at java.net.InetAddress.getAllByName(InetAddress.java:1083)
		at java.net.InetAddress.getAllByName(InetAddress.java:1019)
		at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
		at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
		at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
		at org.apache.accumulo.core.zookeeper.ZooSession.connect(ZooSession.java:77)
		... 15 more
{noformat}

logger tracing
{noformat}
2014-01-16 00:00:12,772 [zookeeper.ZooSession] WARN : java.net.UnknownHostException : zookeeper2.example.com
2014-01-16 00:00:12,772 [trace.ZooTraceClient] ERROR: unable to get destination hosts in zookeeper
java.lang.RuntimeException: java.net.UnknownHostException: zookeeper2.example.com
        at org.apache.accumulo.core.zookeeper.ZooSession.connect(ZooSession.java:94)
        at org.apache.accumulo.core.zookeeper.ZooSession.getSession(ZooSession.java:142)
        at org.apache.accumulo.core.zookeeper.ZooReader.getSession(ZooReader.java:37)
        at org.apache.accumulo.server.zookeeper.ZooReaderWriter.getZooKeeper(ZooReaderWriter.java:57)
        at org.apache.accumulo.core.zookeeper.ZooReader.getChildren(ZooReader.java:66)
        at org.apache.accumulo.core.trace.ZooTraceClient.process(ZooTraceClient.java:64)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
{noformat}

master startup (I think):
{noformat}
Caused by: java.lang.RuntimeException: java.net.UnknownHostException: zookeeper1.example.com
        at org.apache.accumulo.core.zookeeper.ZooSession.connect(ZooSession.java:94)
        at org.apache.accumulo.core.zookeeper.ZooSession.getSession(ZooSession.java:142)
        at org.apache.accumulo.core.zookeeper.ZooReader.getSession(ZooReader.java:37)
        at org.apache.accumulo.server.zookeeper.ZooReaderWriter.getZooKeeper(ZooReaderWriter.java:57)
        at org.apache.accumulo.core.zookeeper.ZooReader.getChildren(ZooReader.java:61)
        at org.apache.accumulo.server.Accumulo.waitForZookeeperAndHdfs(Accumulo.java:201)
        at org.apache.accumulo.server.master.state.SetGoalState.main(SetGoalState.java:40)

{noformat}

Continuous Ingest stats collector
{noformat}
1389860157417 Failed to collect stats : java.net.UnknownHostException: zookeeper1.example.com
java.lang.RuntimeException: java.net.UnknownHostException: zookeeper1.example.com
        at org.apache.accumulo.core.zookeeper.ZooSession.connect(ZooSession.java:94)
        at org.apache.accumulo.core.zookeeper.ZooSession.getSession(ZooSession.java:142)
        at org.apache.accumulo.core.zookeeper.ZooReader.getSession(ZooReader.java:37)
        at org.apache.accumulo.core.zookeeper.ZooReader.getZooKeeper(ZooReader.java:41)
        at org.apache.accumulo.core.zookeeper.ZooCache.getZooKeeper(ZooCache.java:56)
        at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:127)
        at org.apache.accumulo.core.zookeeper.ZooCache.getChildren(ZooCache.java:178)
        at org.apache.accumulo.server.zookeeper.ZooLock.getLockData(ZooLock.java:414)
        at org.apache.accumulo.server.client.HdfsZooInstance.getMasterLocations(HdfsZooInstance.java:102)
        at org.apache.accumulo.core.client.impl.MasterClient.getConnection(MasterClient.java:52)
        at org.apache.accumulo.core.client.impl.MasterClient.getConnectionWithRetry(MasterClient.java:43)
        at org.apache.accumulo.server.test.continuous.ContinuousStatsCollector$StatsCollectionTask.getACUStats(ContinuousStatsCollector.java:128)
        at org.apache.accumulo.server.test.continuous.ContinuousStatsCollector$StatsCollectionTask.run(ContinuousStatsCollector.java:77)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
{noformat}

Continuous Ingest scanner (probably all BatchScanners)
{noformat}
Caused by: java.lang.RuntimeException: java.net.UnknownHostException: zookeeper1.example.com
        at org.apache.accumulo.core.zookeeper.ZooSession.connect(ZooSession.java:94)
        at org.apache.accumulo.core.zookeeper.ZooSession.getSession(ZooSession.java:142)
        at org.apache.accumulo.core.zookeeper.ZooReader.getSession(ZooReader.java:37)
        at org.apache.accumulo.core.zookeeper.ZooReader.getZooKeeper(ZooReader.java:41)
        at org.apache.accumulo.core.zookeeper.ZooCache.getZooKeeper(ZooCache.java:56)
        at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:127)
        at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
        at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
        at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:148)
        at org.apache.accumulo.core.client.impl.TabletLocator.getInstance(TabletLocator.java:96)
        at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:245)
        at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:94)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
{noformat}

Continuous Ingest writer (probably all users of BatchWriter):
{noformat}
Caused by: java.lang.RuntimeException: java.net.UnknownHostException: zookeeper1.example.com
        at org.apache.accumulo.core.zookeeper.ZooSession.connect(ZooSession.java:94)
        at org.apache.accumulo.core.zookeeper.ZooSession.getSession(ZooSession.java:142)
        at org.apache.accumulo.core.zookeeper.ZooReader.getSession(ZooReader.java:37)
        at org.apache.accumulo.core.zookeeper.ZooReader.getZooKeeper(ZooReader.java:41)
        at org.apache.accumulo.core.zookeeper.ZooCache.getZooKeeper(ZooCache.java:56)
        at org.apache.accumulo.core.zookeeper.ZooCache.retry(ZooCache.java:127)
        at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:233)
        at org.apache.accumulo.core.zookeeper.ZooCache.get(ZooCache.java:188)
        at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:148)
        at org.apache.accumulo.core.client.impl.TabletLocator.getInstance(TabletLocator.java:96)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.send(TabletServerBatchWriter.java:733)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.run(TabletServerBatchWriter.java:671)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
{noformat}

> ZooSession should be more robust to transient DNS issues
> --------------------------------------------------------
>
>                 Key: ACCUMULO-2224
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2224
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.5.0
>         Environment: 1.4.5-SNAP on CDH4 w/gremlins
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Minor
>             Fix For: 1.4.5, 1.5.1, 1.6.0
>
>
> While injecting network faults, I found that transient DNS problems caused us to bail
> out of ZooSessions rather than retrying as we do for all other IO problems. We should retry
> these failures just as we do for Connection Refused or other networking problems.
> Since the addition of ACCUMULO-131, we can be sure that we won't retry actual invalid
> hosts forever. Instead, after the timeout period that holds for all other problems, we'll
> properly exit.
> The warn messages logged for IOExceptions should suffice to indicate improperly specified
> host names.
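
The retry behavior described above could look roughly like the following. It is a minimal sketch under assumptions, not the actual ZooSession code or the committed fix: RetryingConnect, Connector and connectWithRetry are hypothetical names, and timeoutMillis stands in for whatever timeout the existing retry logic already uses.

```java
import java.io.IOException;

/** Hypothetical sketch (assumption): not the actual ZooSession code. */
final class RetryingConnect {

    /** Stand-in for constructing a ZooKeeper client; may throw any IOException. */
    interface Connector<T> {
        T connect() throws IOException;
    }

    /**
     * Calls the connector until it succeeds or timeoutMillis has elapsed.
     * java.net.UnknownHostException extends IOException, so the single catch
     * below retries DNS failures the same way as any other I/O error.
     */
    static <T> T connectWithRetry(Connector<T> connector, long timeoutMillis, long sleepMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            try {
                return connector.connect();
            } catch (IOException e) {
                // Warn on every failure so a misspelled host name shows up in the logs.
                System.err.println("WARN: connection attempt failed: " + e);
                if (System.currentTimeMillis() >= deadline) {
                    throw new RuntimeException("giving up after " + timeoutMillis + " ms", e);
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Preserve the interrupt status and stop retrying.
                Thread.currentThread().interrupt();
                throw new RuntimeException(ie);
            }
        }
    }
}
```

A genuinely invalid host still ends in a RuntimeException once the deadline passes, with the last UnknownHostException as its cause, while a transient failure costs only a few logged warnings.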



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
