accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ACCUMULO-3148) TabletServer didn't get Session expired in HalfDeadTServerIT
Date Fri, 19 Sep 2014 15:12:34 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140704#comment-14140704 ]

Josh Elser edited comment on ACCUMULO-3148 at 9/19/14 3:12 PM:
---------------------------------------------------------------

On the contrary, I don't see anything in the master log which indicates that the master killed
it. The "Lost servers" log message is triggered after the Watcher fires on the znode for this
tserver. The znode's data is empty, so the master transitions the tserver into the dead tservers set.

{noformat}
2014-09-15 09:40:16,024 [master.Master] WARN : Lost servers [ip-172-31-33-94:40793[14878ae7b920006]]
2014-09-15 09:40:16,024 [master.EventCoordinator] INFO : There are now 0 tablet servers
{noformat}
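
To make that sequence concrete, here is a rough sketch of the bookkeeping as I read it from the logs: a watch fires for a tserver's lock znode, the node's data comes back empty, and the server moves from the live set to the dead set. This is a plain-Java model only. The class and method names (TServerLockTracker, onLockNodeEvent) are hypothetical and are not Accumulo's or ZooKeeper's actual API.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the master's bookkeeping; not Accumulo's real classes.
final class TServerLockTracker {
    private final Set<String> live = new TreeSet<>();
    private final Set<String> dead = new TreeSet<>();

    // Record a tserver that currently holds its lock.
    void register(String server) {
        live.add(server);
    }

    // Stand-in for the callback that runs when the watch on a tserver's
    // lock znode fires. lockData is whatever the node holds at that point.
    // Returns true if the server was moved into the dead set.
    boolean onLockNodeEvent(String server, byte[] lockData) {
        boolean lockGone = lockData == null || lockData.length == 0;
        if (lockGone && live.remove(server)) {
            dead.add(server);
            return true;
        }
        return false;
    }

    Set<String> liveServers() {
        return live;
    }

    Set<String> deadServers() {
        return dead;
    }

    public static void main(String[] args) {
        TServerLockTracker tracker = new TServerLockTracker();
        tracker.register("ip-172-31-33-94:40793");
        // Empty data on the lock node: the master only reacts, it does not kill anything.
        tracker.onLockNodeEvent("ip-172-31-33-94:40793", new byte[0]);
        System.out.println("live=" + tracker.liveServers().size()
            + " dead=" + tracker.deadServers());
    }
}
```

The point of the sketch is that the master's side is purely reactive: it observes that the lock data is gone and updates its sets, which matches the two log lines above.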

The above happens in the middle of the tserver "Sleeping" block. When it wakes up, it notices
that it lost its lock.

{noformat}
2014-09-15 09:40:20,088 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
{noformat}
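
For the ordering argument, the gap between the two log lines can be checked mechanically. The snippet below is just a sketch that parses the leading timestamps of the master and tserver lines quoted above and prints the difference. The class and method names are made up for illustration, and it assumes both logs were written against the same clock (which should hold for a single-host test run, but is an assumption).

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for comparing timestamps in the log excerpts.
final class LockLossTimeline {
    // Timestamp layout used by the excerpts, e.g. "2014-09-15 09:40:16,024".
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // The timestamp occupies the first 23 characters of each log line.
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), LOG_TS);
    }

    // Milliseconds elapsed from the earlier log line to the later one.
    static long millisBetween(String earlier, String later) {
        return Duration.between(timestampOf(earlier), timestampOf(later)).toMillis();
    }

    public static void main(String[] args) {
        String masterLost =
            "2014-09-15 09:40:16,024 [master.Master] WARN : Lost servers";
        String tserverFatal =
            "2014-09-15 09:40:20,088 [tserver.TabletServer] FATAL: Lost tablet server lock";
        // Prints 4064: the master noticed the missing lock about 4 seconds
        // before the tserver logged that it had lost it.
        System.out.println(millisBetween(masterLost, tserverFatal));
    }
}
```

So the master recorded the server as lost roughly four seconds before the tserver woke up and logged the FATAL message, which is consistent with the tserver exiting on its own rather than being killed.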

The master did log a few SocketTimeoutExceptions, but I don't see any indication that it actively
killed the server; rather, the server died on its own (unless our logging is insufficient for what
you're referencing).


was (Author: elserj):
On the contrary, I don't see anything in the master log which indicates that the master killed
it

{noformat}
2014-09-15 09:40:16,024 [master.Master] WARN : Lost servers [ip-172-31-33-94:40793[14878ae7b920006]]
2014-09-15 09:40:16,024 [master.EventCoordinator] INFO : There are now 0 tablet servers
{noformat}

The above happens in the middle of the tserver "Sleeping" block. When it wakes up, it notices
that it lost its lock

{noformat}
2014-09-15 09:40:20,088 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
{noformat}

The master did log a few SocketTimeoutExceptions, but I don't see any indication that it actively
killed the server, rather it died on its own (unless our logging is insufficient in what you're
referencing).

> TabletServer didn't get Session expired in HalfDeadTServerIT
> ------------------------------------------------------------
>
>                 Key: ACCUMULO-3148
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3148
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.6.1, 1.7.0
>
>
> Been seeing spurious failures with HalfDeadTServerIT where it doesn't get the ZK session expiration.
> {noformat}
> 2014-09-15 09:39:59,201 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.33.94:35957 !0 0 entries in 0.07 secs, nbTimes = [63 63 63.00 1]
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> 2014-09-15 09:40:20,088 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
> 2014-09-15 09:40:20,088 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/d0b9b8e7-9869-4b00-9ae7-317f5231f2c1/tables/1/conf/table.iterator.minc.vers.opt.maxVersions
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:261)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:153)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:277)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:224)
> 	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:114)
> 	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.getProperties(ZooCachePropertyAccessor.java:144)
> 	at org.apache.accumulo.server.conf.TableConfiguration.getProperties(TableConfiguration.java:108)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.iterator(AccumuloConfiguration.java:69)
> 	at org.apache.accumulo.core.conf.ConfigSanityCheck.validate(ConfigSanityCheck.java:40)
> 	at org.apache.accumulo.server.conf.ServerConfigurationFactory.getTableConfiguration(ServerConfigurationFactory.java:155)
> 	at org.apache.accumulo.server.conf.ServerConfiguration.getTableConfiguration(ServerConfiguration.java:69)
> 	at org.apache.accumulo.tserver.TabletServer.getTableConfiguration(TabletServer.java:3983)
> 	at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1277)
> 	at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1256)
> 	at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1112)
> 	at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1089)
> 	at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2935)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-09-15 09:40:20,090 [tserver.TabletServer] WARN : Check for long GC pauses not called in a timely fashion. Expected every 5.0 seconds but was 16.3 seconds since last check
> 2014-09-15 09:40:20,477 [datanode.DataNode] ERROR: 127.0.0.1:57185:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:42146 dst: /127.0.0.1:57185
> java.io.IOException: Premature EOF from inputStream
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:771)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:718)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> It looks like the tserver killed itself after the connection loss but before the tserver retried to connect and got the session expiration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
