accumulo-user mailing list archives

From Alex Lee <a...@orbistechnologies.com>
Subject RE: Tablet server stuck waiting for lock
Date Wed, 05 Mar 2014 18:33:19 GMT
Eric,

I had previously done a stop-all.sh, so I started everything back up again to check this.

Zoo1:2181, follower, 4 clients
Zoo2:2181, leader, 3 clients
Zoo3:2181, follower, 5 clients

The behavior of the tablet servers seems to be inconsistent. Upon restarting Accumulo, the
overview page gives the impression that everything is fine. However, it is only listing 1
tablet server (Node 1), and it is now tablet server #2 that is timing out while "Waiting for
tablet server lock". There are no errors in tablet server #1's logs right now, while tablet
server #2 failed after too many retries of waiting for the lock.

Also, the master log still has no mention of the second node's IP address, even though the
start-all.sh script indicates that the tserver and logger are being started on both nodes.

Thanks,

Alex


From: Eric Newton [mailto:eric.newton@gmail.com]
Sent: Wednesday, March 05, 2014 1:25 PM
To: user@accumulo.apache.org
Subject: Re: Tablet server stuck waiting for lock

On the monitor page, there's a box that shows your zookeepers and their status.  What does
it say?

-Eric


On Wed, Mar 5, 2014 at 1:09 PM, Alex Lee <alee@orbistechnologies.com> wrote:
DFS permissions are currently disabled. I'm using the accumulo user for "accumulo init" and
for "start-all.sh", and it is also the user that has passwordless SSH enabled.

I ran "hadoop fs -ls /accumulo" as the accumulo user on both tablet servers, and I am able
to see inside of the /accumulo directory on hdfs.

Alex

From: Ott, Charlie H. [mailto:CHARLES.H.OTT@leidos.com]
Sent: Wednesday, March 05, 2014 1:02 PM
To: user@accumulo.apache.org
Subject: RE: Tablet server stuck waiting for lock

The connection reset by peer from the Master in combination with the lock not acquired by
the tablet server makes me wonder if the process owner for the tablet server is able to access
HDFS correctly.

Are DFS permissions enabled on your HDFS? It makes me think the tablet server does not have
permission to read from the /accumulo path that was initialized on the master. Did you use
the same account for 'accumulo init'?



From: user-return-3823-CHARLES.H.OTT=leidos.com@accumulo.apache.org
[mailto:user-return-3823-CHARLES.H.OTT=leidos.com@accumulo.apache.org] On Behalf Of Alex Lee
Sent: Wednesday, March 05, 2014 12:17 PM
To: user@accumulo.apache.org
Subject: Tablet server stuck waiting for lock

Hello,

I'm trying to create a virtualized Accumulo 1.4.4 cluster with 4 tablet servers using Hadoop
0.20.2 and ZooKeeper 3.3.5. It didn't seem to be working correctly with 4 tablet servers,
so I first tried just running with one tablet server, which seemed to work fine. When I tried
to run it with just 2 tablet servers, I ran into some issues.

Just to preface, I double-checked the configs within ZooKeeper and Accumulo, and everything matches.
All hostnames are resolving correctly, and passwordless SSH for the accumulo user is also
functional between all nodes. Running "echo stat | nc <zk-server> <zk port>" responds
appropriately.
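
The same "stat" check can be scripted so that each server's role is read out programmatically. The following is a minimal sketch, assuming the reply contains a "Mode:" line as ZooKeeper's "stat" four-letter command normally returns; the function names zk_stat and parse_mode are hypothetical and not part of any ZooKeeper or Accumulo API.

```python
import socket


def zk_stat(host, port=2181, timeout=5.0):
    """Send the 'stat' four-letter command and return the raw text reply.

    Equivalent in spirit to: echo stat | nc <zk-server> <zk port>
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(stat_text):
    """Return the value of the 'Mode:' line (e.g. leader, follower), or None."""
    for line in stat_text.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


# Example use (hostnames are placeholders for the real ZooKeeper servers):
#   for host in ["<zk-server-1>", "<zk-server-2>"]:
#       print(host, parse_mode(zk_stat(host)))
```

A healthy ensemble should report exactly one leader, with the remaining servers as followers; a None result would suggest the server did not answer the command as expected.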

Here's the first error log for the Tablet Master:

2014-03-05 11:18:16,626 [master.Master] ERROR: Error processing table state for store Root Tablet
org.apache.thrift.transport.TTransportException: java.io.IOException: Connection reset by peer
        at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
        at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:158)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.flush(ThriftTransportPool.java:299)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.send_loadTablet(TabletClientService.java:653)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.loadTablet(TabletClientService.java:640)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$2.invoke(TraceWrap.java:84)
        at com.sun.proxy.$Proxy4.loadTablet(Unknown Source)
        at org.apache.accumulo.server.master.LiveTServerSet$TServerConnection.assignTablet(LiveTServerSet.java:86)
        at org.apache.accumulo.server.master.Master$TabletGroupWatcher.flushChanges(Master.java:1818)
        at org.apache.accumulo.server.master.Master$TabletGroupWatcher.run(Master.java:1426)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(Unknown Source)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.write(Unknown Source)
        at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.flush(Unknown Source)
        at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
        ... 13 more

Here are the error logs for Tablet Server #1:

2014-03-05 11:17:15,152 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
2014-03-05 11:17:15,187 [util.FileSystemMonitor] INFO : Filesystem monitor started
2014-03-05 11:17:15,194 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
2014-03-05 11:17:15,499 [tabletserver.TabletServer] INFO : port = 9997
2014-03-05 11:17:15,540 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
2014-03-05 11:17:16,633 [tabletserver.TabletServer] WARN : Got loadTablet message from master before lock acquired, ignoring...
2014-03-05 11:17:16,634 [server.TNonblockingServer] ERROR: Unexpected exception while invoking!
java.lang.RuntimeException: Lock not acquired
        at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.checkPermission(TabletServer.java:1782)
        at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.loadTablet(TabletServer.java:1814)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
        at com.sun.proxy.$Proxy1.loadTablet(Unknown Source)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$loadTablet.process(TabletClientService.java:2510)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2053)
        at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
        at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
        at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Unknown Source)
2014-03-05 11:17:20,564 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
2014-03-05 11:17:25,589 [tabletserver.TabletServer] INFO : Waiting for tablet server lock

(continues until too many retries, then exits)

Tablet Server #2's logs get as far as this (below), and then just stop.

2014-03-05 11:17:14,112 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
2014-03-05 11:17:14,149 [util.FileSystemMonitor] INFO : Filesystem monitor started
2014-03-05 11:17:14,157 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
2014-03-05 11:17:14,481 [tabletserver.TabletServer] INFO : port = 9997

Also, the master logs interestingly never make any calls to Tablet #2's IP address.
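
One way to cross-check which address each tablet server announced at startup is to pull the "Tablet server starting on" lines out of each log and compare them. This is a minimal sketch, assuming the log line format shown in the excerpts above; the function names and the log labels are hypothetical.

```python
import re
from collections import defaultdict

# Matches the startup line format seen in the tablet server logs above.
START_RE = re.compile(r"Tablet server starting on (\S+)")


def starting_addresses(logs):
    """Map each announced address to the labels of the logs that reported it.

    `logs` is a dict of {label: log text}, e.g. one entry per tablet server.
    """
    by_addr = defaultdict(list)
    for label, text in logs.items():
        for match in START_RE.finditer(text):
            by_addr[match.group(1)].append(label)
    return dict(by_addr)


def shared_addresses(logs):
    """Return only the addresses announced by more than one log."""
    return {
        addr: labels
        for addr, labels in starting_addresses(logs).items()
        if len(labels) > 1
    }
```

Applied to the two excerpts above, both logs announce 172.16.111.3, so that address would be flagged as shared. Whether that is related to the lock problem is not established here, but it may be worth checking how each node resolves its own hostname.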

Any thoughts? We have another cluster that is set up identically in just about every way (besides
hostnames), but it has never experienced any of these issues. My research shows that these
issues can exist within 1.4.3, which we were using at first, but we switched to 1.4.4 because
these types of issues were supposed to be resolved. Any help would be greatly appreciated.

Thanks,

Alex Lee

