accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Tablet server stuck waiting for lock
Date Wed, 05 Mar 2014 18:42:51 GMT
It looks like both servers are resolving their address to be 172.16.111.3.
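
For anyone hitting the same symptom, the check above can be done mechanically. What follows is a minimal sketch, not part of Accumulo: it scans each tablet server's log for the "Tablet server starting on" line quoted in this thread and reports any address that more than one node claims. The function names and the node labels ("tserver1", "tserver2") are made up for illustration. If two servers advertise the same address and port, they are likely contending for the same tablet server lock in ZooKeeper, which would match one server sitting at "Waiting for tablet server lock".

```python
import re
from collections import defaultdict

# Matches the startup line seen in the tablet server logs in this thread.
# \s+ tolerates the hard line wrapping that mail clients add to pasted logs.
START_LINE = re.compile(r"Tablet server\s+starting on\s+(\S+)")


def advertised_address(log_text):
    """Return the address from the first startup line in log_text, or None."""
    match = START_LINE.search(log_text)
    return match.group(1) if match else None


def shared_addresses(logs_by_node):
    """Return {address: [nodes]} for addresses claimed by two or more nodes.

    logs_by_node maps a node label (hypothetical, chosen by the caller)
    to the text of that node's tablet server log.
    """
    claimed = defaultdict(list)
    for node, text in logs_by_node.items():
        addr = advertised_address(text)
        if addr is not None:
            claimed[addr].append(node)
    return {addr: sorted(nodes) for addr, nodes in claimed.items() if len(nodes) > 1}


if __name__ == "__main__":
    # Startup lines copied from the two logs quoted below.
    logs = {
        "tserver1": "2014-03-05 11:17:15,152 [tabletserver.TabletServer] "
                    "INFO : Tablet server starting on 172.16.111.3",
        "tserver2": "2014-03-05 11:17:14,112 [tabletserver.TabletServer] "
                    "INFO : Tablet server starting on 172.16.111.3",
    }
    print(shared_addresses(logs))
```

A non-empty result usually points at name resolution on the nodes (for example, an /etc/hosts entry copied from one VM to another) rather than at Accumulo itself, so that is the first place to look.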

-Eric



On Wed, Mar 5, 2014 at 1:33 PM, Alex Lee <alee@orbistechnologies.com> wrote:

>  Eric,
>
>
>
> I had previously done a stop-all.sh, so I started everything back up again
> to check this.
>
>
>
> Zoo1:2181, follower, 4 clients
>
> Zoo2:2181, leader, 3 clients
>
> Zoo3:2181, follower, 5 clients
>
>
>
> The behavior of the tablet servers seems to be inconsistent. Upon
> restarting Accumulo, the overview page gives the impression that everything
> is fine. However, it is only listing 1 tablet server (Node 1), and it is
> now tablet server #2 that is timing out while "Waiting for tablet server
> lock". There are no errors in tablet server #1's logs right now, while
> tablet server #2 failed after too many retries of waiting for the lock.
>
>
>
> Also, the master log still has no mention of the second node's IP address,
> even though the start-all.sh script indicates that the tserver and logger
> are being started on both nodes.
>
>
>
> Thanks,
>
>
>
> Alex
>
>
>
>
>
> *From:* Eric Newton [mailto:eric.newton@gmail.com]
> *Sent:* Wednesday, March 05, 2014 1:25 PM
> *To:* user@accumulo.apache.org
> *Subject:* Re: Tablet server stuck waiting for lock
>
>
>
> On the monitor page, there's a box that shows your zookeepers and their
> status.  What does it say?
>
>
>
> -Eric
>
>
>
>
>
> On Wed, Mar 5, 2014 at 1:09 PM, Alex Lee <alee@orbistechnologies.com>
> wrote:
>
> DFS permissions are currently disabled. I'm using the accumulo user for
> "accumulo init" and for "start-all.sh", and it is also the user that has
> passwordless SSH enabled.
>
>
>
> I ran "hadoop fs -ls /accumulo" as the accumulo user on both tablet
> servers, and I am able to see inside of the /accumulo directory on hdfs.
>
>
>
> Alex
>
>
>
> *From:* Ott, Charlie H. [mailto:CHARLES.H.OTT@leidos.com]
> *Sent:* Wednesday, March 05, 2014 1:02 PM
> *To:* user@accumulo.apache.org
> *Subject:* RE: Tablet server stuck waiting for lock
>
>
>
> The connection reset by peer from the Master in combination with the lock
> not acquired by the tablet server makes me wonder if the process owner for
> the tablet server is able to access HDFS correctly.
>
>
>
> Are dfs permissions enabled on your HDFS?  It makes me think the tablet
> server does not have permissions to read from the /accumulo path that was
> initialized on the master.  Did you use the same account for 'accumulo
> init' ?
>
>
>
>
>
>
>
> *From:* user-return-3823-CHARLES.H.OTT=leidos.com@accumulo.apache.org [
> mailto:user-return-3823-CHARLES.H.OTT=leidos.com@accumulo.apache.org]
> *On Behalf Of *Alex Lee
> *Sent:* Wednesday, March 05, 2014 12:17 PM
> *To:* user@accumulo.apache.org
> *Subject:* Tablet server stuck waiting for lock
>
>
>
> Hello,
>
>
>
> I'm trying to create a virtualized Accumulo 1.4.4 cluster with 4 tablet
> servers using Hadoop 0.20.2 and ZooKeeper 3.3.5. It didn't seem to be
> working correctly with 4 tablet servers, so I first tried just running with
> one tablet server, which seemed to work fine. When I tried to run it with
> just 2 tablet servers, I ran into some issues.
>
>
>
> To preface, I double-checked the configs within ZooKeeper and Accumulo,
> and everything matches. All hostnames are resolving correctly, and
> passwordless SSH for the accumulo user is also functional between all
> nodes. Running "echo stat | nc <zk-server> <zk port>" responds
> appropriately.
>
>
>
> Here's the first error log for the Tablet Master:
>
>
>
> 2014-03-05 11:18:16,626 [master.Master] ERROR: Error processing table
> state for store Root Tablet
>
> org.apache.thrift.transport.TTransportException: java.io.IOException:
> Connection reset by peer
>
>         at
> org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
>
>         at
> org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:158)
>
>         at
> org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.flush(ThriftTransportPool.java:299)
>
>         at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.send_loadTablet(TabletClientService.java:653)
>
>         at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.loadTablet(TabletClientService.java:640)
>
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>
>         at java.lang.reflect.Method.invoke(Unknown Source)
>
>         at
> org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$2.invoke(TraceWrap.java:84)
>
>         at com.sun.proxy.$Proxy4.loadTablet(Unknown Source)
>
>         at
> org.apache.accumulo.server.master.LiveTServerSet$TServerConnection.assignTablet(LiveTServerSet.java:86)
>
>         at
> org.apache.accumulo.server.master.Master$TabletGroupWatcher.flushChanges(Master.java:1818)
>
>         at
> org.apache.accumulo.server.master.Master$TabletGroupWatcher.run(Master.java:1426)
>
> Caused by: java.io.IOException: Connection reset by peer
>
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>
>         at sun.nio.ch.SocketDispatcher.write(Unknown Source)
>
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
>
>         at sun.nio.ch.IOUtil.write(Unknown Source)
>
>         at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>
>         at
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>
>         at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>
>         at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>
>         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>
>         at java.io.BufferedOutputStream.flush(Unknown Source)
>
>         at
> org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
>
>         ... 13 more
>
>
>
> Here are the error logs for Tablet Server #1:
>
>
>
> 2014-03-05 11:17:15,152 [tabletserver.TabletServer] INFO : Tablet server
> starting on 172.16.111.3
>
> 2014-03-05 11:17:15,187 [util.FileSystemMonitor] INFO : Filesystem monitor
> started
>
> 2014-03-05 11:17:15,194 [tabletserver.NativeMap] INFO : Loaded native map
> shared library
> /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
>
> 2014-03-05 11:17:15,499 [tabletserver.TabletServer] INFO : port = 9997
>
> 2014-03-05 11:17:15,540 [tabletserver.TabletServer] INFO : Waiting for
> tablet server lock
>
> 2014-03-05 11:17:16,633 [tabletserver.TabletServer] WARN : Got loadTablet
> message from master before lock acquired, ignoring...
>
> 2014-03-05 11:17:16,634 [server.TNonblockingServer] ERROR: Unexpected
> exception while invoking!
>
> java.lang.RuntimeException: Lock not acquired
>
>         at
> org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.checkPermission(TabletServer.java:1782)
>
>         at
> org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.loadTablet(TabletServer.java:1814)
>
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>
>         at java.lang.reflect.Method.invoke(Unknown Source)
>
>         at
> org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
>
>         at com.sun.proxy.$Proxy1.loadTablet(Unknown Source)
>
>         at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$loadTablet.process(TabletClientService.java:2510)
>
>         at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2053)
>
>         at
> org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
>
>         at
> org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
>
>         at
> org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
>
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
>
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
>
>         at
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>
>         at java.lang.Thread.run(Unknown Source)
>
> 2014-03-05 11:17:20,564 [tabletserver.TabletServer] INFO : Waiting for
> tablet server lock
>
> 2014-03-05 11:17:25,589 [tabletserver.TabletServer] INFO : Waiting for
> tablet server lock
>
>
>
> (continues until too many retries, then exits)
>
>
>
> Tablet Server #2's logs get as far as this (below), and then just stop.
>
>
>
> 2014-03-05 11:17:14,112 [tabletserver.TabletServer] INFO : Tablet server
> starting on 172.16.111.3
>
> 2014-03-05 11:17:14,149 [util.FileSystemMonitor] INFO : Filesystem monitor
> started
>
> 2014-03-05 11:17:14,157 [tabletserver.NativeMap] INFO : Loaded native map
> shared library
> /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
>
> 2014-03-05 11:17:14,481 [tabletserver.TabletServer] INFO : port = 9997
>
>
>
> Also, interestingly, the master logs never mention Tablet Server #2's IP
> address.
>
>
>
> Any thoughts? We have another cluster that is set up identically in just
> about every way (besides hostnames), but it has never experienced any of
> these issues. My research shows that these issues can exist within 1.4.3,
> which we were using at first, but we switched to 1.4.4 because these types
> of issues were supposed to be resolved. Any help would be greatly
> appreciated.
>
>
>
> Thanks,
>
>
>
> Alex Lee
>
>
>
