MIME-Version: 1.0
X-Received: by 10.140.86.202 with SMTP id p68mr8071202qgd.81.1394043893689; Wed, 05 Mar 2014 10:24:53 -0800 (PST)
Received: by 10.96.41.40 with HTTP; Wed, 5 Mar 2014 10:24:53 -0800 (PST)
In-Reply-To: <88042692adc34d899f0ce819d6c35ab3@BLUPR05MB248.namprd05.prod.outlook.com>
References: <88042692adc34d899f0ce819d6c35ab3@BLUPR05MB248.namprd05.prod.outlook.com>
Date: Wed, 5 Mar 2014 13:24:53 -0500
Message-ID:
Subject: Re: Tablet server stuck waiting for lock
From: Eric Newton
To: "user@accumulo.apache.org"
Content-Type: multipart/alternative; boundary=001a11c131b815b78b04f3e0230c
X-Virus-Checked: Checked by ClamAV on apache.org

--001a11c131b815b78b04f3e0230c
Content-Type: text/plain; charset=ISO-8859-1

On the monitor page, there's a box that shows your zookeepers and their status. What does it say?

-Eric

On Wed, Mar 5, 2014 at 1:09 PM, Alex Lee wrote:

> DFS permissions are currently disabled. I'm using the accumulo user for
> "accumulo init" and for "start-all.sh", and it is also the user that has
> passwordless SSH enabled.
>
> I ran "hadoop fs -ls /accumulo" as the accumulo user on both tablet
> servers, and I am able to see inside the /accumulo directory on HDFS.
>
> Alex
>
> *From:* Ott, Charlie H. [mailto:CHARLES.H.OTT@leidos.com]
> *Sent:* Wednesday, March 05, 2014 1:02 PM
> *To:* user@accumulo.apache.org
> *Subject:* RE: Tablet server stuck waiting for lock
>
> The connection reset by peer from the Master, in combination with the lock
> not acquired by the tablet server, makes me wonder if the process owner for
> the tablet server is able to access HDFS correctly.
>
> Are DFS permissions enabled on your HDFS? It makes me think the tablet
> server does not have permission to read from the /accumulo path that was
> initialized on the master. Did you use the same account for 'accumulo init'?
>
> *From:* user-return-3823-CHARLES.H.OTT=leidos.com@accumulo.apache.org
> [mailto:user-return-3823-CHARLES.H.OTT=leidos.com@accumulo.apache.org]
> *On Behalf Of* Alex Lee
> *Sent:* Wednesday, March 05, 2014 12:17 PM
> *To:* user@accumulo.apache.org
> *Subject:* Tablet server stuck waiting for lock
>
> Hello,
>
> I'm trying to create a virtualized Accumulo 1.4.4 cluster with 4 tablet
> servers using Hadoop 0.20.2 and ZooKeeper 3.3.5. It didn't seem to be
> working correctly with 4 tablet servers, so I first tried running with
> just one tablet server, which seemed to work fine. When I tried to run it
> with just 2 tablet servers, I ran into some issues.
>
> Just to preface, I double-checked the configs within ZooKeeper and Accumulo,
> and everything matches. All hostnames are resolving correctly, and
> passwordless SSH for the accumulo user is also functional between all
> nodes. Running "echo stat | nc <zk-server> <zk port>" responds
> appropriately.
>
> Here's the first error log for the Tablet Master:
>
> 2014-03-05 11:18:16,626 [master.Master] ERROR: Error processing table state for store Root Tablet
> org.apache.thrift.transport.TTransportException: java.io.IOException: Connection reset by peer
>         at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
>         at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:158)
>         at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.flush(ThriftTransportPool.java:299)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.send_loadTablet(TabletClientService.java:653)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.loadTablet(TabletClientService.java:640)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$2.invoke(TraceWrap.java:84)
>         at com.sun.proxy.$Proxy4.loadTablet(Unknown Source)
>         at org.apache.accumulo.server.master.LiveTServerSet$TServerConnection.assignTablet(LiveTServerSet.java:86)
>         at org.apache.accumulo.server.master.Master$TabletGroupWatcher.flushChanges(Master.java:1818)
>         at org.apache.accumulo.server.master.Master$TabletGroupWatcher.run(Master.java:1426)
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.SocketDispatcher.write(Unknown Source)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
>         at sun.nio.ch.IOUtil.write(Unknown Source)
>         at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>         at java.io.BufferedOutputStream.flush(Unknown Source)
>         at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
>         ... 13 more
>
> Here are the error logs for Tablet Server #1:
>
> 2014-03-05 11:17:15,152 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
> 2014-03-05 11:17:15,187 [util.FileSystemMonitor] INFO : Filesystem monitor started
> 2014-03-05 11:17:15,194 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
> 2014-03-05 11:17:15,499 [tabletserver.TabletServer] INFO : port = 9997
> 2014-03-05 11:17:15,540 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2014-03-05 11:17:16,633 [tabletserver.TabletServer] WARN : Got loadTablet message from master before lock acquired, ignoring...
> 2014-03-05 11:17:16,634 [server.TNonblockingServer] ERROR: Unexpected exception while invoking!
> java.lang.RuntimeException: Lock not acquired
>         at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.checkPermission(TabletServer.java:1782)
>         at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.loadTablet(TabletServer.java:1814)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59)
>         at com.sun.proxy.$Proxy1.loadTablet(Unknown Source)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$loadTablet.process(TabletClientService.java:2510)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor.process(TabletClientService.java:2053)
>         at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154)
>         at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631)
>         at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Unknown Source)
> 2014-03-05 11:17:20,564 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2014-03-05 11:17:25,589 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
>
> (continues until too many retries, then exits)
>
> Tablet Server #2's logs get as far as this (below), and then just stop.
>
> 2014-03-05 11:17:14,112 [tabletserver.TabletServer] INFO : Tablet server starting on 172.16.111.3
> 2014-03-05 11:17:14,149 [util.FileSystemMonitor] INFO : Filesystem monitor started
> 2014-03-05 11:17:14,157 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo/lib/native/map/libNativeMap-Linux-amd64-64.so
> 2014-03-05 11:17:14,481 [tabletserver.TabletServer] INFO : port = 9997
>
> Also, interestingly, the master logs never show any calls to Tablet #2's IP
> address.
>
> Any thoughts? We have another cluster that is set up identically in just
> about every way (besides hostnames), but it has never experienced any of
> these issues. My research shows that these issues can exist within 1.4.3,
> which we were using at first, but we switched to 1.4.4 because these types
> of issues were supposed to be resolved. Any help would be greatly
> appreciated.
>
> Thanks,
>
> Alex Lee

--001a11c131b815b78b04f3e0230c
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
--001a11c131b815b78b04f3e0230c--
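
The "echo stat | nc" check mentioned in the thread can also be scripted when several ZooKeeper servers need to be polled. What follows is only a minimal Python sketch of that same check, not anything taken from Accumulo or from the posters' setup. The helper names `four_letter_word` and `parse_mode` are hypothetical, and it assumes the server answers the "stat" four-letter command with a text reply containing a "Mode:" line and then closes the connection.

```python
import socket
from typing import Optional


def four_letter_word(host: str, port: int, command: str = "stat",
                     timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter command and return the raw text reply.

    Hypothetical helper: it does the same thing as `echo stat | nc host port`.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # assumption: the server closes the socket after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply: str) -> Optional[str]:
    """Return the value of the 'Mode:' line in a stat reply, or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Calling `parse_mode(four_letter_word(host, port))` for each host and client port listed in the ZooKeeper configuration gives a quick view of which servers respond and in what mode. A connection failure surfaces as an `OSError`, which is itself a useful signal when a server is unreachable from a tablet server node.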