accumulo-user mailing list archives

From Michael Wall <mjw...@gmail.com>
Subject Re: slave tserver not responding
Date Wed, 01 Jan 2014 20:46:02 GMT
I don't know if it helps with debugging, but I am seeing the following in
tserver_shrine.log:

2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Exception in
createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad
connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Abandoning block
blk_-2756969025267118869_1348
2014-01-01 06:15:37,855 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,147 [hdfs.DFSClient] INFO : Exception in
createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad
connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:38,148 [hdfs.DFSClient] INFO : Abandoning block
blk_2883724569463729419_1349
2014-01-01 06:15:38,149 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:39,559 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:40,565 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:41,571 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:42,578 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:43,586 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:44,594 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
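
A minimal sketch for summarizing a log like the excerpt above, assuming each entry starts with a `YYYY-MM-DD HH:MM:SS,mmm [logger] LEVEL:` prefix as these lines do. The names `LINE` and `tally` are illustrative and not part of Accumulo. Wrapped continuation lines do not match the prefix and are skipped.

```python
import re
from collections import Counter

# Matches the start of an entry such as:
#   2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR:
# The level may be followed by a space before the colon ("INFO :").
LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \[([^\]]+)\] (\w+)\s*:"
)


def tally(lines):
    """Count log entries per (logger, level) pair."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Calling `tally(open("tserver_shrine.log"))` and printing `most_common()` on the result would show at a glance which logger and level dominate, for example the repeated `client.ClientServiceHandler` errors.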



On Wed, Jan 1, 2014 at 2:28 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Sure -- you have my address already.
>
> Also, nc not working while the tabletserver is dead makes sense (that
> process is what's listening on that port). Once the process dies, there's
> nothing else listening.
>
>
> On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:
>
>> If anyone wants to look at my live environment please let me know (your
>> gmail) and I will add you to the Google Compute Engine.  Thanks!
>>
>>
>> On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>>
>>     Sean
>>
>>     Thanks for looking into the log files.
>>
>>     These are two Google Compute Engine instances under the same project
>>     so there shouldn't be any firewall between them.
>>
>>     For the brief moment that the slave runs during startup, I can nc
>>     into port 9997 from the master to the slave.  But after it crashes,
>>     I can't.  Seems like somehow the problem is on the slave.
>>
>>     Arshak
>>
>>     On Dec 31, 2013 11:58 PM, "Sean Busbey" <busbey+ml@clouderagovt.com> wrote:
>>
>>         Well, I can tell you the proximal cause.  The tserver log shows
>>         that it starts normally, then exits because it's told to (via
>>         the zookeeper lock being removed).
>>
>>         If you look at the master debug logs, this happens because the
>>         master fails in three attempts to talk to the tserver, all with
>>         the same error:
>>
>>         2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get
>>         tablet server status 10.240.203.36:9997[1434c70ed30001b]
>>         org.apache.thrift.transport.TTransportException:
>>         java.net.NoRouteToHostException: No route to host
>>
>>         Unfortunately, this is the same error you noticed in your first
>>         email. After 3 of those, the master deletes the zk lock so that
>>         the tserver will shut down.
>>
>>         Could there be another firewall blocking access to port 9997 on
>>         the worker machine from the master machine?
>>
>>         Check from the master (you'll need netcat):
>>
>>         $ nc -z 10.240.203.36 9997
>>         $ echo $?
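
Where netcat is not installed, the same check can be sketched in Python. This is a hedged illustration rather than anything from Accumulo: `probe` is a hypothetical helper, and the 3-second timeout is an arbitrary choice. Unlike the bare exit status of `nc -z`, it separates "refused" (the host answered but nothing is listening, as when the tserver process has died) from "unreachable" (roughly the condition Java reports as `NoRouteToHostException`).

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Classify the outcome of a single TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host is reachable, but no process is listening on the port.
        return "refused"
    except socket.timeout:
        # No reply within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # EHOSTUNREACH / ENETUNREACH: no route to the host or network.
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"
        return "error: %s" % exc
```

Run from the master, `print(probe("10.240.203.36", 9997))` would report which of these cases applies while the tserver is starting and again after it has exited.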
>>
>>
>>
>>
>>
>>         On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan
>>         <arshakn@gmail.com> wrote:
>>
>>             I am probably missing something really basic so I posted
>>             both the master and the slave log files:
>>
>>             https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
>>
>>             Thanks again to everyone for the help!
>>
>>
>>             On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan
>>             <arshakn@gmail.com> wrote:
>>
>>                 Disabled SELinux (iptables already off) on both master
>>                 and slave, but it didn't make a difference, unfortunately.
>>
>>
>>
>>                 On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen
>>                 <hoodel@hoodel.com> wrote:
>>
>>
>>                     SELINUX disabled? IPTABLES configured? I have
>>                     nothing else.
>>
>>                     Kurt
>>
>>                     ------
>>
>>
>>                     On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>>
>>                         I configured a new instance with a master and a
>>                         slave tserver.  When I do start-all on the
>>                         master, the slave doesn't come up.  I am
>>                         wondering if it's because I left the instance
>>                         secret as the default. (I get an exception when
>>                         I try to change that).
>>
>>                         This is what I see in the master's monitor
>>                         regarding the slave
>>
>>                              Non-Functioning Tablet Servers
>>                              The following tablet servers reported a
>>                              status other than Online
>>
>>                              10.240.203.36:9997  UNRESPONSIVE
>>
>>
>>
>>                         In the master log I see the following
>>
>>                              2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get
>>                              tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>                              org.apache.thrift.transport.TTransportException:
>>                              java.net.NoRouteToHostException: No route to host
>>                              2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get
>>                              tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>                              org.apache.thrift.transport.TTransportException:
>>                              java.net.NoRouteToHostException: No route to host
>>                              2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>>                              class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer
>>                              for table !0
>>                              2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>                              2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>>                              table state for store Root Tablet
>>                              org.apache.thrift.transport.TTransportException:
>>                              java.net.NoRouteToHostException: No route to host
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>
>>
>>
>>
>>                         In the slave's tserver.log all I see is
>>
>>                              2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>>                              tablet server lock (reason = LOCK_DELETED), exiting.
>>
>>
>>                     --
>>
>>                     Kurt Christensen
>>                     P.O. Box 811
>>                     Westminster, MD 21158-0811
>>
>>                     ------------------------------------------------------------------------
>>
>>                     If you can't explain it simply, you don't understand
>>                     it well enough. -- Albert Einstein
>>
>>
>>
>>
>>
>>
>>         --
>>         Sean
>>
>>
>>
