accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: slave tserver not responding
Date Wed, 01 Jan 2014 23:35:51 GMT
Ok -- turned out to be a couple of little things, with one big one :D

The big one -- iptables was still running on the slave :)

I noticed that you were getting the same NoRouteToHostException in the
datanode logs when it tried to replicate, so I assumed the cause was
something outside of Hadoop. Running `telnet slave_ip_addr port` with the
address and port from the stack trace confirmed that I could not connect.
iptables had an exception for SSH, which is why SSH'ing still worked and
Arshak could start the processes.
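
A rough sketch of that kind of check, assuming bash (for its /dev/tcp redirection) and the coreutils `timeout` command. `probe_port` is a made-up helper name, and the address in the comments is the tserver from the logs further down. When a host answers SSH but other ports report "No route to host", a firewall REJECT rule is a common cause (for example, Red Hat style default rule sets reject unmatched traffic with icmp-host-prohibited).

```shell
# probe_port HOST PORT: try to open a TCP connection and report the outcome.
# Hypothetical helper; relies on bash's /dev/tcp and coreutils `timeout`.
probe_port() {
    local host=$1 port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open"
    else
        echo "blocked-or-closed"
    fi
}

# Example, run from the master against the tserver in the logs:
#   probe_port 10.240.203.36 9997
# If that fails while SSH still works, list the firewall rules on the slave (as root):
#   iptables -L INPUT -n --line-numbers
```

The helper cannot tell a firewall from a dead process on its own, so it is only useful while the tserver is actually running.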

Small things:

ifconfig showed that IPv6 was still enabled on the interfaces. I disabled
it via procfs and made the change permanent via sysctl. That would likely
have caused more trouble, but I noticed it before I found iptables.
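
For reference, a sketch of what the procfs and sysctl steps typically look like on Linux. The `disable_ipv6` keys are the standard kernel ones; the `ipv6_disabled` helper and its optional path argument are only for illustration, and the exact commands run on this machine may have differed.

```shell
# ipv6_disabled [FLAG_FILE]: succeed if the kernel flag reads 1 (IPv6 disabled).
# The optional argument exists only so the check can be pointed at another file.
ipv6_disabled() {
    local flag_file=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
    [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}

# Disable immediately via procfs (as root; lost on reboot):
#   echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
#   echo 1 > /proc/sys/net/ipv6/conf/default/disable_ipv6
# Make it permanent by adding these lines to /etc/sysctl.conf, then run `sysctl -p`:
#   net.ipv6.conf.all.disable_ipv6 = 1
#   net.ipv6.conf.default.disable_ipv6 = 1

if ipv6_disabled; then
    echo "IPv6 is disabled"
else
    echo "IPv6 is not confirmed disabled"
fi
```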

The max open files limit was still at 1024, which was likely to cause you
more problems. I raised it for the user you run Accumulo as.
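
A sketch of checking and raising the open-file limit. The 65536 target, the `accumulo` user name, and the `nofile_ok` helper are placeholders rather than values from this thread; on most Linux systems the per-user limit is set through pam_limits in /etc/security/limits.conf.

```shell
# nofile_ok CURRENT WANTED: succeed if the current limit meets the wanted one.
nofile_ok() {
    local current=$1 wanted=$2
    [ "$current" = "unlimited" ] || [ "$current" -ge "$wanted" ]
}

# Compare this shell's soft limit against an assumed target of 65536.
if nofile_ok "$(ulimit -n)" 65536; then
    echo "open-file limit looks fine"
else
    echo "open-file limit too low: $(ulimit -n)"
fi

# To raise it for the (assumed) accumulo user, add to /etc/security/limits.conf
# and log in again so the new limits apply:
#   accumulo  soft  nofile  65536
#   accumulo  hard  nofile  65536
```

Run the check as the user that starts Accumulo, since limits are per user and per session.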

- Josh

On 1/1/14, 2:28 PM, Josh Elser wrote:
> Sure -- you have my address already.
>
> Also, nc not working while the tabletserver is dead makes sense (that
> process is what's listening on that port). Once the process dies,
> there's nothing else listening.
>
> On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:
>> If anyone wants to look at my live environment please let me know (your
>> gmail) and I will add you to the Google Compute Engine.  Thanks!
>>
>>
>> On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>>
>>     Sean
>>
>>     Thanks for looking into the log files.
>>
>>     These are two Google Compute Engine instances under the same project
>>     so there shouldn't be any firewall between them.
>>
>>     For the brief moment that the slave runs during startup, I can nc
>>     into port 9997 from the master to the slave.  But after it crashes,
>>     I can't.  Seems like somehow the problem is on the slave.
>>
>>     Arshak
>>
>>     On Dec 31, 2013 11:58 PM, "Sean Busbey" <busbey+ml@clouderagovt.com> wrote:
>>
>>         Well, I can tell you the proximal cause. The tserver log shows
>>         that it starts normally, then exits because it's told to (via
>>         the zookeeper lock being removed).
>>
>>         If you look at the master debug logs, this happens because the
>>         master fails in three attempts to talk to the tserver, all with
>>         the same error:
>>
>>         2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get
>>         tablet server status 10.240.203.36:9997[1434c70ed30001b]
>>         org.apache.thrift.transport.TTransportException:
>>         java.net.NoRouteToHostException: No route to host
>>
>>         Unfortunately, this is the same error you noticed in your first
>>         email. After 3 of those, the master deletes the zk lock so that
>>         the tserver will shut down.
>>
>>         Could there be another firewall blocking access to port 9997 on
>>         the worker machine from the master machine?
>>
>>         Check from the master (you'll need netcat):
>>
>>         $ nc -z 10.240.203.36 9997
>>         $ echo $?
>>
>>
>>
>>
>>
>>         On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>>
>>             I am probably missing something really basic so I posted
>>             both the master and the slave log files:
>>
>>             https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
>>
>>             Thanks again to everyone for the help!
>>
>>
>>             On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>>
>>                 disabled selinux (iptables already off) on both master
>>                 and slave but didn't make a difference unfortunately.
>>
>>
>>
>>                 On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen <hoodel@hoodel.com> wrote:
>>
>>
>>                     SELINUX disabled? IPTABLES configured? I have
>>                     nothing else.
>>
>>                     Kurt
>>
>>                     ------
>>
>>
>>                     On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>>
>>                         I configured a new instance with a master and a
>>                         slave tserver.  When I do start-all on the
>>                         master, the slave doesn't come up.  I am
>>                         wondering if it's because I left the instance
>>                         secret as the default. (I get an exception when
>>                         I try to change that).
>>
>>                         This is what I see in the master's monitor
>>                         regarding the slave
>>
>>                              Non-Functioning Tablet Servers
>>                              The following tablet servers reported a status other than Online
>>
>>                         10.240.203.36:9997  UNRESPONSIVE
>>
>>
>>
>>                         In the master log I see the following
>>
>>                              2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>                              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>                              2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>                              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>                              2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
>>                              2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>                              2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
>>                              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>>                                      at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>
>>
>>
>>
>>                         In the slave's tserver.log all I see is
>>
>>                              2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
>>
>>
>>                     --
>>
>>                     Kurt Christensen
>>                     P.O. Box 811
>>                     Westminster, MD 21158-0811
>>
>>
>>                     ------------------------------------------------------------------------
>>
>>                     If you can't explain it simply, you don't understand
>>                     it well enough. -- Albert Einstein
>>
>>
>>
>>
>>
>>
>>         --
>>         Sean
>>
>>
