accumulo-user mailing list archives

From Sean Busbey <bus...@clouderagovt.com>
Subject Re: slave tserver not responding
Date Wed, 01 Jan 2014 03:17:25 GMT
What does the other /etc/hosts file look like?
On Dec 31, 2013 9:14 PM, "Arshak Navruzyan" <arshakn@gmail.com> wrote:

> Josh,
>
> Yes, ZooKeeper is running on the master, and I can connect to it using zkCli
> from the slave.
>
> /etc/hosts looks fine
>
> 127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
> ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
> 10.240.203.36 shoki.c.accumulo-test.internal shoki  # Added by Google
>
> Hmm, completely baffled!
>
> Arshak
>
>
> On Tue, Dec 31, 2013 at 6:35 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> On 12/31/13, 6:37 PM, Arshak Navruzyan wrote:
>>
>>> Here is my route -n
>>>
>>> Kernel IP routing table
>>> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
>>> 10.240.0.1      0.0.0.0         255.255.255.255 UH    0      0        0 eth0
>>> 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
>>> 0.0.0.0         10.240.0.1      0.0.0.0         UG    0      0        0 eth0
>>>
>>>
>>> "slave tserver" is another physical machine (well, a Google Compute Engine
>>> instance). Yes, one GCE instance is running master (and slave) and the
>>> other is running just slave.
>>>
>>> here is my config:
>>>
>>> masters:
>>> 10.240.165.43
>>>
>>> slaves:
>>> 10.240.165.43
>>> 10.240.203.36
>>>
>>> BTW when I run bin/check-slaves conf/slaves
>>> # WRITABLE value not configured, not checking partitions
>>> 10.240.165.43
>>> 10.240.203.36
>>>
>>> Is the master supposed to be listed in the slaves files too?
>>>
>>
>> No, your configuration files look correct.
>>
>> I'm not sure why, but your slave (10.240.203.36) can't talk back to the
>> master (10.240.165.43), so that's where you want to look. You know that the
>> master can talk to the slave (otherwise the slave tserver would never have
>> started) and that the slave tserver can talk to ZooKeeper (it had, and then
>> lost, a lock in ZK). Are you running ZooKeeper on the master? That would
>> further isolate the problem.
>>
>> It may be worthwhile to double check your /etc/hosts entries just to be
>> safe. Aside from that, I can't think of anything else at the moment.
>>
>>
>>> On Tue, Dec 31, 2013 at 3:32 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>>
>>>     Maybe check the output of `route -n` on the master? It might be
>>>     something weird with DNS as well.
>>>
>>>     When you say "slave tserver", are you talking about a separate
>>>     physical machine? You have one node running the Accumulo master and
>>>     another running a tserver?
>>>
>>>
>>>     On 12/31/13, 6:02 PM, Arshak Navruzyan wrote:
>>>
>>>         I configured a new instance with a master and a slave tserver. When I
>>>         do start-all on the master, the slave doesn't come up. I am wondering
>>>         if it's because I left the instance secret as the default. (I get an
>>>         exception when I try to change that.)
>>>
>>>         This is what I see in the master's monitor regarding the slave:
>>>
>>>              Non-Functioning Tablet Servers
>>>              The following tablet servers reported a status other than Online
>>>
>>>              10.240.203.36:9997  UNRESPONSIVE
>>>
>>>
>>>
>>>         In the master log I see the following
>>>
>>>              2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>>              2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>>              2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
>>>              2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>>              2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
>>>              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>>
>>>
>>>
>>>
>>>         In the slave's tserver.log all I see is
>>>
>>>              2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
>>>
>>>
>>>
>
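
Since the master log reports NoRouteToHostException for 10.240.203.36:9997, it may help to test plain TCP reachability in both directions before digging further into Accumulo itself. Below is a minimal Python sketch, not anything that ships with Accumulo: check_tcp is a hypothetical helper name, 9997 is the tserver port shown in the monitor output above, and 2181 is only an assumption (the ZooKeeper default client port), so adjust it if your configuration differs.

```python
import errno
import socket


def check_tcp(host, port, timeout=3.0):
    """Attempt a TCP connect and return a short status string.

    Hypothetical helper for illustration; not part of Accumulo.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout (packets silently dropped).
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # Same condition Java reports as NoRouteToHostException.
            return "no route to host"
        return "error: %s" % exc


if __name__ == "__main__":
    # Addresses come from the thread; the ZooKeeper port is an assumption.
    targets = [
        ("10.240.203.36", 9997),  # slave tserver, as shown in the monitor
        ("10.240.165.43", 2181),  # master host, assumed ZooKeeper client port
    ]
    for host, port in targets:
        print("%s:%d -> %s" % (host, port, check_tcp(host, port)))
```

Run it on the master to probe the slave's tserver port, and on the slave to probe the master. If a host answers ping but the connect still reports "no route to host", the cause can be a firewall rule rejecting the connection (for example an iptables REJECT) rather than the routing table.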
