accumulo-user mailing list archives

From <dlmar...@comcast.net>
Subject RE: Losing tservers - Unusually high Last Contact times
Date Mon, 19 May 2014 22:56:30 GMT
You are hitting the ZooKeeper session timeout (default 30s, I believe). You
said you are not oversubscribed for memory, but what about CPU? Are you
running YARN processes on the same nodes as the tablet servers? Is the tablet
server being pushed into swap or starved of CPU?
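
A minimal sketch of one way to answer the swap and CPU questions on a tablet
server node, assuming Linux with /proc/meminfo and /proc/loadavg available.
The function names are made up for illustration and are not part of Accumulo
or Hadoop.

```python
def swap_used_kib(meminfo_text):
    """Return swap in use (KiB), given the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def load_per_core(loadavg_text, cores):
    """Return the 1-minute load average divided by the number of cores."""
    return float(loadavg_text.split()[0]) / cores


def report():
    """Print both figures for the local node (Linux only)."""
    import os
    with open("/proc/meminfo") as f:
        print("swap used (KiB):", swap_used_kib(f.read()))
    with open("/proc/loadavg") as f:
        cores = os.cpu_count() or 1
        print("1-min load per core:", round(load_per_core(f.read(), cores), 2))
```

Calling report() on a node around the time a tserver drops out gives a rough
picture: swap in use that keeps growing, or a load per core well above 1,
would point toward the process being paused long enough to miss the timeout.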

-----Original Message-----
From: thomasa [mailto:thomas@ccri.com] 
Sent: Monday, May 19, 2014 4:22 PM
To: user@accumulo.apache.org
Subject: Losing tservers - Unusually high Last Contact times

Hello all,

I am having issues with tablet servers going down due to poor contact times
(my hypothesis at least). In the past I have had stability success with
smaller clouds (20-40 nodes), but have run into issues with a larger number
of nodes (150+). Each node is a datanode, nodemanager, and tablet server.
There is a master node that is running the hadoop namenode, hadoop resource
manager and accumulo master, monitor, etc. There are three zookeeper nodes.
All nodes are VMs. This same setup is used on the smaller, stable clouds as
well.

I do not believe memory allocation is an issue as I have only given
hadoop/yarn (2.2.0) and accumulo (1.5.1) less than half of the available
memory. The FATAL errors I have seen are:

Lost tablet server lock (reason = SESSION_EXPIRED), exiting

Lost ability to monitor tablet server lock, exiting

I have bumped up the rpc timeout, but I would rather find the root cause of
the problem than rely on that. Beyond that, I have run out of ideas on how
to solve this issue.

Does anyone have any insight into why I would be seeing such bad response
times between nodes? Are there any configuration parameters I can play with
to fix this?

I realize this is a very general question, so let me know if there is any
information I can provide to help clarify the issue.

Thank you in advance for your time.

Thomas




