accumulo-user mailing list archives

From Sean Busbey <bus...@cloudera.com>
Subject Re: Losing tservers - Unusually high Last Contact times
Date Tue, 20 May 2014 02:58:35 GMT
Another thing to check on the zookeeper servers is the iowait times for
whatever virtual disk the ZK transaction log is using.
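
As a rough way to run that check, here is a minimal Python sketch that estimates iowait from two readings of /proc/stat on a Linux host. It assumes the standard field order on the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal). The function names are hypothetical, and the result is the system-wide CPU iowait share rather than a per-disk figure, so it is only a first indication; a per-device tool such as iostat would be needed to tie it to the disk holding the transaction log.

```python
import time


def parse_cpu_line(stat_text):
    """Return the first eight counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Order: user, nice, system, idle, iowait, irq, softirq, steal.
            # The guest counters that follow are already included in user/nice,
            # so they are left out to avoid double counting.
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two /proc/stat snapshots."""
    start = parse_cpu_line(before)
    end = parse_cpu_line(after)
    deltas = [e - s for s, e in zip(start, end)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter


def sample_iowait(interval_s=5.0, path="/proc/stat"):
    """Read /proc/stat twice, interval_s seconds apart, and return the iowait fraction."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return iowait_fraction(before, after)
```

Calling sample_iowait() on each zookeeper server while the tservers are dropping gives a number to compare across nodes; a value that is consistently high on one node would be worth following up with per-device statistics.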

-- 
Sean
On May 19, 2014 8:20 PM, "Keith Turner" <keith@deenlo.com> wrote:

>
>
>
> On Mon, May 19, 2014 at 6:56 PM, <dlmarion@comcast.net> wrote:
>
>> You are hitting the zookeeper timeout, default 30s I believe. You said you
>> are not oversubscribed for memory, but what about CPU? Are you running YARN
>> processes on the same nodes as the tablet servers? Is the tablet server
>> being pushed into swap or starved of CPU?
>>
>
> Also check on the zookeeper server nodes.  Is Java GC pausing tservers or
> zookeeper servers?
>
>
>>
>> -----Original Message-----
>> From: thomasa [mailto:thomas@ccri.com]
>> Sent: Monday, May 19, 2014 4:22 PM
>> To: user@accumulo.apache.org
>> Subject: Losing tservers - Unusually high Last Contact times
>>
>> Hello all,
>>
>> I am having issues with tablet servers going down due to poor contact times
>> (my hypothesis at least). In the past I have had stability success with
>> smaller clouds (20-40 nodes), but have run into issues with a larger number
>> of nodes (150+). Each node is a datanode, nodemanager, and tablet server.
>> There is a master node that is running the hadoop namenode, hadoop resource
>> manager and accumulo master, monitor, etc. There are three zookeeper nodes.
>> All nodes are VMs. This same setup is used on the smaller, stable clouds as
>> well.
>>
>> I do not believe memory allocation is an issue as I have only given
>> hadoop/yarn (2.2.0) and accumulo (1.5.1) less than half of the available
>> memory. The FATAL errors I have seen are:
>>
>> Lost tablet server lock (resaon = SESSION_EXPIRED), exiting
>>
>> Lost ability to monitor tablet server lock, exiting
>>
>> Other than bumping up the rpc timeout (which I have done, though I would
>> rather find the root cause of the problem), I have run out of ideas on how
>> to solve this issue.
>>
>> Does anyone have any insight into why I would be seeing such bad response
>> times between nodes? Are there any configuration parameters I can play with
>> to fix this?
>>
>> I realize this is a very general question, so let me know if there is any
>> information I can provide to help clarify the issue.
>>
>> Thank you in advance for your time.
>>
>> Thomas
>>
>>
>>
>> --
>> View this message in context:
>>
>> http://apache-accumulo.1065345.n5.nabble.com/Losing-tservers-Unusually-high-Last-Contact-times-tp9950.html
>> Sent from the Users mailing list archive at Nabble.com.
>>
>>
>
