accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Bulk ingest losing tablet server
Date Mon, 13 Jan 2014 20:31:32 GMT
Right... you will want to make sure everything fits comfortably in RAM.

In some larger deployments, users have separated pre-processing onto its
own cluster so those servers can be pushed very hard without affecting
queries.

You can increase the zk timeout up to about 40 seconds.  After that you
have to configure zookeeper to have a longer base "tick time."  If you do
increase the timeout to something like a minute, it is going to take a full
minute before lost servers are detected.
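
To make that concrete (just a sketch, assuming a stock zoo.cfg and the
standard accumulo-site.xml), the Accumulo side is the
instance.zookeeper.timeout property:

    <property>
      <name>instance.zookeeper.timeout</name>
      <value>40s</value>
    </property>

ZooKeeper only grants sessions up to 20 ticks, and the default tick is
2000 ms, which is where the ~40 second ceiling comes from.  To go past it
you would raise tickTime (or maxSessionTimeout) in zoo.cfg, e.g.:

    tickTime=3000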

-Eric



On Mon, Jan 13, 2014 at 1:02 PM, Anthony F <afccri@gmail.com> wrote:

> Yes, system swappiness is set to 0.  I'll run again and gather more logs.
>
> Is there a zookeeper timeout setting that I can adjust to avoid this issue
> and is that advisable?  Basically, the tservers are colocated with HDFS
> datanodes and Hadoop nodemanagers.  The machines are overallocated in terms
> of RAM.  So, I have a feeling that when a map-reduce job is kicked off, it
> causes the tserver to page out to swap space.  Once the map-reduce job
> finishes and the bulk ingest is kicked off, the tserver is paged back in
> and the ZK timeout causes a shutdown.
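>
> (To confirm that theory I can check whether the tserver process actually
> has pages in swap right after the map-reduce job, e.g. something like
>
>     grep VmSwap /proc/<tserver pid>/status
>
> and see if the value is large.)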
>
>
> On Mon, Jan 13, 2014 at 9:19 AM, Eric Newton <eric.newton@gmail.com> wrote:
>
>> We would need to see a little bit more of the logs prior to the error.
>> The tablet server is losing its connection to zookeeper.
>>
>> I have seen problems like this when a tablet server has been pushed into
>> swap.  When the server is tasked to do work, it begins to use the swapped
>> out memory, and the process is paused while the pages are swapped back in.
>>
>> The pauses prevent the zookeeper client API from sending keep-alive
>> messages to zookeeper, so zookeeper thinks the process has died, and the
>> tablet server loses its lock.
>>
>> Have you changed your system's swappiness to zero as outlined in the
>> README?
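>>
>> (For reference, on most Linux systems that means something like
>>
>>     sysctl -w vm.swappiness=0
>>
>> plus a matching vm.swappiness=0 line in /etc/sysctl.conf so it survives
>> a reboot.)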
>>
>> Check the debug lines containing "gc" and verify the server has plenty of
>> free space.
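>>
>> Something like
>>
>>     grep "gc" logs/tserver_*.debug.log
>>
>> should pull those lines out (assuming the default log file naming); the
>> free memory they report should stay comfortably above zero.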
>>
>> -Eric
>>
>>
>> On Mon, Jan 13, 2014 at 8:11 AM, Anthony F <afccri@gmail.com> wrote:
>>
>>> I am losing one or more tservers when bulk importing the results of a
>>> mapreduce job.  After the job is finished and the bulk import is kicked
>>> off, I observe the following in the lost tserver's logs:
>>>
>>> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
>>> unexpected zookeeper event: None for
>>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/recovery
>>> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
>>> unexpected zookeeper event: None for
>>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>>> 2014-01-10 23:14:21,369 [zookeeper.DistributedWorkQueue] ERROR: Failed
>>> to look for work
>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>> KeeperErrorCode = ConnectionLoss for
>>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>>>
>>> However, the bulk import actually succeeded and all is well with the
>>> data in the table.  I have to restart the tserver each time this happens
>>> which is not a viable solution for production.
>>>
>>> I am using Accumulo 1.5.0.  Tservers have 12G of RAM, and index caching,
>>> CF bloom filters, and locality groups are turned on for the table in
>>> question.  Any ideas why this might be happening?
>>>
>>> Thanks,
>>> Anthony
>>>
>>
>>
>
