accumulo-user mailing list archives

From Josh Elser
Subject Re: Bulk ingest losing tablet server
Date Mon, 13 Jan 2014 19:31:21 GMT
You can alter instance.zookeeper.timeout in Accumulo.

It defaults to 30 seconds. You can override it by setting the property in
accumulo-site.xml or by running `config -s instance.zookeeper.timeout=60s`
in the Accumulo shell.
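
For the accumulo-site.xml route, here is a minimal sketch of the entry,
assuming the usual Hadoop-style property layout that file uses. The 60s
value is only the illustrative figure from the shell command, not a
recommendation:

```xml
<!-- accumulo-site.xml: raise the ZooKeeper session timeout above the 30s default.
     60s is an illustrative value; choose one that matches how quickly you
     need failures to be detected. -->
<property>
  <name>instance.zookeeper.timeout</name>
  <value>60s</value>
</property>
```

I would expect the site file to need the same value on every node, and the
servers to need a restart before the change is picked up.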

Beware that this can make your system less responsive to these
failures: the time it takes Accumulo to notice a failure, reassign
tablets, and recover will increase with your new timeout.

As far as logs go, you should be able to see something near the end of 
the tserver*.debug.log file that lets you know that the tabletserver 
lost its lock. You shouldn't have to dig really hard if this is the case.

On 1/13/14, 1:02 PM, Anthony F wrote:
> Yes, system swappiness is set to 0.  I'll run again and gather more logs.
> Is there a zookeeper timeout setting that I can adjust to avoid this
> issue and is that advisable?  Basically, the tservers are colocated with
> HDFS datanodes and Hadoop nodemanagers.  The machines are overallocated
> in terms of RAM.  So, I have a feeling that when a map-reduce job is
> kicked off, it causes the tserver to page out to swap space.  Once the
> map-reduce job finishes and the bulk ingest is kicked off, the tserver
> is paged back in and the ZK timeout causes a shutdown.
> On Mon, Jan 13, 2014 at 9:19 AM, Eric Newton wrote:
>     We would need to see a little bit more of the logs prior to the
>     error.  The tablet server is losing its connection to zookeeper.
>     I have seen problems like this when a tablet server has been pushed
>     into swap.  When the server is tasked to do work, it begins to use
>     the swapped out memory, and the process is paused while the pages
>     are swapped back in.
>     The pauses prevent the zookeeper client API from sending keep-alive
>     messages to zookeeper, so zookeeper thinks the process has died, and
>     the tablet server loses its lock.
>     Have you changed your system's swappiness to zero as outlined in the
>     README?
>     Check the debug lines containing "gc" and verify the server has
>     plenty of free space.
>     -Eric
>     On Mon, Jan 13, 2014 at 8:11 AM, Anthony F wrote:
>         When bulk importing the results of a mapreduce job, I am
>         losing one or more tservers.  After the job is finished and
>         the bulk import is kicked off, I observe the following in the
>         lost tserver's logs:
>         2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO :
>         Got unexpected zookeeper event: None for
>         /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/recovery
>         2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO :
>         Got unexpected zookeeper event: None for
>         /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>         2014-01-10 23:14:21,369 [zookeeper.DistributedWorkQueue] ERROR:
>         Failed to look for work
>         org.apache.zookeeper.KeeperException$ConnectionLossException:
>         KeeperErrorCode = ConnectionLoss for
>         /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>         However, the bulk import actually succeeded and all is well with
>         the data in the table.  I have to restart the tserver each time
>         this happens which is not a viable solution for production.
>         I am using Accumulo 1.5.0.  Tservers have 12G of RAM and index
>         caching, CF bloom filters, and groups are turned on for the
>         table in question. Any ideas why this might be happening?
>         Thanks,
>         Anthony
