accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Bulk ingest losing tablet server
Date Mon, 13 Jan 2014 14:19:05 GMT
We would need to see a little bit more of the logs prior to the error.  The
tablet server is losing its connection to zookeeper.

I have seen problems like this when a tablet server has been pushed into
swap.  When the server is tasked to do work, it begins to use the swapped
out memory, and the process is paused while the pages are swapped back in.

The pauses prevent the zookeeper client API from sending keep-alive
messages to zookeeper, so zookeeper thinks the process has died, and the
tablet server loses its lock.
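
To make the timing argument concrete, here is a minimal sketch of a simplified model, not the actual zookeeper client or server logic. It assumes the session expires once no keep-alive has arrived for longer than the session timeout. The function name, the heartbeat times, and the 30 second timeout are all hypothetical.

```python
from typing import Optional, Sequence


def first_session_expiry(
    heartbeat_times: Sequence[float], session_timeout: float
) -> Optional[float]:
    """Return the time at which the session would expire, or None.

    Simplified assumption: the session expires when the gap between two
    consecutive keep-alives is longer than session_timeout seconds.
    """
    for prev, curr in zip(heartbeat_times, heartbeat_times[1:]):
        if curr - prev > session_timeout:
            # The session is considered dead session_timeout seconds
            # after the last keep-alive that did arrive.
            return prev + session_timeout
    return None


if __name__ == "__main__":
    # Hypothetical timeline: the process is paused between t=20 and t=75
    # while swapped-out pages are brought back in.
    print(first_session_expiry([0, 10, 20, 75, 85], session_timeout=30))
```

In this model, a single pause longer than the timeout is enough to lose the session, even though the process resumes and keeps sending keep-alives afterwards.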

Have you changed your system's swappiness to zero as outlined in the README?
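
One way to check this on a Linux host is sketched below. It reads the current swappiness from /proc/sys/vm/swappiness and the swapped-out memory of a process from the VmSwap field of /proc/<pid>/status. The helper names are made up for illustration, and the tablet server pid has to be supplied by you.

```python
import sys
from pathlib import Path
from typing import Optional


def parse_swappiness(text: str) -> int:
    """Parse the contents of /proc/sys/vm/swappiness."""
    return int(text.strip())


def parse_vmswap_kb(status_text: str) -> Optional[int]:
    """Return the VmSwap value in kB from /proc/<pid>/status text, if present."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # The line looks like "VmSwap:     1024 kB".
            return int(line.split()[1])
    return None


def report(pid: int) -> None:
    """Print the system swappiness and the swap usage of one process."""
    swappiness = parse_swappiness(Path("/proc/sys/vm/swappiness").read_text())
    swapped = parse_vmswap_kb(Path(f"/proc/{pid}/status").read_text())
    print(f"vm.swappiness = {swappiness}")
    print(f"pid {pid} VmSwap = {swapped} kB")


if __name__ == "__main__":
    # Usage: python check_swap.py <tserver pid>
    report(int(sys.argv[1]))
```

A nonzero VmSwap for the tablet server process would be consistent with the swapping scenario above. The setting itself can be changed at runtime as root with `sysctl -w vm.swappiness=0`.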

Check the debug lines containing "gc" and verify the server has plenty of
free memory.
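
A small filter along these lines can pull those lines out of a log. It is only a sketch: it assumes the level name DEBUG appears in the line, as the level does in the log excerpts quoted below, and it matches "gc" as a whole word. The function name is hypothetical and the log file path is passed on the command line.

```python
import re
import sys
from typing import Iterable, List

# Match "gc" as a standalone word so that longer words containing
# the letters "gc" are not picked up.
GC_WORD = re.compile(r"\bgc\b")


def gc_debug_lines(lines: Iterable[str]) -> List[str]:
    """Return the DEBUG lines that mention "gc", without trailing newlines."""
    return [
        line.rstrip("\n")
        for line in lines
        if "DEBUG" in line and GC_WORD.search(line)
    ]


if __name__ == "__main__":
    # Usage: python gc_lines.py <path to tablet server debug log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for line in gc_debug_lines(log):
            print(line)
```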

-Eric


On Mon, Jan 13, 2014 at 8:11 AM, Anthony F <afccri@gmail.com> wrote:

> I am losing one or more tservers when bulk importing the results of a
> mapreduce job.  After the job finishes and the bulk import is kicked off,
> I observe the following in the lost tserver's logs:
>
> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
> unexpected zookeeper event: None for
> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/recovery
> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
> unexpected zookeeper event: None for
> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
> 2014-01-10 23:14:21,369 [zookeeper.DistributedWorkQueue] ERROR: Failed to
> look for work
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for
> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>
> However, the bulk import actually succeeded and all is well with the data
> in the table.  I have to restart the tserver each time this happens, which
> is not a viable solution for production.
>
> I am using Accumulo 1.5.0.  Tservers have 12G of RAM and index caching, CF
> bloom filters, and groups are turned on for the table in question.  Any
> ideas why this might be happening?
>
> Thanks,
> Anthony
>
