accumulo-user mailing list archives

From Patrick Lynch <>
Subject Walog Recovery Killing Loggers
Date Mon, 30 Jul 2012 22:00:04 GMT

I'm having a problem recovering from an improper shutdown of a tablet server.


Originally, the tablet server was warning me that it was low on memory, so I wanted to update
its memory settings and restart it. Before I changed anything, the machine ran the tablet
server and logger fine -- it warned about low memory but still operated.
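For concreteness, the edit was of roughly this shape in conf/accumulo-env.sh. The heap values
shown here are placeholders rather than my actual settings, and ACCUMULO_LOGGER_OPTS is the
variable I believe controls the logger's heap in this version:

```shell
# conf/accumulo-env.sh -- sketch of the kind of memory change involved.
# The -Xmx/-Xms values below are placeholders, not the actual cluster settings.
test -z "$ACCUMULO_TSERVER_OPTS" && export ACCUMULO_TSERVER_OPTS="-Xmx1g -Xms1g"
test -z "$ACCUMULO_LOGGER_OPTS"  && export ACCUMULO_LOGGER_OPTS="-Xmx512m -Xms256m"
```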

After editing the configuration, I called the stop script on the machine the server was running
on, which stopped the tablet server and logger processes. Calling the start script, however,
did not do anything, and calling it on the master started the tablet server and logger
processes, but the server still appeared offline on the monitoring webpage. Eventually,
I undid the configuration changes I had made, and I was able to start the processes by manually
killing the tablet server and logger processes and calling the start script again -- but then
a new problem appeared.

The tablet server was now online, but its walog still needed to be recovered. When the
recovery began, it started the copy/sort process on the walogs of not only the server that
had been offline but also of another server (which, I have discovered, has a walog with the
same contents as the walog of the previously offline tablet server but a different name). As
soon as the recovery process starts, the loggers of the two servers go offline, and the
recovery lingers without advancing until the master gives up once the maximum recovery time
is reached. When the loggers are offline, I can bring them back online by calling the start
script on the master, but that does not affect the progress of any recovery already in
flight, and they go offline again as soon as the next recovery is attempted.

The log files seem to reveal the core error: the logger's log says that an OutOfMemoryError
(Java heap space) occurred and that the logger's pid was killed. This raises the question: how
could the server have had enough room for these processes before but not now? A monitoring
service (Ganglia) shows that one server still has 1 MB of memory free and the other 7 MB at
the moment the tablet server and logger are killed during recovery.

Is the solution to allocate more heap space to the JVM, to change the Accumulo memory
configuration, or something else? All of the machines in our cluster run CentOS 6.2, some
x86 and some x86_64. The two servers in question are x86_64, and there are other x86_64
machines with the same configuration as these two that have shown no problems.
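One way I have been trying to reason about this is to add up everything committed on a node
against its physical RAM; with only 1-7 MB free, any extra allocation during the copy/sort
would push a process over the edge. The numbers below are purely illustrative assumptions,
not our actual settings:

```python
# Sketch: does the memory committed on one node fit in its physical RAM?
# Every value here is an illustrative assumption, not an actual cluster setting.

MB = 1  # work in megabytes throughout

physical_ram = 4096 * MB        # assumed RAM on one x86_64 node
os_and_other = 512 * MB         # OS, monitoring, misc. daemons (assumption)

committed_by_process = {
    "tserver_heap": 1024 * MB,  # hypothetical -Xmx for the tablet server
    "logger_heap": 1024 * MB,   # hypothetical -Xmx for the logger
    "native_maps": 1024 * MB,   # off-heap in-memory map budget (assumption)
    "hdfs_datanode": 1024 * MB, # HDFS datanode heap (assumption)
}

committed = os_and_other + sum(committed_by_process.values())
headroom = physical_ram - committed

print(f"committed: {committed} MB, headroom: {headroom} MB")
if headroom < 0:
    print("overcommitted: OutOfMemoryError (or the kernel OOM killer) likely under load")
```

With these made-up numbers the node is overcommitted by 512 MB, which would only show up
once recovery makes every process touch its full allocation at the same time.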

Thanks for working to help me understand this,

Patrick Lynch
