accumulo-user mailing list archives

From Patrick Lynch <patricklync...@aim.com>
Subject Walog Recovery Killing Loggers
Date Mon, 30 Jul 2012 22:00:04 GMT

I'm having a problem recovering from an improper shutdown of a tablet server.

 


Originally, the tablet server was giving me warnings about being low on memory, so I wanted
to update its memory settings and restart it. Before I made any changes, the server worked fine
running the tablet server and logger -- it warned about low memory but still operated.


After editing the configuration, I called stop-here.sh on the machine the server was running
on, which stopped the tablet server and logger processes. Calling start-here.sh, however,
did not do anything, and calling start-all.sh on the master started the tablet server
and logger processes, but the server still appeared offline on the monitoring webpage. Eventually,
I undid the configuration changes I had made and was able to get the server started by manually
killing the tablet server and logger processes and calling start-all.sh again, but then a new
problem arose.
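
For what it's worth, this is roughly what the manual cleanup looked like -- the pids are
placeholders, and jps ships with the JDK, so this is just an illustration of the steps:

    # list the Java processes on this node; the Accumulo ones show up as
    # org.apache.accumulo.start.Main with the role (tserver, logger, ...) as an argument
    jps -lm | grep -i accumulo

    # kill the stale tablet server and logger by pid on that node,
    # then rerun start-all.sh from the master
    kill <tserver-pid> <logger-pid>
    $ACCUMULO_HOME/bin/start-all.sh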


The tablet server was now online, but its walog still needed to be recovered. When the
recovery began, it started the copy/sort process on the walogs not only of the server that
had been offline but also of another server (which, I have discovered, has a walog with the same
contents as the walog of the previously offline tablet server but a different name). As soon as
the recovery process starts, the loggers on the two servers go offline, and the recovery lingers
without making progress until the master gives up when the maximum recovery time is reached.
When the loggers are offline, I can bring them back online by calling start-all.sh
on the master, but that does not affect the progress of any current recovery, and they go offline
again once the next recovery is attempted.


The log files seem to reveal the core error: the logger_server-address.in-addr.arpa.out
log says that a java.lang.OutOfMemoryError (Java heap space) occurred and that the logger's pid
was killed. This raises the question: how could the server have had enough room for these
processes before but not now? A monitoring service (Ganglia) shows that one server still
has 1 MB of memory free and the other 7 MB at the point during recovery when the tablet server
and logger are killed.
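
For reference, this is how I'm spotting the error in the .out files (assuming
$ACCUMULO_LOG_DIR is where my install writes them):

    # look for the heap-space error in the logger's .out files
    grep -i OutOfMemoryError $ACCUMULO_LOG_DIR/logger_*.out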


Is the solution to allocate more heap space to the Java processes, to change the Accumulo
memory configuration, or something else? All of the machines in our cluster run CentOS 6.2, some
x86 and some x86_64. The two servers in question are x86_64 machines, and there are other x86_64
machines with the same configuration as these two that have shown no problems.
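
If simply giving the logger JVM more heap is the answer, I'm assuming the knob is the
ACCUMULO_LOGGER_OPTS line in conf/accumulo-env.sh -- the sizes below are placeholders
rather than my exact values:

    # conf/accumulo-env.sh -- raise -Xmx on the logger (and maybe tserver) lines?
    test -z "$ACCUMULO_TSERVER_OPTS" && export ACCUMULO_TSERVER_OPTS="-Xmx1g -Xms1g"
    test -z "$ACCUMULO_LOGGER_OPTS"  && export ACCUMULO_LOGGER_OPTS="-Xmx768m -Xms256m"

Please correct me if the copy/sort during recovery is governed by a different setting entirely.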


Thanks for working to help me understand this,


Patrick Lynch


 
