accumulo-user mailing list archives

From Patrick Lynch <>
Subject Re: Walog Recovery Killing Loggers
Date Tue, 31 Jul 2012 17:09:49 GMT
That did the trick, thank you!

-----Original Message-----
From: Eric Newton <>
To: user <>
Sent: Mon, Jul 30, 2012 6:12 pm
Subject: Re: Walog Recovery Killing Loggers

The log recovery process uses more real memory than just logging does.

You can set logger.sort.buffer.size to something smaller than the default 200M to try to get
recovery to complete.
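
For reference, this property would normally go in conf/accumulo-site.xml. A minimal sketch, with an illustrative value (100M is an assumption, not a recommendation from this thread):

```xml
<!-- conf/accumulo-site.xml: shrink the recovery sort buffer below the
     200M default so log sorting fits in the logger's heap. -->
<property>
  <name>logger.sort.buffer.size</name>
  <value>100M</value>
</property>
```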

You may also want to use the following incantation to speed recovery and avoid timeouts:

 $ pkill -f =master
 $ hadoop fs -rm /accumulo/recovery/*.failed
 $ ./bin/

This should be run on the master.  The first step kills the master so it can't remember it
was in the recovery process.  The second removes markers that remember failures.  The third
brings the master back to start the recovery process.


On Mon, Jul 30, 2012 at 6:00 PM, Patrick Lynch <> wrote:
> I'm having a problem recovering from an improper shutdown of a tablet
> server.
> Originally, the tablet server was giving me warnings about being low on
> memory, so I wanted to update its memory settings and restart it. Before I
> did anything, the server worked fine running the tablet server and logger --
> it warned about low memory but still operated.
> After editing the configuration, I ran the stop script on the machine the
> server was running on, which stopped the tablet server and logger processes.
> Running the start script there, however, did not do anything, and running it
> on the master server started the tablet server and logger
> processes, but the server still appeared offline on the monitoring webpage.
> Eventually, I undid the configuration changes I had made and was able to
> start the processes by manually killing the tablet server and logger
> processes and running the start script again, but then a new problem arose.
> The tablet server was now online, but its walog still needed to be
> recovered. Then, when the recovery began, it started the copy/sort process
> on the walogs of not only the server that was offline but another server
> (which I have discovered has a walog with the same contents as the walog of
> the previously offline tablet server, but with a different name). As soon as
> the recovery process starts, the loggers of the two servers go offline, and
> the recovery process stalls without making progress until the master gives up
> once the maximum recovery time is reached. When the loggers are offline, I am
> able to bring them online again by running the start script on the master, but
> that does not affect the progress of any current recovery, and they go offline
> again once the next recovery is attempted.
> Log files seem to reveal the core error at hand: the logger's
> log says that an OutOfMemoryError (Java
> heap space) occurred and that the logger's pid was
> killed. This raises the question: how could the server have had enough room
> for these processes before but not now? A monitoring service (Ganglia)
> shows that one server still has 1 MB of memory free and the other 7 MB when
> the tablet server and logger processes are killed at the time of recovery.
> Is the solution to allocate more heap space to Java, to change the
> Accumulo memory configuration, or something else? All of the machines in our
> cluster run CentOS 6.2, some x86 and some x86_64. The two servers in
> question are x86_64, and there are other x86_64 machines with the
> same configuration as these two that have shown no problems.
> Thanks for working to help me understand this,
> Patrick Lynch
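
The "enough room before but not now" question above comes down to simple arithmetic: recovery's sort buffer is extra heap on top of what normal logging already uses. A minimal sketch of that reasoning in Python; only the 200M sort-buffer default comes from this thread, the other numbers are hypothetical:

```python
# Hypothetical heap figures; only the 200M sort-buffer default is from the thread.
logger_xmx_mb = 512        # assumed logger heap limit (-Xmx)
steady_logging_mb = 400    # assumed heap used by normal logging
sort_buffer_mb = 200       # default logger.sort.buffer.size

# Normal logging fits within the heap...
fits_normally = steady_logging_mb <= logger_xmx_mb

# ...but recovery adds the sort buffer on top, overflowing it.
fits_during_recovery = steady_logging_mb + sort_buffer_mb <= logger_xmx_mb

print(fits_normally, fits_during_recovery)  # True False
```

With numbers like these, the logger runs fine day to day and only dies with an OutOfMemoryError once a recovery sort starts, which is why shrinking logger.sort.buffer.size (or raising the logger's heap) resolves it.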

