accumulo-user mailing list archives

From: Eric Newton <eric.new...@gmail.com>
Subject: Re: Walog Recovery Killing Loggers
Date: Mon, 30 Jul 2012 22:11:40 GMT
The log recovery process uses more real memory than just logging does.

You can set logger.sort.buffer.size to something smaller than the default
200M to try to get recovery to complete.
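
For example, in conf/accumulo-site.xml (the 50M value below is only an
illustration; the key point is that the sort buffer has to fit inside the
logger's Java heap, which is set via ACCUMULO_LOGGER_OPTS in
conf/accumulo-env.sh):

  <property>
    <name>logger.sort.buffer.size</name>
    <value>50M</value>  <!-- example value only; must fit in the logger heap -->
  </property>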

You may also want to use the following incantation to speed recovery and
avoid timeouts:

 $ pkill -f =master
 $ hadoop fs -rm /accumulo/recovery/*.failed
 $ ./bin/start-here.sh

This should be run on the master.  The first step kills the master so it
can't remember it was in the recovery process.  The second removes markers
that remember failures.  The third brings the master back to start the
recovery process.
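
If you want to see what is there before removing anything, listing the
recovery directory will show the failure markers (this assumes the default
/accumulo directory from the paths above):

 $ hadoop fs -ls /accumulo/recovery    # *.failed entries mark failed recovery attempts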

-Eric

On Mon, Jul 30, 2012 at 6:00 PM, Patrick Lynch <patricklynch33@aim.com>
wrote:
> I'm having a problem recovering from an improper shutdown of a tablet
> server.
>
> Originally, the tablet server was giving me warnings about being low on
> memory, so I wanted to update its memory settings and restart it. Before I
> did anything, the server worked fine running the tablet server and logger --
> it warned about low memory but still operated.
>
> After editing the configuration, I called stop-here.sh on the machine the
> server was running on, which stopped the tablet server and logger processes.
> Calling start-here.sh, however, did not do anything, and calling
> start-all.sh on the master server started the tablet server and logger
> processes, but the server still appeared offline on the monitoring webpage.
> Eventually, I undid the configuration changes I had made and was able to
> start the processes by manually killing the tablet server and logger
> processes and calling start-all.sh again, but then a new problem arose.
>
> The tablet server was now online, but its walog still needed to be
> recovered. Then, when the recovery began, it started the copy/sort process
> on the walogs of not only the server that had been offline but also another
> server (which I have discovered has a walog with the same contents as the
> walog of the previously offline tablet server, but a different name). As
> soon as the recovery process starts, the loggers of the two servers go
> offline, and the recovery process lingers without advancing until the
> master gives up once the maximum recovery time is reached. When the
> loggers are offline, I am able to bring them online again by calling
> start-all.sh on the master, but it does not affect the progress of any
> current recovery, and they go offline again once the next recovery is
> attempted.
>
> Log files seem to reveal the core error at hand: the
> logger_server-address.in-addr.arpa.out log says that an OutOfMemory Java
> heap space error has occurred and that the pid of the logger has been
> killed. This raises the question: how could the server have had enough
> room for these processes before but not now? A monitoring service
> (ganglia) shows that one server still has 1 MB of memory free and the
> other 7 MB when the tablet server and logger processes are killed at the
> time of recovery.
>
> Is the solution to this to allocate more heap space to Java, to change the
> Accumulo memory configuration, or something else? All of the machines in
> our cluster run CentOS 6.2, some x86 and some x86_64. The two servers in
> question are x86_64 machines, and there are other x86_64 machines with the
> same configuration as these two that have shown no problems.
>
> Thanks for working to help me understand this,
>
> Patrick Lynch
>
