accumulo-user mailing list archives

From Christopher <>
Subject Re: Determining the cause of a tablet server failure
Date Wed, 27 Feb 2013 22:46:22 GMT
I agree with John Vines.

Christopher L Tubbs II

On Wed, Feb 27, 2013 at 12:32 PM, John Vines <> wrote:
> I don't like the idea of blending manual logging with log4j in a single
> file. It's in the .err file already, I don't think anything else is
> necessary.
> On Wed, Feb 27, 2013 at 3:27 PM, Adam Fuchs <> wrote:
>> So, question for the community: inside bin/accumulo we have:
>>   -XX:OnOutOfMemoryError="kill -9 %p"
>> Should this also append a log message? Something like:
>>   -XX:OnOutOfMemoryError="kill -9 %p; echo 'ran out of memory' >> logfilename"
>> Is this necessary, or should the OutOfMemoryError still find its way
>> to the regular log?
>> Adam
>> On Wed, Feb 27, 2013 at 3:17 PM, Mike Hugo <> wrote:
>>> I'm chalking this up to a mis-configured server.  It looks like during
>>> the install on this server the file was copied from the
>>> examples, but rather than editing it to set the JAVA_HOME,
>>> HADOOP_HOME, and ZOOKEEPER_HOME, the entire file contents were replaced with
>>> those env variables.
>>> I'm assuming this caused us to pick up the default (?)  _OPTS settings
>>> rather than the correct ones we should have been getting based on our server
>>> memory capacity from the examples.  So we had a bunch of accumulo related
>>> java processes all running with memory settings that were way out of whack
>>> from what they should have been.
>>> To solve it I copied in the files from the conf/examples directory again
>>> and made sure everything was set up correctly and restarted everything.
>>> We never did see anything in our log files or .out / .err logs indicating
>>> the source of the problem, but the above is my best guess as to what was
>>> going on.
>>> Thanks again for all the tips and pointers!
>>> Mike
>>> On Wed, Feb 27, 2013 at 11:24 AM, Adam Fuchs <> wrote:
>>>> There are a few primary reasons why your tablet server would die:
>>>> 1. Lost lock in Zookeeper. If the tablet server and zookeeper can't
>>>> communicate with each other then the lock will time out and the tablet server
>>>> will kill itself. This should show up as several messages in the tserver
>>>> log. If this happens when a tablet server is really busy (lots of threads
>>>> doing stuff) then the log message about the lost lock can be pretty far back
>>>> in the queue. Java garbage collection can cause long pauses that inhibit
>>>> tserver/zookeeper messages. Zookeeper can also get overwhelmed and behave
>>>> poorly if the server it's running on swaps it out.
>>>> 2. Problems talking with the master. If a tablet server is too slow in
>>>> communicating with the master then the master will try to kill it. This
>>>> should show up in the master log, and also will be noted in the tserver log.
>>>> 3. Out of memory. If the tserver JVM runs out of memory it will
>>>> terminate. As John mentioned, this will be in the .err or .out files in the
>>>> log directory.
>>>> Adam
>>>> On Wed, Feb 27, 2013 at 12:10 PM, Mike Hugo <> wrote:
>>>>> After running an ingest process via map reduce for about an hour or so,
>>>>> one of our tservers fails.  It happens pretty consistently; we're able
>>>>> to replicate it without too much difficulty.
>>>>> I'm looking in the $ACCUMULO_HOME/logs directory for clues as to why
>>>>> the tserver fails, but I'm not seeing much that points to a cause of
>>>>> the tserver going offline.  One minute it's there, the next it's offline.
>>>>> There are some warnings about the swappiness as well as a large row that
>>>>> cannot be split, but other than that, not much else to go on.
>>>>> Is there anything that could help me figure out *why* the tserver died?
>>>>> I'm guessing it's something in our client code or a config that's not
>>>>> correct on the server, but it'd be really nice to have a hint before
>>>>> I start randomly changing things to see what will fix it.
>>>>> Thanks,
>>>>> Mike
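
The three causes Adam lists in the thread (lost ZooKeeper lock, trouble talking to the master, out of memory) can be looked for with a quick scan of the log directory. The following is a minimal sketch, not an Accumulo tool: `scan_logs` and `DEFAULT_PATTERNS` are hypothetical names, and the search patterns are guesses. Only `OutOfMemoryError` is a string the JVM is known to print; the other two should be adjusted to whatever your tserver and master logs actually say.

```python
import re
from pathlib import Path

# Hypothetical search patterns, one per cause named in the thread.
# Only "OutOfMemoryError" is a string the JVM is known to emit; the
# other two are assumptions and will likely need tuning.
DEFAULT_PATTERNS = {
    "lost_zookeeper_lock": re.compile(r"lost.*lock", re.IGNORECASE),
    "killed_by_master": re.compile(r"master.*(kill|halt)", re.IGNORECASE),
    "out_of_memory": re.compile(r"OutOfMemoryError"),
}


def scan_logs(log_dir, patterns=DEFAULT_PATTERNS):
    """Return {cause: [(file name, line number, line text), ...]}.

    Every regular file directly inside log_dir is read, so .log, .out
    and .err files are all covered.
    """
    hits = {cause: [] for cause in patterns}
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going past undecodable bytes.
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                for cause, pattern in patterns.items():
                    if pattern.search(line):
                        hits[cause].append((path.name, lineno, line.rstrip()))
    return hits
```

Calling `scan_logs` on the `$ACCUMULO_HOME/logs` directory returns the matching lines grouped by cause. As Adam notes, on a busy server the lost-lock message can sit far back in the log, so an empty result for one pattern does not rule that cause out.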

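Mike's diagnosis was an environment file that set only JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME, with no `_OPTS` memory settings. A check of that kind could be sketched as below. This is a rough illustration under stated assumptions: `assigned_variables` and `missing_opts` are hypothetical helpers, the file is treated as plain text rather than evaluated by a shell, and the test looks only for any variable whose name ends in `_OPTS`, not for particular names.

```python
import re

# Rough match for shell assignments such as `export FOO=bar` or `FOO=bar`.
# It is a text heuristic, not a shell parser, so it can also pick up
# names inside values (for example a -Dname=value JVM flag).
ASSIGNMENT = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)=")


def assigned_variables(env_text):
    """Return the set of names that appear on the left of an `=`."""
    names = set()
    for line in env_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip full-line comments
        names.update(ASSIGNMENT.findall(stripped))
    return names


def missing_opts(env_text):
    """Return True when no variable ending in _OPTS is assigned."""
    return not any(name.endswith("_OPTS") for name in assigned_variables(env_text))
```

Run against the contents of the environment file, `missing_opts` returning True would suggest the processes are starting with default memory settings, which matches the situation Mike describes.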