accumulo-notifications mailing list archives

From "Ivan Bella (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
Date Thu, 11 Jan 2018 18:53:00 GMT


Ivan Bella commented on ACCUMULO-4777:

After several days of getting my head around this code, I think I have figured it out.  There is
an AtomicInteger used as a sequence counter in the TabletServerLogger.  When this sequence
counter wraps (goes negative), an exception is thrown.  However, the write method where
it is thrown will subsequently close the current WAL, open a new one, and recursively
call itself via the defineTablet method.  The underlying call fails for the same reason,
closes the WAL, recursively calls itself again, and so on indefinitely.
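For illustration only, here is a minimal, self-contained sketch of that failure mode; the class and method names are made up and this is not the actual TabletServerLogger code:

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical reduction of the bug: once the counter goes negative,
    // every retry re-increments it, stays negative, and recurses again.
    class SequenceWrapDemo {
        static final AtomicInteger seq = new AtomicInteger(Integer.MAX_VALUE);

        static int nextSeq() {
            int s = seq.incrementAndGet();
            if (s < 0) {
                throw new IllegalStateException("sequence wrapped: " + s);
            }
            return s;
        }

        static void write() {
            try {
                nextSeq();
            } catch (IllegalStateException e) {
                // stand-in for: close the current WAL, open a new one,
                // then defineTablet recursively calls write() again
                write();
            }
        }

        public static void main(String[] args) {
            write();  // recurses until the stack (or file handles) run out
        }
    }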

So basically we have tablet servers that have been up long enough to actually incur over 2^31
writes to the WALs.  Once this happens, the server goes into this loop.  I am guessing
that not many systems leave their tablet servers up long enough for this to happen.  Also, this
is happening for us on tservers to which only the accumulo.metadata table is pinned (via the
HostRegexTableLoadBalancer), hence it is actually more likely to happen first on those tservers.
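For a rough sense of scale (the write rate here is an assumption, not a measurement), the uptime needed to wrap is easy to estimate:

    // Back-of-envelope estimate with an assumed sustained WAL write rate.
    public class WrapEstimate {
        public static void main(String[] args) {
            long writesToWrap = 1L << 31;      // increments before the counter goes negative
            long writesPerSecond = 5_000;      // assumed average rate; yours will differ
            double days = writesToWrap / (double) writesPerSecond / 86_400;
            System.out.printf("~%.1f days of uptime to wrap%n", days);  // ~5.0 days
        }
    }

At lower sustained rates the wrap takes proportionally longer, which fits the observation that only long-lived tservers hit it.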

As far as I can tell, every path to this write method basically ignores the sequence number
returned.  So what is the real purpose of this sequence generator?  I think I need the original
authors of this code to tell me.  My inclination is to simply reset the sequence generator
back to 0 and continue.  Any thoughts out there on this?
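If resetting turns out to be safe (which the original authors would need to confirm, given that callers appear to ignore the value), a wrap-safe increment is straightforward; this is just a sketch, not a patch against TabletServerLogger:

    import java.util.concurrent.atomic.AtomicInteger;

    // Like incrementAndGet(), but rolls over to 0 instead of going negative.
    class WrapSafeSequence {
        private final AtomicInteger seq = new AtomicInteger();

        int next() {
            return seq.updateAndGet(s -> s == Integer.MAX_VALUE ? 0 : s + 1);
        }
    }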

> Root tablet got spammed with 1.8 million log entries
> ----------------------------------------------------
>                 Key: ACCUMULO-4777
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.8.1
>            Reporter: Ivan Bella
>            Priority: Critical
>             Fix For: 1.8.2, 2.0.0
> We had a tserver that was handling accumulo.metadata tablets that somehow got into a
loop where it created over 22K empty WAL logs.  There were around 70 metadata tablets, and
this resulted in around 1.8 million log entries added to the accumulo.root table.  The
only reason it stopped creating WAL logs is that it ran out of open file handles.  This
took us many hours and cups of coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on so we have no debug messages.
> Tracking through the code, there are three places where the TabletServerLogger.close method
is called:
> 1) via resetLoggers in the TabletServerLogger, but nothing calls this method, so this
is ruled out;
> 2) when the log gets too large or too old, but neither of those checks should have been
triggering here;
> 3) in a loop (while (!success)) executed in the TabletServerLogger.write method.
 In this case, when we unsuccessfully write something to the WAL, that WAL is closed
and a new one is created.  This loop runs forever until we successfully write out the entry.
 A DfsLogger.LogClosedException seems the most logical trigger, most likely because
a ClosedChannelException was thrown from the DfsLogger.write methods (around line 609 in DfsLogger).
> So the root cause was most likely Hadoop related.  However, in Accumulo we probably should
not be doing a tight retry loop around a Hadoop failure.  I recommend at a minimum doing some
sort of exponential backoff, and perhaps setting a limit on the number of retries, after which
the tserver fails critically.
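A minimal sketch of that recommendation, with hypothetical names and limits rather than Accumulo's actual retry code: bound the while (!success) loop and back off exponentially between WAL re-creations.

    import java.io.IOException;

    // Hedged sketch: bounded retries plus exponential backoff around a WAL write.
    class BoundedWalRetry {
        static final int MAX_RETRIES = 10;        // hypothetical limit
        static final long BASE_BACKOFF_MS = 100;  // hypothetical base delay

        interface WalWrite {
            void run() throws IOException;        // stands in for the real write
        }

        static void writeWithBackoff(WalWrite write) throws IOException, InterruptedException {
            IOException last = null;
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                try {
                    write.run();
                    return;                       // success, stop retrying
                } catch (IOException e) {
                    last = e;                     // close/reopen of the WAL would go here
                    Thread.sleep(BASE_BACKOFF_MS << Math.min(attempt, 6));
                }
            }
            // repeated failure: surface a critical error instead of spinning forever
            throw new IOException("WAL write failed after " + MAX_RETRIES + " attempts", last);
        }
    }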
