accumulo-user mailing list archives

From: Josh Elser <josh.el...@gmail.com>
Subject: Re: Fwd: why compaction failure on one table brings other tables offline, how to recover
Date: Mon, 11 Apr 2016 22:35:19 GMT
Do you mean that after an OOME, the tserver process didn't die and got 
into this bad state with a permanently offline tablet?
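
Worth checking, too: the example accumulo-env.sh wires the tserver up to 
die outright on OOME via the JVM's OnOutOfMemoryError hook 
(ACCUMULO_KILL_CMD, which defaults to 'kill -9 %p', if I remember right). 
If that got dropped from your config, it would explain a half-alive 
tserver holding tablets offline.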

Christopher wrote:
> You might be seeing https://issues.apache.org/jira/browse/ACCUMULO-4160
>
> On Mon, Apr 11, 2016 at 5:52 PM Jayesh Patel <jpatel@keywcorp.com> wrote:
>
>     There really aren't a lot of log messages that can explain why
>     tablets for other tables went offline except the following:
>
>     2016-04-11 13:32:18,258 [tserver.TabletServerResourceManager$AssignmentWatcher] WARN : tserver:instance-accumulo-3 Assignment for 2<< has been running for at least 973455566ms
>     java.lang.Exception: Assignment of 2<<
>          at sun.misc.Unsafe.park(Native Method)
>          at java.util.concurrent.locks.LockSupport.park(Unknown Source)
>          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
>          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
>          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
>          at java.util.concurrent.locks.ReentrantLock$FairSync.lock(Unknown Source)
>          at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
>          at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2230)
>          at org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:252)
>          at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2150)
>          at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>          at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>          at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>          at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>          at java.lang.Thread.run(Unknown Source)
>
>     Table 2<< here doesn't have the issue with minc failing and so
>     shouldn't be offline.  These messages happened on a restart of a
>     tserver, if that offers any clues.  All the nodes were rebooted at
>     that time due to a power failure.  I'm assuming that its tablet went
>     offline soon after this message first appeared in the logs.
>
>     Another tidbit of note: Accumulo operates for hours/days without
>     taking the tablets offline even though minc is failing; it's the
>     crash of a tserver due to an OutOfMemory situation that, in one
>     case, seems to have taken the tablet offline.  Is it safe to assume
>     that other tservers are not able to pick up the tablets that are
>     failing minc from a crashed tserver?
>
>     -----Original Message-----
>     From: Josh Elser [mailto:josh.elser@gmail.com]
>     Sent: Friday, April 08, 2016 10:52 AM
>     To: user@accumulo.apache.org
>     Subject: Re: Fwd: why compaction failure on one table brings other
>     tables offline, how to recover
>
>
>
>     Billie Rinaldi wrote:
>      > *From:* Jayesh Patel
>      > *Sent:* Thursday, April 07, 2016 4:36 PM
>      > *To:* user@accumulo.apache.org
>      > *Subject:* RE: why compaction failure on one table brings other
>      > tables offline, how to recover
>      >
>      > I have a 3 node Accumulo 1.7 cluster with a few small tables (a
>      > few MB in size at most).
>      >
>      > I had one of those tables fail minc because I had configured a
>      > SummingCombiner with FIXEDLEN but had smaller values:
>      >
>      > MinC failed (trying to convert to long, but byte array isn't long
>      > enough, wanted 8 found 1) to create
>      > hdfs://instance-accumulo:8020/accumulo/tables/1/default_tablet/F0002bcs.rf_tmp
>      > retrying ...
>      >
>      > I have learned since to set the ‘lossy’ parameter to true to
>      > avoid this.  *Why is the default value for it false* if it can
>      > cause the kind of catastrophic failure described below?
>
>     I'm pretty sure I told you this on StackOverflow, but if you're not
>     writing 8-byte long values, don't use FIXEDLEN. Use VARLEN instead.
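>
>     For what it's worth, here's a minimal sketch of wiring that up
>     through the Java API (the instance/credentials, the table "mytable",
>     and the "count" column family are all placeholders):
>
>         import java.util.Collections;
>         import org.apache.accumulo.core.client.Connector;
>         import org.apache.accumulo.core.client.IteratorSetting;
>         import org.apache.accumulo.core.client.ZooKeeperInstance;
>         import org.apache.accumulo.core.client.security.tokens.PasswordToken;
>         import org.apache.accumulo.core.iterators.LongCombiner;
>         import org.apache.accumulo.core.iterators.user.SummingCombiner;
>
>         // Placeholder instance name, zookeepers, and credentials.
>         Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
>             .getConnector("root", new PasswordToken("secret"));
>
>         IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
>         // Only combine the "count" column family (a placeholder).
>         SummingCombiner.setColumns(setting,
>             Collections.singletonList(new IteratorSetting.Column("count")));
>         // VARLEN tolerates values that aren't exactly 8 bytes.
>         SummingCombiner.setEncodingType(setting, LongCombiner.Type.VARLEN);
>         // If you do keep FIXEDLEN, lossy=true drops undecodable values
>         // instead of failing the minor compaction:
>         // SummingCombiner.setLossiness(setting, true);
>         conn.tableOperations().attachIterator("mytable", setting);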
>
>      > However, this brought the tablets for other tables offline
>      > without any apparent errors or warnings.  *Can someone please
>      > explain why?*
>
>     Can you provide logs? We are not wizards :)
>
>      > In order to recover from this, I did a ‘droptable’ from the shell
>      > on the affected tables, but they all got stuck in the ‘DELETING’
>      > state.  I was able to finally delete them using the zkcli ‘rmr’
>      > command.  *Is there a better way?*
>
>     Again, not sure why they would have gotten stuck in the deleting
>     phase without more logs/context (nor how far along in the deletion
>     process they got). It's possible that there were still entries in
>     the accumulo.metadata table.
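>
>     If you hit this again, a quick way to check is to scan the metadata
>     rows for the dead table's id (sketch; I'm using the id "1" from your
>     earlier log message, and the "conn" Connector from the sketch above;
>     substitute the real id). Rows for table id "1" sort between "1;" and
>     "1<":
>
>         import java.util.Map;
>         import org.apache.accumulo.core.client.Scanner;
>         import org.apache.accumulo.core.data.Key;
>         import org.apache.accumulo.core.data.Range;
>         import org.apache.accumulo.core.data.Value;
>         import org.apache.accumulo.core.security.Authorizations;
>
>         // Any entries printed here mean the delete never finished.
>         Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
>         scanner.setRange(new Range("1;", "1<"));
>         for (Map.Entry<Key,Value> entry : scanner) {
>             System.out.println(entry.getKey() + " -> " + entry.getValue());
>         }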
>
>      > I’m assuming there is a more proper way because when I created the
>      > tables again (with the same name), they went back to having a single
>      > offline tablet right away. *Is this because there are “traces” of the
>      > old table left behind that affect the new table even though the new
>      > table has a different table id?*  I ended up wiping out hdfs and
>      > recreating the accumulo instance.
>
>     Accumulo uses monotonically increasing IDs to identify tables. The
>     human-readable names are only there for your benefit. Creating a
>     table with the same name would not cause a problem. It sounds like
>     you got the metadata table in a bad state or have tabletservers in a
>     bad state (if you haven't restarted them).
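>
>     You can see the current name-to-id mapping for yourself (sketch,
>     again assuming a Connector "conn"); a dropped-and-recreated table
>     will show up under a new id:
>
>         import java.util.Map;
>
>         // Prints each table's human-readable name and its internal id.
>         for (Map.Entry<String,String> e : conn.tableOperations().tableIdMap().entrySet()) {
>             System.out.println(e.getKey() + " -> " + e.getValue());
>         }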
>
>      > It seems that a small bug, writing a 1-byte value instead of 8
>      > bytes, caused us to dump the whole accumulo instance.  Luckily the
>      > data wasn’t that important, but this whole episode makes us wonder
>      > why doing things the right way (assuming there is a right way)
>      > wasn’t obvious, or if Accumulo is just very fragile.
>      >
>
>     Causing Accumulo to be unable to flush data from memory to disk in a
>     minor compaction is a very bad idea, and one that we cannot
>     automatically recover from because of the combiner configuration you
>     set.
>
>     If you can provide logs and stack traces from the Accumulo services,
>     we can try to help you further. This is not normal. If you don't
>     believe me, take a look at the distributed tests we run each release
>     where we write hundreds of gigabytes of data across many servers
>     while randomly killing Accumulo processes.
>
>      >
>      > Please ask any questions or request any clarification you might
>      > have.  We’ll appreciate any input so we can make educated
>      > decisions about using Accumulo going forward.
>      >
>      > Thank you,
>      >
>      > Jayesh
>
