accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Fwd: why compaction failure on one table brings other tables offline, how to recover
Date Mon, 11 Apr 2016 23:00:30 GMT
I just meant that if there is a problem loading one tablet, other tablets
may stay in an offline state indefinitely due to ACCUMULO-4160, regardless
of how it got to that point.

On Mon, Apr 11, 2016 at 6:35 PM Josh Elser <josh.elser@gmail.com> wrote:

> Do you mean that after an OOME, the tserver process didn't die and got
> into this bad state with a permanently offline tablet?
>
> Christopher wrote:
> > You might be seeing https://issues.apache.org/jira/browse/ACCUMULO-4160
> >
> > On Mon, Apr 11, 2016 at 5:52 PM Jayesh Patel <jpatel@keywcorp.com> wrote:
> >
> >
> >     There really aren't a lot of log messages that can explain why
> >     tablets for other tables went offline except the following:
> >
> >     2016-04-11 13:32:18,258
> >     [tserver.TabletServerResourceManager$AssignmentWatcher] WARN :
> >     tserver:instance-accumulo-3 Assignment for 2<< has been running for
> >     at least 973455566ms
> >     java.lang.Exception: Assignment of 2<<
> >          at sun.misc.Unsafe.park(Native Method)
> >          at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> >          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
> >          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
> >          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
> >          at java.util.concurrent.locks.ReentrantLock$FairSync.lock(Unknown Source)
> >          at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
> >          at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2230)
> >          at org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:252)
> >          at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2150)
> >          at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> >          at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
> >          at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
> >          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> >          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >          at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> >          at java.lang.Thread.run(Unknown Source)
> >
> >     Table 2<< here doesn't have the issue with minc failing and so
> >     shouldn't be offline.  These messages happened on a restart of a
> >     tserver, if that offers any clues.  All the nodes were rebooted at
> >     that time due to a power failure.  I'm assuming that its tablet
> >     went offline soon after this message first appeared in the logs.
> >
> >     Another tidbit of note is that Accumulo operates for hours/days
> >     without taking the tablets offline even though minc is failing; it's
> >     the crash of a tserver due to an OutOfMemory situation in one case
> >     that seems to have taken the tablet offline.  Is it safe to assume
> >     that other tservers are not able to pick up the tablets that are
> >     failing minc from a crashed tserver?
> >
> >     -----Original Message-----
> >     From: Josh Elser [mailto:josh.elser@gmail.com]
> >     Sent: Friday, April 08, 2016 10:52 AM
> >     To: user@accumulo.apache.org
> >     Subject: Re: Fwd: why compaction failure on one table brings other
> >     tables offline, how to recover
> >
> >
> >
> >     Billie Rinaldi wrote:
> >      > *From:* Jayesh Patel
> >      > *Sent:* Thursday, April 07, 2016 4:36 PM
> >      > *To:* user@accumulo.apache.org
> >      > *Subject:* RE: why compaction failure on one table brings other
> >      > tables offline, how to recover
> >      >
> >      > I have a 3 node Accumulo 1.7 cluster with a few small tables (a
> >      > few MB in size at most).
> >      >
> >      > I had one of those tables fail minc because I had configured a
> >      > SummingCombiner with FIXEDLEN but had smaller values:
> >      >
> >      > MinC failed (trying to convert to long, but byte array isn't long
> >      > enough, wanted 8 found 1) to create
> >      > hdfs://instance-accumulo:8020/accumulo/tables/1/default_tablet/F0002bcs.rf_tmp
> >      > retrying ...
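> >      >
> >      > (The failure is easy to reproduce against the 1.7 encoder
> >      > directly; a minimal sketch, not our actual code:)
> >      >
> >      >     import org.apache.accumulo.core.iterators.LongCombiner;
> >      >
> >      >     // FixedLenEncoder insists on exactly 8 bytes per value, so a
> >      >     // 1-byte value throws the "wanted 8 found 1" error above
> >      >     new LongCombiner.FixedLenEncoder().decode(new byte[] {1});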
> >      >
> >      > I have learned since to set the 'lossy' parameter to true to
> >      > avoid this.  *Why is the default value for it false* if it can
> >      > cause the kind of catastrophic failure you'll read about ahead?
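> >      >
> >      > (Setting it looks roughly like the sketch below; the iterator
> >      > name and priority are made up:)
> >      >
> >      >     import org.apache.accumulo.core.client.IteratorSetting;
> >      >     import org.apache.accumulo.core.iterators.user.SummingCombiner;
> >      >
> >      >     IteratorSetting is = new IteratorSetting(10, "sum", SummingCombiner.class);
> >      >     // lossy=true: skip values the encoder can't decode instead
> >      >     // of failing the minor compaction
> >      >     SummingCombiner.setLossyness(is, true);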
> >
> >     I'm pretty sure I told you this on StackOverflow, but if you're not
> >     writing 8-byte long values, don't use FIXEDLEN. Use VARLEN instead.
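> >
> >     Configuring that looks roughly like this (a sketch against the 1.7
> >     client API; the table name, column family, and priority are
> >     placeholders):
> >
> >         import java.util.Collections;
> >         import org.apache.accumulo.core.client.Connector;
> >         import org.apache.accumulo.core.client.IteratorSetting;
> >         import org.apache.accumulo.core.iterators.LongCombiner;
> >         import org.apache.accumulo.core.iterators.user.SummingCombiner;
> >
> >         // 'connector' is assumed to be an existing Connector for your instance
> >         IteratorSetting is = new IteratorSetting(10, "sum", SummingCombiner.class);
> >         // VARLEN tolerates values narrower than 8 bytes
> >         SummingCombiner.setEncodingType(is, LongCombiner.Type.VARLEN);
> >         SummingCombiner.setColumns(is,
> >             Collections.singletonList(new IteratorSetting.Column("counts")));
> >         connector.tableOperations().attachIterator("mytable", is);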
> >
> >      > However, this brought the tablets for other tables offline
> >      > without any apparent errors or warnings. *Can someone please
> >      > explain why?*
> >
> >     Can you provide logs? We are not wizards :)
> >
> >      > In order to recover from this, I did a 'droptable' from the shell
> >      > on the affected tables, but they all got stuck in the 'DELETING'
> >      > state.  I was able to finally delete them using the zkcli 'rmr'
> >      > command. *Is there a better way?*
> >
> >     Again, not sure why they would have gotten stuck in the deleting
> >     phase without more logs/context (nor how far along in the deletion
> >     process they got). It's possible that there were still entries in
> >     the accumulo.metadata table.
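> >
> >     You could check for leftovers with something like this sketch
> >     (table id "1" comes from the path in your minc error; adjust it to
> >     the real id):
> >
> >         import java.util.Map;
> >         import org.apache.accumulo.core.client.Scanner;
> >         import org.apache.accumulo.core.data.Key;
> >         import org.apache.accumulo.core.data.Range;
> >         import org.apache.accumulo.core.data.Value;
> >         import org.apache.accumulo.core.security.Authorizations;
> >
> >         Scanner s = connector.createScanner("accumulo.metadata", Authorizations.EMPTY);
> >         // tablet rows for table id "1" sort between "1;" and the
> >         // default-tablet row "1<"
> >         s.setRange(new Range("1;", true, "1<", true));
> >         for (Map.Entry<Key,Value> e : s)
> >             System.out.println(e.getKey() + " -> " + e.getValue());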
> >
> >      > I'm assuming there is a more proper way because when I created
> >      > the tables again (with the same name), they went back to having
> >      > a single offline tablet right away. *Is this because there are
> >      > "traces" of the old table left behind that affect the new table
> >      > even though the new table has a different table id?*  I ended up
> >      > wiping out hdfs and recreating the accumulo instance.
> >
> >     Accumulo uses monotonically increasing IDs to identify tables. The
> >     human-readable names are only there for your benefit. Creating a
> >     table with the same name would not cause a problem. It sounds like
> >     you got the metadata table in a bad state or have tabletservers in a
> >     bad state (if you haven't restarted them).
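> >
> >     (You can confirm a recreated table got a fresh id from the client
> >     API; a quick sketch:)
> >
> >         import java.util.Map;
> >
> >         // maps table name -> table id; a recreated table shows a new id
> >         Map<String,String> ids = connector.tableOperations().tableIdMap();
> >         System.out.println(ids);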
> >
> >      > It seems that a small bug, writing a 1-byte value instead of 8
> >      > bytes, caused us to dump the whole Accumulo instance.  Luckily
> >      > the data wasn't that important, but this whole episode makes us
> >      > wonder why doing things the right way (assuming there is a right
> >      > way) wasn't obvious, or whether Accumulo is just very fragile.
> >      >
> >
> >     Causing Accumulo to be unable to flush data from memory to disk in
> >     a minor compaction is a very bad idea, and one that we cannot
> >     automatically recover from because of the combiner configuration
> >     you set.
> >
> >     If you can provide logs and stack traces from the Accumulo services,
> >     we can try to help you further. This is not normal. If you don't
> >     believe me, take a look at the distributed tests we run each release
> >     where we write hundreds of gigabytes of data across many servers
> >     while randomly killing Accumulo processes.
> >
> >      >
> >      > Please ask any questions/clarifications you might have. We'll
> >      > appreciate any input so we can make educated decisions about
> >      > using Accumulo going forward.
> >      >
> >      > Thank you,
> >      > Jayesh
