accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Fwd: why compaction failure on one table brings other tables offline, how to recover
Date Mon, 11 Apr 2016 23:48:34 GMT
Sorry, I meant that to Jayesh, not to you, Christopher :)

Christopher wrote:
> I just meant that if there is a problem loading one tablet, other
> tablets may stay indefinitely in an offline state due to ACCUMULO-4160,
> however it got to that point.
>
> On Mon, Apr 11, 2016 at 6:35 PM Josh Elser <josh.elser@gmail.com> wrote:
>
>     Do you mean that after an OOME, the tserver process didn't die and got
>     into this bad state with a permanently offline tablet?
>
>     Christopher wrote:
>      > You might be seeing
>     https://issues.apache.org/jira/browse/ACCUMULO-4160
>      >
>      > On Mon, Apr 11, 2016 at 5:52 PM Jayesh Patel <jpatel@keywcorp.com> wrote:
>      >
>      >     There really aren't a lot of log messages that can explain why
>      >     tablets for other tables went offline except the following:
>      >
>      >     2016-04-11 13:32:18,258
>      >     [tserver.TabletServerResourceManager$AssignmentWatcher] WARN :
>      >     tserver:instance-accumulo-3 Assignment for 2<< has been running
>      >     for at least 973455566ms
>      >     java.lang.Exception: Assignment of 2<<
>      >          at sun.misc.Unsafe.park(Native Method)
>      >          at java.util.concurrent.locks.LockSupport.park(Unknown Source)
>      >          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
>      >          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
>      >          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
>      >          at java.util.concurrent.locks.ReentrantLock$FairSync.lock(Unknown Source)
>      >          at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
>      >          at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2230)
>      >          at org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:252)
>      >          at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2150)
>      >          at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>      >          at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>      >          at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>      >          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>      >          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>      >          at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>      >          at java.lang.Thread.run(Unknown Source)
>      >
>      >     Table 2<< here doesn't have the issue with minc failing and so
>      >     shouldn't be offline.  These messages happened on a restart of a
>      >     tserver, if that offers any clues.  All the nodes were rebooted at
>      >     that time due to a power failure.  I'm assuming that its tablet
>      >     went offline soon after this message first appeared in the logs.
>      >
>      >     Another tidbit of note is that Accumulo operates for hours/days
>      >     without taking the tablets offline even though minc is failing;
>      >     it was the crash of a tserver due to an OutOfMemory situation in
>      >     one case that seems to have taken the tablet offline.  Is it safe
>      >     to assume that other tservers are not able to pick up the tablets
>      >     that are failing minc from a crashed tserver?
>      >
>      >     -----Original Message-----
>      >     From: Josh Elser [mailto:josh.elser@gmail.com]
>      >     Sent: Friday, April 08, 2016 10:52 AM
>      >     To: user@accumulo.apache.org
>      >     Subject: Re: Fwd: why compaction failure on one table brings other
>      >     tables offline, how to recover
>      >
>      >
>      >
>      >     Billie Rinaldi wrote:
>      > > *From:* Jayesh Patel
>      > > *Sent:* Thursday, April 07, 2016 4:36 PM
>      > > *To:* 'user@accumulo.apache.org' <user@accumulo.apache.org>
>      > > *Subject:* RE: why compaction failure on one table brings other
>      > > tables offline, how to recover
>      > >
>      > > I have a 3 node Accumulo 1.7 cluster with a few small tables (a
>      > > few MB in size at most).
>      > >
>      > > I had one of those tables fail minc because I had configured a
>      > > SummingCombiner with FIXEDLEN but had written smaller values:
>      > >
>      > > MinC failed (trying to convert to long, but byte array isn't long
>      > > enough, wanted 8 found 1) to create
>      > > hdfs://instance-accumulo:8020/accumulo/tables/1/default_tablet/F0002bcs.rf_tmp
>      > > retrying ...
>      > >
>      > > I have learned since to set the 'lossy' parameter to true to
>      > > avoid this.  *Why is the default value for it false* if it can
>      > > cause the catastrophic failure that you'll read about ahead?
>      >
>      >     I'm pretty sure I told you this on StackOverflow, but if you're
>      >     not writing 8-byte long values, don't use FIXEDLEN. Use VARLEN
>      >     instead.
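>      >
>      >     For reference, a minimal sketch of wiring that up from Java (the
>      >     priority, iterator name, table name, and column family below are
>      >     made-up examples, and "connector" is assumed to be an existing
>      >     Connector):
>      >
>      >     import java.util.Collections;
>      >     import org.apache.accumulo.core.client.IteratorSetting;
>      >     import org.apache.accumulo.core.iterators.LongCombiner;
>      >     import org.apache.accumulo.core.iterators.user.SummingCombiner;
>      >
>      >     IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
>      >     // VARLEN decodes values of any length; FIXEDLEN insists on exactly 8 bytes.
>      >     SummingCombiner.setEncodingType(setting, LongCombiner.Type.VARLEN);
>      >     // lossy=true makes the combiner skip undecodable values instead of failing minc.
>      >     SummingCombiner.setLossiness(setting, true);
>      >     SummingCombiner.setColumns(setting,
>      >         Collections.singletonList(new IteratorSetting.Column("counts")));
>      >     connector.tableOperations().attachIterator("mytable", setting);
>      >
>      >     And if you do stick with FIXEDLEN, encode what you write so it is
>      >     a full 8 bytes, e.g. new Value(LongCombiner.FIXED_LEN_ENCODER.encode(1L)).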
>      >
>      > > However, this brought the tablets for other tables offline
>      > > without any apparent errors or warnings. *Can someone please
>      > > explain why?*
>      >
>      >     Can you provide logs? We are not wizards :)
>      >
>      > > In order to recover from this, I did a 'droptable' from the shell
>      > > on the affected tables, but they all got stuck in the 'DELETING'
>      > > state.  I was finally able to delete them using the zkcli 'rmr'
>      > > command.  *Is there a better way?*
>      >
>      >     Again, not sure why they would have gotten stuck in the deleting
>      >     phase without more logs/context (nor how far along in the
>     deletion
>      >     process they got). It's possible that there were still entries in
>      >     the accumulo.metadata table.
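>      >
>      >     If it happens again, one way to check for leftovers is to scan
>      >     the tablet rows for the old table id directly. A rough sketch
>      >     (assumes an existing Connector "connector", and that the dead
>      >     table's id was "1"; tablet rows sort between "<id>;" and "<id><"):
>      >
>      >     import java.util.Map.Entry;
>      >     import org.apache.accumulo.core.client.Scanner;
>      >     import org.apache.accumulo.core.data.Key;
>      >     import org.apache.accumulo.core.data.Range;
>      >     import org.apache.accumulo.core.data.Value;
>      >     import org.apache.accumulo.core.security.Authorizations;
>      >
>      >     String tableId = "1";
>      >     Scanner scanner = connector.createScanner("accumulo.metadata", Authorizations.EMPTY);
>      >     // Tablets of a table: rows "<id>;<endRow>" plus "<id><" for the default tablet
>      >     scanner.setRange(new Range(tableId + ";", true, tableId + "<", true));
>      >     for (Entry<Key,Value> entry : scanner)
>      >       System.out.println(entry.getKey() + " -> " + entry.getValue());
>      >
>      >     A fully deleted table should print nothing here.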
>      >
>      > > I’m assuming there is a more proper way because when I created the
>      > > tables again (with the same name), they went back to having a
>     single
>      > > offline tablet right away. *Is this because there are “traces”
>     of the
>      > > old table left behind that affect the new table even though the new
>      > > table has a different table id?*  I ended up wiping out hdfs and
>      > > recreating the accumulo instance.
>      >
>      >     Accumulo uses monotonically increasing IDs to identify
>     tables. The
>      >     human-readable names are only there for your benefit. Creating a
>      >     table with the same name would not cause a problem. It sounds
>     like
>      >     you got the metadata table in a bad state or have
>     tabletservers in a
>      >     bad state (if you haven't restarted them).
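>      >
>      >     An easy way to confirm that the recreated table really got a
>      >     fresh id (a sketch, again assuming an existing Connector):
>      >
>      >     import java.util.Map;
>      >
>      >     // Maps table names to internal ids; a recreated table should
>      >     // show a newer, higher id than the one that was deleted.
>      >     Map<String,String> ids = connector.tableOperations().tableIdMap();
>      >     System.out.println(ids.get("mytable"));  // hypothetical table name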
>      >
>      > > It seems that a small bug, writing a 1-byte value instead of 8
>      > > bytes, caused us to dump the whole accumulo instance.  Luckily the
>      > > data wasn't that important, but this whole episode makes us wonder
>      > > why doing things the right way (assuming there is a right way)
>      > > wasn't obvious, or if Accumulo is just very fragile.
>      > >
>      >
>      >     Causing Accumulo to be unable to flush data from memory to disk
>      >     in a minor compaction is a very bad idea, and one that we cannot
>      >     automatically recover from because of the combiner configuration
>      >     you set.
>      >
>      >     If you can provide logs and stack traces from the Accumulo
>     services,
>      >     we can try to help you further. This is not normal. If you don't
>      >     believe me, take a look at the distributed tests we run each
>     release
>      >     where we write hundreds of gigabytes of data across many servers
>      >     while randomly killing Accumulo processes.
>      >
>      > >
>      > > Please ask any questions or request any clarification you might
>      > > need.  We'll appreciate any input you might have so we can make
>      > > educated decisions about using Accumulo going forward.
>      > >
>      > > Thank you,
>      > >
>      > > Jayesh
>      > >
>      > >
>      >
>
