accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dylan Hutchison <dhutc...@cs.washington.edu>
Subject Re: Checking what a BatchWriter is stuck on; failure during split
Date Tue, 19 Apr 2016 22:22:20 GMT
I investigated a bit more and I am pretty sure the problem is that the
BatchWriter is not recognizing that the tablet vb<< split into vb;2436< and
vb<;2436.  It keeps trying to update the closed tablet vb<<.  Each update
writes 0 mutations and records a failure at the tablet server UpdateSession
because vb<< is closed.

I'm not sure why this is happening because the BatchWriter should have
invalidated its tablet locator cache upon recognizing a failure.  Then it
would recognize that the entries it wants to write fall into the new
tablets vb;2436< and vb<;2436.  I think there is a timing bug for this edge
case, when a table split occurs during heavy writes.

I will write this up if I can reproduce it.  Maybe it is too rare to matter.

Cheers, Dylan

On Mon, Apr 18, 2016 at 2:38 PM, Dylan Hutchison <dhutchis@cs.washington.edu
> wrote:

> Hi devs,
>
> I'd like to ask your help in figuring out what is happening to a
> BatchWriter.  The following gives my reasoning so far.
>
> In Accumulo 1.7.1, I have a BatchWriter that is stuck in WAITING status in
> its addMutation method.  I saw that it is stuck by jstack'ing the Accumulo
> client.  It's been stuck like this for 16 hours.
>
> The BatchWriter is supposed to wait when a mutation is added if no
> failures have recorded and either (a) the total memory used exceeds the
> maximum allowed for the BatchWriter, or (b) the batchwriter is currently
> flushed.  So we conclude that one of (a) or (b) have occurred and no
> failures were recorded, at the time when addMutation was called.  I think
> (a) is likely.
>
> The BatchWriter is supposed to notify itself when either (1) a flush
> finishes, (2) a constraint violation or authorization failure or server
> error or unknown error occurs, (3) memory usage decreases, which happens
> when entries successfully send to the tablet server.  Since the BatchWriter
> is stuck on WAITING, none of these conditions are occurring.
>
> The BatchWriter has 3 write threads (the default number).  All three have
> status TIMED_WAITING (parked) in jstack.  Their stack traces don't give
> useful information.
>
> Here's what I can tell from the tserver logs.  A new table (and tablet)
> was created successfully.  The BatchWriter started writing to this tablet
> steadily.  The logs show that the tablet (vb<<) flushed every 5 seconds or
> so and major compacted at a steady periodic rate.
>
> Everything looks good, until vb<< grew large enough that it needed
> splitting.  This occurred about 42 minutes after the BatchWriter started
> writing entries.  The logs show a failure in an UpdateSession that popped
> up in the middle of the split operation.  This failure continues to show
> for the next 15 hours.
>
> I copied the portion of the tserver logs that look relevant to the split
> below.  I highlighted the line reporting the first failure.  It occurs in
> between when the split starts and when it finishes.
>
> Any idea what could have caused this?  I don't know if the failure is
> related to the BatchWriter being stuck in WAITING.  It seems likely.  I
> think it is weird that the 3 write threads are all idle; at least one of
> them should be doing something if the thread calling addMutation() is
> waiting.
>
> Here is a pastebin of the jstack <http://pastebin.com/TLfdrDwX>, though I
> think I wrote the useful parts from it.
>
> 2016-04-17 22:38:06,436 [tablet.Tablet] TABLET_HIST: vb<< closed
> 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for low split
> vb;2436<
>  [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs:
> //localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
> hdfs://localhost:90
> 00/accumulo/tables/vb/default_tablet/M0000a9g.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
> 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for high split
> vb<;2436
>  [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs
> ://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
> hdfs://localhost:9
> 000/accumulo/tables/vb/default_tablet/M0000a9g.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
> 2016-04-17 22:38:06,440 [tserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:42302 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
> 2016-04-17 22:38:06,472 [tablet.Tablet] TABLET_HIST: vb<< split vb;2436<
> vb<;2436
> 2016-04-17 22:38:06,472 [tablet.Tablet] DEBUG: offline split time :   3.22
> secs
> *2016-04-17 22:38:06,472 [tserver.TabletServer] INFO : Starting split:
> vb<<*
> 2016-04-17 22:38:06,472 [tserver.TabletServer] DEBUG: UpSess
> 127.0.0.1:41090 0 in 3.059s, at=[0 0 0.00 1] ft=3.010s(pt=3.010s
> lt=0.000s ct=0.000s)
> 2016-04-17 22:38:06,472 [tserver.NativeMap] DEBUG: Allocated native map
> 0x0000000009a20e60
>
> *2016-04-17 22:38:06,473 [tserver.TabletServer] DEBUG: Failures: 1, first
> extent vb<< successful commits: 0*
>
> 2016-04-17 22:38:06,473 [tablet.Tablet] DEBUG: Reloading constraints for
> extent: vb;2436<
> 2016-04-17 22:38:06,473 [constraints.ConstraintChecker] DEBUG: Loaded
> constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint
> for vb
> 2016-04-17 22:38:06,474 [tablet.Tablet] TABLET_HIST: vb;2436< opened
> 2016-04-17 22:38:06,474 [tserver.NativeMap] DEBUG: Allocated native map
> 0x0000000003edf090
> 2016-04-17 22:38:06,474 [tablet.Tablet] DEBUG: Reloading constraints for
> extent: vb<;2436
> 2016-04-17 22:38:06,474 [constraints.ConstraintChecker] DEBUG: Loaded
> constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint
> for vb
> 2016-04-17 22:38:06,475 [tablet.Tablet] TABLET_HIST: vb<;2436 opened
> *2016-04-17 22:38:06,475 [tserver.TabletServer] INFO : Tablet split: vb<<
> size0 539900144 size1 539900148 time 8268ms*
> 2016-04-17 22:38:06,594 [tserver.TabletServer] DEBUG: MultiScanSess
> 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
> ranges:1)
> 2016-04-17 22:38:06,597 [tserver.TabletServer] DEBUG: MultiScanSess
> 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
> ranges:1)
> 2016-04-17 22:38:07,568 [tserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:42307 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 2016-04-17 22:38:07,571 [zookeeper.DistributedWorkQueue] DEBUG: Looking
> for work in /accumulo/c7998e9d-129d-4c61-9732-45bb6bcab3f0/bulk_failed_copyq
> 2016-04-17 22:38:07,940 [tserver.TabletServer] DEBUG: UpSess
> 127.0.0.1:41090 0 in 0.067s, at=[1 1 1.00 1] ft=0.000s(pt=0.000s
> lt=0.000s ct=0.000s)
> 2016-04-17 22:38:07,941 [tserver.TabletServer] DEBUG: Failures: 1, first
> extent vb<< successful commits: 0
>
>
>
>
>
>
>
>
>
>
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message