accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dylan Hutchison <dhutc...@cs.washington.edu>
Subject Checking what a BatchWriter is stuck on; failure during split
Date Mon, 18 Apr 2016 21:38:32 GMT
Hi devs,

I'd like to ask your help in figuring out what is happening to a
BatchWriter.  The following gives my reasoning so far.

In Accumulo 1.7.1, I have a BatchWriter that is stuck in WAITING status in
its addMutation method.  I saw that it is stuck by jstack'ing the Accumulo
client.  It's been stuck like this for 16 hours.

The BatchWriter is supposed to wait when a mutation is added if no failures
have recorded and either (a) the total memory used exceeds the maximum
allowed for the BatchWriter, or (b) the batchwriter is currently flushed.
So we conclude that one of (a) or (b) have occurred and no failures were
recorded, at the time when addMutation was called.  I think (a) is likely.

The BatchWriter is supposed to notify itself when either (1) a flush
finishes, (2) a constraint violation or authorization failure or server
error or unknown error occurs, (3) memory usage decreases, which happens
when entries successfully send to the tablet server.  Since the BatchWriter
is stuck on WAITING, none of these conditions are occurring.

The BatchWriter has 3 write threads (the default number).  All three have
status TIMED_WAITING (parked) in jstack.  Their stack traces don't give
useful information.

Here's what I can tell from the tserver logs.  A new table (and tablet) was
created successfully.  The BatchWriter started writing to this tablet
steadily.  The logs show that the tablet (vb<<) flushed every 5 seconds or
so and major compacted at a steady periodic rate.

Everything looks good, until vb<< grew large enough that it needed
splitting.  This occurred about 42 minutes after the BatchWriter started
writing entries.  The logs show a failure in an UpdateSession that popped
up in the middle of the split operation.  This failure continues to show
for the next 15 hours.

I copied the portion of the tserver logs that look relevant to the split
below.  I highlighted the line reporting the first failure.  It occurs in
between when the split starts and when it finishes.

Any idea what could have caused this?  I don't know if the failure is
related to the BatchWriter being stuck in WAITING.  It seems likely.  I
think it is weird that the 3 write threads are all idle; at least one of
them should be doing something if the thread calling addMutation() is
waiting.

Here is a pastebin of the jstack <http://pastebin.com/TLfdrDwX>, though I
think I wrote the useful parts from it.

2016-04-17 22:38:06,436 [tablet.Tablet] TABLET_HIST: vb<< closed
2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for low split vb;2436<
 [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs:
//localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
hdfs://localhost:90
00/accumulo/tables/vb/default_tablet/M0000a9g.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for high split
vb<;2436
 [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs
://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
hdfs://localhost:9
000/accumulo/tables/vb/default_tablet/M0000a9g.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
2016-04-17 22:38:06,440 [tserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:42302 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
2016-04-17 22:38:06,472 [tablet.Tablet] TABLET_HIST: vb<< split vb;2436<
vb<;2436
2016-04-17 22:38:06,472 [tablet.Tablet] DEBUG: offline split time :   3.22
secs
*2016-04-17 22:38:06,472 [tserver.TabletServer] INFO : Starting split: vb<<*
2016-04-17 22:38:06,472 [tserver.TabletServer] DEBUG: UpSess 127.0.0.1:41090
0 in 3.059s, at=[0 0 0.00 1] ft=3.010s(pt=3.010s lt=0.000s ct=0.000s)
2016-04-17 22:38:06,472 [tserver.NativeMap] DEBUG: Allocated native map
0x0000000009a20e60

*2016-04-17 22:38:06,473 [tserver.TabletServer] DEBUG: Failures: 1, first
extent vb<< successful commits: 0*

2016-04-17 22:38:06,473 [tablet.Tablet] DEBUG: Reloading constraints for
extent: vb;2436<
2016-04-17 22:38:06,473 [constraints.ConstraintChecker] DEBUG: Loaded
constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint
for vb
2016-04-17 22:38:06,474 [tablet.Tablet] TABLET_HIST: vb;2436< opened
2016-04-17 22:38:06,474 [tserver.NativeMap] DEBUG: Allocated native map
0x0000000003edf090
2016-04-17 22:38:06,474 [tablet.Tablet] DEBUG: Reloading constraints for
extent: vb<;2436
2016-04-17 22:38:06,474 [constraints.ConstraintChecker] DEBUG: Loaded
constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint
for vb
2016-04-17 22:38:06,475 [tablet.Tablet] TABLET_HIST: vb<;2436 opened
*2016-04-17 22:38:06,475 [tserver.TabletServer] INFO : Tablet split: vb<<
size0 539900144 size1 539900148 time 8268ms*
2016-04-17 22:38:06,594 [tserver.TabletServer] DEBUG: MultiScanSess
127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
ranges:1)
2016-04-17 22:38:06,597 [tserver.TabletServer] DEBUG: MultiScanSess
127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
ranges:1)
2016-04-17 22:38:07,568 [tserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:42307 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
2016-04-17 22:38:07,571 [zookeeper.DistributedWorkQueue] DEBUG: Looking for
work in /accumulo/c7998e9d-129d-4c61-9732-45bb6bcab3f0/bulk_failed_copyq
2016-04-17 22:38:07,940 [tserver.TabletServer] DEBUG: UpSess 127.0.0.1:41090
0 in 0.067s, at=[1 1 1.00 1] ft=0.000s(pt=0.000s lt=0.000s ct=0.000s)
2016-04-17 22:38:07,941 [tserver.TabletServer] DEBUG: Failures: 1, first
extent vb<< successful commits: 0

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message