accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Elser <josh.el...@gmail.com>
Subject Re: Checking what a BatchWriter is stuck on; failure during split
Date Tue, 19 Apr 2016 22:26:45 GMT
Nice findings. Sorry I haven't had any cycles to dig into this myself.

I look forward to hearing what you find :)

Dylan Hutchison wrote:
> I investigated a bit more and I am pretty sure the problem is that the
> BatchWriter is not recognizing that the tablet vb<<  split into vb;2436<  and
> vb<;2436.  It keeps trying to update the closed tablet vb<<.  Each update
> writes 0 mutations and records a failure at the tablet server UpdateSession
> because vb<<  is closed.
>
> I'm not sure why this is happening because the BatchWriter should have
> invalidated its tablet locator cache upon recognizing a failure.  Then it
> would recognize that the entries it wants to write fall into the new
> tablets vb;2436<  and vb<;2436.  I think there is a timing bug for this edge
> case, when a table split occurs during heavy writes.
>
> I will write this up if I can reproduce it.  Maybe it is too rare to matter.
>
> Cheers, Dylan
>
> On Mon, Apr 18, 2016 at 2:38 PM, Dylan Hutchison<dhutchis@cs.washington.edu
>> wrote:
>
>> Hi devs,
>>
>> I'd like to ask your help in figuring out what is happening to a
>> BatchWriter.  The following gives my reasoning so far.
>>
>> In Accumulo 1.7.1, I have a BatchWriter that is stuck in WAITING status in
>> its addMutation method.  I saw that it is stuck by jstack'ing the Accumulo
>> client.  It's been stuck like this for 16 hours.
>>
>> The BatchWriter is supposed to wait when a mutation is added if no
>> failures have recorded and either (a) the total memory used exceeds the
>> maximum allowed for the BatchWriter, or (b) the batchwriter is currently
>> flushed.  So we conclude that one of (a) or (b) have occurred and no
>> failures were recorded, at the time when addMutation was called.  I think
>> (a) is likely.
>>
>> The BatchWriter is supposed to notify itself when either (1) a flush
>> finishes, (2) a constraint violation or authorization failure or server
>> error or unknown error occurs, (3) memory usage decreases, which happens
>> when entries successfully send to the tablet server.  Since the BatchWriter
>> is stuck on WAITING, none of these conditions are occurring.
>>
>> The BatchWriter has 3 write threads (the default number).  All three have
>> status TIMED_WAITING (parked) in jstack.  Their stack traces don't give
>> useful information.
>>
>> Here's what I can tell from the tserver logs.  A new table (and tablet)
>> was created successfully.  The BatchWriter started writing to this tablet
>> steadily.  The logs show that the tablet (vb<<) flushed every 5 seconds or
>> so and major compacted at a steady periodic rate.
>>
>> Everything looks good, until vb<<  grew large enough that it needed
>> splitting.  This occurred about 42 minutes after the BatchWriter started
>> writing entries.  The logs show a failure in an UpdateSession that popped
>> up in the middle of the split operation.  This failure continues to show
>> for the next 15 hours.
>>
>> I copied the portion of the tserver logs that look relevant to the split
>> below.  I highlighted the line reporting the first failure.  It occurs in
>> between when the split starts and when it finishes.
>>
>> Any idea what could have caused this?  I don't know if the failure is
>> related to the BatchWriter being stuck in WAITING.  It seems likely.  I
>> think it is weird that the 3 write threads are all idle; at least one of
>> them should be doing something if the thread calling addMutation() is
>> waiting.
>>
>> Here is a pastebin of the jstack<http://pastebin.com/TLfdrDwX>, though I
>> think I wrote the useful parts from it.
>>
>> 2016-04-17 22:38:06,436 [tablet.Tablet] TABLET_HIST: vb<<  closed
>> 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for low split
>> vb;2436<
>>   [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs:
>> //localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
>> hdfs://localhost:90
>> 00/accumulo/tables/vb/default_tablet/M0000a9g.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
>> 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for high split
>> vb<;2436
>>   [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs
>> ://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
>> hdfs://localhost:9
>> 000/accumulo/tables/vb/default_tablet/M0000a9g.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
>> 2016-04-17 22:38:06,440 [tserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:42302 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
>> 2016-04-17 22:38:06,472 [tablet.Tablet] TABLET_HIST: vb<<  split vb;2436<
>> vb<;2436
>> 2016-04-17 22:38:06,472 [tablet.Tablet] DEBUG: offline split time :   3.22
>> secs
>> *2016-04-17 22:38:06,472 [tserver.TabletServer] INFO : Starting split:
>> vb<<*
>> 2016-04-17 22:38:06,472 [tserver.TabletServer] DEBUG: UpSess
>> 127.0.0.1:41090 0 in 3.059s, at=[0 0 0.00 1] ft=3.010s(pt=3.010s
>> lt=0.000s ct=0.000s)
>> 2016-04-17 22:38:06,472 [tserver.NativeMap] DEBUG: Allocated native map
>> 0x0000000009a20e60
>>
>> *2016-04-17 22:38:06,473 [tserver.TabletServer] DEBUG: Failures: 1, first
>> extent vb<<  successful commits: 0*
>>
>> 2016-04-17 22:38:06,473 [tablet.Tablet] DEBUG: Reloading constraints for
>> extent: vb;2436<
>> 2016-04-17 22:38:06,473 [constraints.ConstraintChecker] DEBUG: Loaded
>> constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint
>> for vb
>> 2016-04-17 22:38:06,474 [tablet.Tablet] TABLET_HIST: vb;2436<  opened
>> 2016-04-17 22:38:06,474 [tserver.NativeMap] DEBUG: Allocated native map
>> 0x0000000003edf090
>> 2016-04-17 22:38:06,474 [tablet.Tablet] DEBUG: Reloading constraints for
>> extent: vb<;2436
>> 2016-04-17 22:38:06,474 [constraints.ConstraintChecker] DEBUG: Loaded
>> constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint
>> for vb
>> 2016-04-17 22:38:06,475 [tablet.Tablet] TABLET_HIST: vb<;2436 opened
>> *2016-04-17 22:38:06,475 [tserver.TabletServer] INFO : Tablet split: vb<<
>> size0 539900144 size1 539900148 time 8268ms*
>> 2016-04-17 22:38:06,594 [tserver.TabletServer] DEBUG: MultiScanSess
>> 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
>> ranges:1)
>> 2016-04-17 22:38:06,597 [tserver.TabletServer] DEBUG: MultiScanSess
>> 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
>> ranges:1)
>> 2016-04-17 22:38:07,568 [tserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:42307 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>> 2016-04-17 22:38:07,571 [zookeeper.DistributedWorkQueue] DEBUG: Looking
>> for work in /accumulo/c7998e9d-129d-4c61-9732-45bb6bcab3f0/bulk_failed_copyq
>> 2016-04-17 22:38:07,940 [tserver.TabletServer] DEBUG: UpSess
>> 127.0.0.1:41090 0 in 0.067s, at=[1 1 1.00 1] ft=0.000s(pt=0.000s
>> lt=0.000s ct=0.000s)
>> 2016-04-17 22:38:07,941 [tserver.TabletServer] DEBUG: Failures: 1, first
>> extent vb<<  successful commits: 0
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Mime
View raw message