hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
Date Tue, 01 Sep 2015 05:51:46 GMT

     [ https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-14317:
--------------------------
    Attachment: repro.txt

Repro of the hang seen in the original attachment, raw.php. We cannot replace the log because
we are waiting on the zigzag latch. We cannot close the region because we are waiting on a flush.
Flushes cannot progress because they are waiting on their sequenceid. The test,
testLockedUpWALSystem, times out after 60 seconds of hang. The test is ugly because it has to
stand up a region and a log roller; it is mostly boilerplate. Also reverts HBASE-13971, which
only confuses. Adds a method to FSHLog so I can hold processing around zigzag latch creation.

The hang happens if a sync comes off the ring buffer AFTER we've created a SafePointZigZagLatch
-- the very existence of this object means the ring buffer consuming thread will fall into
the attain-safe-point code block (even if the interrupting sync just throws an exception)
-- but BEFORE we have published the replaceWriter zigzag sync onto the ring buffer: i.e. if
the sync comes in AFTER line #794 in the below but BEFORE #806.

{code}
 784   Path replaceWriter(final Path oldPath, final Path newPath, Writer nextWriter,
 785       final FSDataOutputStream nextHdfsOut)
 786   throws IOException {
 787     // Ask the ring buffer writer to pause at a safe point.  Once we do this, the writer
 788     // thread will eventually pause. An error hereafter needs to release the writer thread
 789     // regardless -- hence the finally block below.  Note, this method is called from the FSHLog
 790     // constructor BEFORE the ring buffer is set running so it is null on first time through
 791     // here; allow for that.
 792     SyncFuture syncFuture = null;
 793     SafePointZigZagLatch zigzagLatch = (this.ringBufferEventHandler == null)?
 794       null: this.ringBufferEventHandler.attainSafePoint();
 795     afterZigZagLatch();
 796     TraceScope scope = Trace.startSpan("FSHFile.replaceWriter");
 797     try {
 798       // Wait on the safe point to be achieved.  Send in a sync in case nothing has hit the
 799       // ring buffer between the above notification of writer that we want it to go to
 800       // 'safe point' and then here where we are waiting on it to attain safe point.  Use
 801       // 'sendSync' instead of 'sync' because we do not want this thread to block waiting on it
 802       // to come back.  Cleanup this syncFuture down below after we are ready to run again.
 803       try {
 804         if (zigzagLatch != null) {
 805           Trace.addTimelineAnnotation("awaiting safepoint");
 806           syncFuture = zigzagLatch.waitSafePoint(publishSyncOnRingBuffer());
...{code}
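
The window described above can be restated as a small ordering check. This is only a sketch: the class `HangWindowSketch` and the step names are hypothetical labels for the three events in the description, not FSHLog identifiers, and the check models nothing beyond their relative order.

```java
import java.util.Arrays;
import java.util.List;

public class HangWindowSketch {
  /** Hypothetical step names; they are not FSHLog identifiers. */
  public enum Step {
    CREATE_LATCH,            // roller: attainSafePoint() hands back the latch (line #794)
    CONSUME_OTHER_SYNC,      // consumer: some other sync comes off the ring buffer
    PUBLISH_SAFE_POINT_SYNC  // roller: publishSyncOnRingBuffer() is called (line #806)
  }

  /**
   * Returns true when the other sync is consumed after the latch exists
   * but before the replaceWriter sync has been published.
   */
  public static boolean inHangWindow(List<Step> order) {
    int create = order.indexOf(Step.CREATE_LATCH);
    int consume = order.indexOf(Step.CONSUME_OTHER_SYNC);
    int publish = order.indexOf(Step.PUBLISH_SAFE_POINT_SYNC);
    return create >= 0 && publish >= 0 && create < consume && consume < publish;
  }

  public static void main(String[] args) {
    List<Step> bad = Arrays.asList(
        Step.CREATE_LATCH, Step.CONSUME_OTHER_SYNC, Step.PUBLISH_SAFE_POINT_SYNC);
    List<Step> ok = Arrays.asList(
        Step.CONSUME_OTHER_SYNC, Step.CREATE_LATCH, Step.PUBLISH_SAFE_POINT_SYNC);
    System.out.println("bad ordering in window: " + inHangWindow(bad));
    System.out.println("ok ordering in window: " + inHangWindow(ok));
  }
}
```

Only the first ordering falls inside the window; the repro test presumably forces that ordering by holding the roller thread in afterZigZagLatch().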

The fix is hereabouts:

{code}
    private void attainSafePoint(final long currentSequence) {
      if (this.zigzagLatch == null || !this.zigzagLatch.isCocked()) return;
...
{code}

The check needs to be more than the existence of zigzagLatch and that it is cocked... Let me
chat w/ [~eclark]
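
For readers who have not looked at the latch, here is a minimal sketch of a two-latch "zigzag" handshake and of what a "cocked" check could mean. It is an assumption-laden simplification, not the FSHLog code: the class `ZigZagLatchSketch` is hypothetical, the real waitSafePoint takes a SyncFuture (omitted here), and the `isCocked` definition below is my guess at the intent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class ZigZagLatchSketch {
  // Counted down by the consumer thread once it has reached the safe point.
  private final CountDownLatch attained = new CountDownLatch(1);
  // Counted down by the roller thread once it is done and the consumer may resume.
  private final CountDownLatch released = new CountDownLatch(1);

  /** Roller side: block until the consumer reports it is at the safe point. */
  public void waitSafePoint() throws InterruptedException {
    attained.await();
  }

  /** Consumer side: report the safe point, then park until released. */
  public void safePointAttained() throws InterruptedException {
    attained.countDown();
    released.await();
  }

  /** Roller side: let the parked consumer run again. */
  public void releaseSafePoint() {
    released.countDown();
  }

  /** Assumed meaning of "cocked": neither half of the handshake has happened yet. */
  public boolean isCocked() {
    return attained.getCount() > 0 && released.getCount() > 0;
  }

  /** Runs one complete handshake and returns the observed events in order. */
  public static List<String> demo() {
    List<String> log = Collections.synchronizedList(new ArrayList<>());
    ZigZagLatchSketch latch = new ZigZagLatchSketch();
    log.add("cocked=" + latch.isCocked());
    Thread consumer = new Thread(() -> {
      try {
        latch.safePointAttained();  // parks here until releaseSafePoint()
        log.add("consumer resumed");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    consumer.start();
    try {
      latch.waitSafePoint();
      log.add("roller at safe point, cocked=" + latch.isCocked());
      latch.releaseSafePoint();
      consumer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return log;
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

In this shape a consumer that has called safePointAttained() stays parked until someone calls releaseSafePoint(), which is why the condition guarding entry into the safe-point block matters so much.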





> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Priority: Critical
>         Attachments: 14317.test.txt, HBASE-14317-v1.patch, HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch, HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt, subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960), but we get stuck. See the attached
> thread dump and associated log. What is interesting is that syncers are waiting to take syncs
> to run and at the same time we want to flush, so we are waiting on a safe point, but there seems
> to be nothing in our ring buffer; did we go to roll the log and not add a safe point sync to
> clear out the ring buffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
