hbase-issues mailing list archives

From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15537) Make multi WAL work with WALs other than FSHLog
Date Wed, 06 Apr 2016 12:44:25 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228192#comment-15228192 ]

Duo Zhang commented on HBASE-15537:
-----------------------------------

On master, I ran {{TestNamespaceCommands}} locally and it was really slow. I haven't found the
reason yet.

On branch-1, {{TestFailedAppendAndSync}} failed because of a timeout. Judging from the log, there
must be some corner cases that have not been handled.

This is the output of the failed test:
https://builds.apache.org/job/PreCommit-HBASE-Build/1305/testReport/org.apache.hadoop.hbase.regionserver/TestFailedAppendAndSync/testLockupAroundBadAssignSync/
{noformat}
2016-04-06 11:04:38,070 ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
	at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
	at java.lang.Thread.run(Thread.java:745)
2016-04-06 11:04:38,071 DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll requested
2016-04-06 11:04:38,071 DEBUG [Time-limited test] regionserver.HRegion(3842): rollbackMemstore rolled back 1
2016-04-06 11:04:38,148 ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
	at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
	at java.lang.Thread.run(Thread.java:745)
2016-04-06 11:04:38,151 INFO  [Thread-4] wal.FSHLog(870): Rolled WAL /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459940677946 with entries=1, filesize=255 B; new WAL /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459940678071
2016-04-06 11:09:35,215 INFO  [main] regionserver.TestFailedAppendAndSync(93): Cleaning test directory: /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071
{noformat}

You can see that the WAL roll succeeded (we expected an abort here, caused by a WAL roll failure).
For comparison, this is the typical log:
{noformat}
2016-04-06 20:20:21,352 ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
	at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
	at java.lang.Thread.run(Thread.java:745)
2016-04-06 20:20:21,353 DEBUG [Time-limited test] regionserver.HRegion(3842): rollbackMemstore rolled back 1
2016-04-06 20:20:21,354 DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll requested
2016-04-06 20:20:21,378 ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
	at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
	at java.lang.Thread.run(Thread.java:745)
2016-04-06 20:20:21,378 ERROR [Thread-4] wal.FSHLog(881): Failed close of WAL writer /home/zhangduo/hbase/code/hbase-server/target/test-data/dba1afc0-933c-4ac0-ad0c-1688e8e152b5/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459945205555, unflushedEntries=7
org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: java.io.IOException: FAKE! Failed to replace a bad datanode...
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1615)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:833)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:699)
	at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog.rollWriter(TestFailedAppendAndSync.java:122)
	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: FAKE! Failed to replace a bad datanode...
	at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
	... 1 more
2016-04-06 20:21:07,405 INFO  [Thread-4] regionserver.LogRoller(176): LogRoller exiting.
{noformat}

You can see that the second sync error causes a {{FailedSyncBeforeLogCloseException}} and
triggers an abort.
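
A rough sketch for telling the two outcomes apart when scanning test output. It only assumes that the
marker strings appear as they do in the two logs above. {{WalRollOutcome}} and its enum are hypothetical
names made up for illustration, not part of HBase, and the check says nothing about the root cause.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reading test output; not part of HBase.
final class WalRollOutcome {

    enum Outcome { ROLL_SUCCEEDED_AFTER_SYNC_ERROR, ROLL_FAILED, UNKNOWN }

    // Marker strings are copied from the FSHLog messages quoted above.
    static Outcome classify(List<String> logLines) {
        boolean sawSyncError = false;
        for (String line : logLines) {
            if (line.contains("Error syncing")) {
                sawSyncError = true;
            }
            if (line.contains("Failed close of WAL writer")) {
                // The expected path: the roll fails, which leads to an abort.
                return Outcome.ROLL_FAILED;
            }
            if (sawSyncError && line.contains("Rolled WAL")) {
                // The suspicious path: the roll completes despite a sync error.
                return Outcome.ROLL_SUCCEEDED_AFTER_SYNC_ERROR;
            }
        }
        return Outcome.UNKNOWN;
    }

    public static void main(String[] args) {
        List<String> jenkinsRun = Arrays.asList(
            "ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL",
            "INFO  [Thread-4] wal.FSHLog(870): Rolled WAL ...");
        System.out.println(classify(jenkinsRun));
    }
}
```

Run against the first log, this reports the roll as having succeeded after a sync error. Against the
second log, it reports a failed roll.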

I cannot reproduce it locally right now. Should we open an issue for it? This may be a data loss issue... [~stack]

Thanks.

> Make multi WAL work with WALs other than FSHLog
> -----------------------------------------------
>
>                 Key: HBASE-15537
>                 URL: https://issues.apache.org/jira/browse/HBASE-15537
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>             Fix For: 2.0.0, 1.3.0, 1.4.0
>
>         Attachments: HBASE-15537-branch-1.patch, HBASE-15537-v3.patch, HBASE-15537-v4.patch, HBASE-15537-v5.patch, HBASE-15537-v6.patch, HBASE-15537.patch, HBASE-15537_v2.patch
>
>
> The multi WAL should not be bound with {{FSHLog}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
