Date: Fri, 4 Dec 2015 08:03:11 +0000 (UTC)
From: "Duo Zhang (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Comment Edited] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

    [ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041234#comment-15041234 ]

Duo Zhang edited comment on HBASE-14790 at 12/4/15 8:02 AM:
------------------------------------------------------------

Consider these properties: hflush is much faster than hsync, especially in pipeline mode, so we have to use hflush for HBase writes. But data that has been hflushed to a DN without being hsynced may exist only in memory, not on disk, and yet it can already be read by clients.
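The visibility-vs-durability gap described above can be modeled with a small sketch. This is a toy model, not HDFS code: the method names mirror {{hflush}}/{{hsync}} on HDFS's {{FSDataOutputStream}}, but the class and its counters are invented for illustration.

```java
// Toy model of the hflush/hsync distinction (illustrative only -- the real
// calls are FSDataOutputStream#hflush and #hsync in HDFS).
public class WalStreamModel {
  private long written = 0;  // bytes handed to the stream
  private long visible = 0;  // bytes other clients can read (hflush'ed)
  private long durable = 0;  // bytes guaranteed on disk (hsync'ed)

  public void write(int n) { written += n; }

  // hflush: fast -- data reaches DataNode memory and becomes readable,
  // but can still be lost if every replica crashes before it hits disk.
  public void hflush() { visible = written; }

  // hsync: slow -- additionally forces the data onto disk.
  public void hsync() { hflush(); durable = written; }

  public long visibleLength() { return visible; }
  public long durableLength() { return durable; }
}
```

In this model the dangerous window is exactly {{visibleLength() - durableLength()}}: data that a reader (such as a ReplicationSource) can already see, but that could still vanish if all replicas crash.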
So if we hflush data to the DNs, it is read by a ReplicationSource and transferred to the slave cluster, and then the three DNs and the RS in the master cluster all crash, the slave will, after replaying the WALs, hold data that the master has lost... The only way to prevent any data loss is to hsync on every write, but that is too slow. I think most users can tolerate some data loss in exchange for faster writes, but cannot tolerate the slave having more data than the master. Therefore, I think we can do the following:

1. hflush on every write, not hsync, and hsync periodically -- for example, every 1000ms by default? The interval should be configurable, and users can also configure us to hsync on every write, so that no data is lost unless the disks of all DNs fail...
2. The RS tells the ReplicationSource an "acked length", which is the length of the data we have hsynced, not merely hflushed. The ReplicationSource only transfers data up to the acked length, so the slave cluster will never be inconsistent with the master. WAL reading can handle duplicate entries.
3. On WAL logging, if hflush fails, we open a new file and rewrite the entry there, and recover/hsync/close the old file asynchronously.
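The acked-length rule above can be sketched as a filter over WAL entries, under the assumption that each entry records its end offset in the WAL file. {{WalEntry}} and {{AckedLengthShipper}} are invented names for illustration, not actual HBase classes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "acked length" rule: the ReplicationSource ships only WAL
// data that the RS has already made durable with hsync, so the slave cluster
// can never hold data the master might lose. Illustrative names only.
public class AckedLengthShipper {
  static final class WalEntry {
    final long endOffset;  // WAL file offset just past this entry
    WalEntry(long endOffset) { this.endOffset = endOffset; }
  }

  // Keep only entries that lie entirely within the hsync'ed ("acked") prefix.
  static List<WalEntry> shippable(List<WalEntry> entries, long ackedLength) {
    List<WalEntry> out = new ArrayList<>();
    for (WalEntry e : entries) {
      if (e.endOffset <= ackedLength) {
        out.add(e);
      }
    }
    return out;
  }
}
```

An entry that is only hflushed (its end offset is past the acked length) simply waits for the next periodic hsync; shipping it twice later is fine because, as noted above, WAL reading can handle duplicate entries.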
> Implement a new DFSOutputStream for logging WAL only
> ----------------------------------------------------
>
>                 Key: HBASE-14790
>                 URL: https://issues.apache.org/jira/browse/HBASE-14790
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all purposes. But in fact we do not need most of its features if we only want to log the WAL. For example, we do not need pipeline recovery, since we can simply close the old logger and open a new one. Likewise, we do not need to write multiple blocks, since we can also open a new logger once the old file grows too large.
> More importantly, it is hard to handle all the corner cases needed to avoid data loss or data inconsistency (such as HBASE-14004) when using the original {{DFSOutputStream}}, because of its complicated logic. And that complicated logic also forces us to use magical tricks to increase performance. For example, we need multiple threads to call {{hflush}} when logging, and today we use 5 threads. But why 5, not 10 or 100?
> So here I propose that we implement our own {{DFSOutputStream}} for logging the WAL -- for correctness, and also for performance.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)