hbase-issues mailing list archives

From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
Date Sat, 06 Feb 2016 02:54:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135501#comment-15135501 ]

Duo Zhang commented on HBASE-14790:
-----------------------------------

[~stack] Oh, there is a WALPE tool; I did not know about it before. I ran a randomWrite test
with the PerformanceEvaluation tool...

The code quality is not yet good enough to merge. There are also two questions I need to
settle before I start preparing a patch:

1. Where should we place the FanOut stream? To implement the new stream I use a lot of
reflection, plus some HDFS methods that are visible only to tests. Since it gives better
performance, is that enough for the HDFS folks to accept it as part of the HDFS project?

2. I do not introduce a new WALProvider. Since the data is still written to HDFS, I just
introduce an AsyncFSHLog that shares a base class (AbstractFSHLog in the HBASE-14790 branch)
with FSHLog, and add a flag that tells DefaultWALProvider whether to use FSHLog or
AsyncFSHLog. I also introduce a new AsyncWriter interface; its append method only buffers
data in memory. What do you think, [~stack] and [~busbey]? Do you have other ideas for how
to integrate the async logic into the WAL?
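
To make the shape of point 2 concrete, here is a rough, self-contained sketch. Only the
names AsyncWriter, AbstractFSHLog, FSHLog and AsyncFSHLog come from the text above; every
method signature, the InMemoryAsyncWriter stand-in, and the WalSelector helper are
assumptions made for illustration and are not the code in the HBASE-14790 branch.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: all signatures below are assumed, not taken from HBase.
interface AsyncWriter {
    // Buffers the entry in memory only; nothing leaves the process here.
    void append(byte[] entry);

    // Flushes what has been buffered and completes with the total flushed length.
    CompletableFuture<Long> sync();
}

// Stand-in writer that "flushes" into a byte array instead of a real HDFS stream.
class InMemoryAsyncWriter implements AsyncWriter {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final ByteArrayOutputStream flushed = new ByteArrayOutputStream();

    @Override
    public synchronized void append(byte[] entry) {
        buffer.write(entry, 0, entry.length);
    }

    @Override
    public synchronized CompletableFuture<Long> sync() {
        flushed.write(buffer.toByteArray(), 0, buffer.size());
        buffer.reset();
        return CompletableFuture.completedFuture((long) flushed.size());
    }

    synchronized int bufferedBytes() { return buffer.size(); }

    synchronized int flushedBytes() { return flushed.size(); }
}

// Shared base class, with the two WAL implementations reduced to a name.
abstract class AbstractFSHLog {
    abstract String name();
}

class FSHLog extends AbstractFSHLog {
    @Override
    String name() { return "FSHLog"; }
}

class AsyncFSHLog extends AbstractFSHLog {
    @Override
    String name() { return "AsyncFSHLog"; }
}

// Hypothetical stand-in for the flag check inside the provider.
class WalSelector {
    static AbstractFSHLog create(boolean useAsync) {
        return useAsync ? new AsyncFSHLog() : new FSHLog();
    }
}
```

The point of the sketch is the split of responsibilities: append is cheap and never touches
the network, while sync is the only call that has to wait on anything, and the choice
between the two WAL implementations stays a single flag rather than a new provider.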

Thanks.

> Implement a new DFSOutputStream for logging WAL only
> ----------------------------------------------------
>
>                 Key: HBASE-14790
>                 URL: https://issues.apache.org/jira/browse/HBASE-14790
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all purposes. But
in fact, we do not need most of its features if we only want to log the WAL. For example, we
do not need pipeline recovery, since we can just close the old logger and open a new one.
Likewise, we do not need to write multiple blocks, since we can also open a new logger if
the old file grows too large.
> Most importantly, the complicated logic of the original {{DFSOutputStream}} makes it hard
to handle all the corner cases needed to avoid data loss or data inconsistency (such as
HBASE-14004). The complicated logic also forces us to use some magical tricks to increase
performance. For example, we need to use multiple threads to call {{hflush}} when logging,
and we currently use 5 threads. But why 5 and not 10 or 100?
> So I propose that we implement our own {{DFSOutputStream}} for logging the WAL, for
correctness and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
