hadoop-hdfs-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Konstantin Shvachko <shv.had...@gmail.com>
Subject Re: Heads up: merge for QJM branch soon
Date Thu, 20 Sep 2012 00:20:05 GMT
Hi Todd,

I was wondering if you considered to make QuorumJournal a separate
project or subproject.
Given that
- it is 6600 lines of code
- the code is all new
- well separated in a separate package
- implements reliable journaling, which can have alternative
approaches (say Bookeeper)

Taking all that into account it seems this work by itself can and
should form a project.
As an alternative of merging it under the HDFS umbrella.

Thanks,
--Konstantin

On Wed, Sep 19, 2012 at 12:53 PM, Todd Lipcon <todd@cloudera.com> wrote:
> Hi all,
>
> Work has been progressing steadily for the last several months on the
> HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature
> complete", and I believe ready for a merge soon now. Here is a brief
> overview of some salient points:
>
> * The QJM can be configured to act as a shared journal directory much
> the same way that the BKJM can. It supports all of the same operations
> as the other JMs in trunk, and fulfills all of the stated design goals
> from the design document.
> * Documentation has been added to the HA guide to explain how to set
> up the QJM as a shared edits mechanism.
> * Metrics have been added to both the NameNode (client side) and
> JournalNode (server side). These metrics should be sufficient to
> detect potential issues both in terms of availability and in operation
> latency.
> * Security is fully implemented with the normal mechanisms
> * The branch has been swept for findbugs, javac warnings, license
> headers, etc, and should be "good to go" by those standards. It is
> also fully up-to-date with trunk.
> * The design doc on HDFS-3077 has been updated to reflect the final
> state of the implementation.
>
> As this is a critical part of NN metadata storage, I imagine people
> will be interested to hear about the test status as well:
>
> * Unit/functional test coverage is pretty high. As a rough measure,
> there are ~2300 lines of test code (via sloccount, ie not counting
> comments etc) vs 3300 lines of non-test code.  The test coverage of
> the new package is as follows: 92% coverage of client code, 86%
> coverage of server code. The uncovered areas are mostly
> assertion/sanity checks that never fail, and security code which can't
> be tested by unit tests.
> * In terms of state-space coverage, there is one randomized stress
> test which injects faults randomly. This test has caught almost every
> bug in the design or implementation, and alone covers ~80% of the
> code. Since it is randomized, running it repeatedly covers more areas
> of the state-space  -- I've written a small MR job which runs it in
> parallel on a Hadoop cluster, and have run it for many slot-years.
> (Actually, this test case uncovered an unrelated Jetty bug that has
> been hitting us for a couple years now!)
> * In terms of actual cluster testing:
> ** We have done cluster testing of a small secure HA cluster using QJM
> for shared edits. This covers the security code which cannot be
> covered by the automated tests.
> ** We have been running a QJM-based HA setup on a 100-node test
> cluster for several weeks with no new issues in quite some time. This
> cluster runs a mixed QA workload - eg hive benchmarks,
> teragen/terasort, gridmix, etc.
> ** We have tested failover/failback in both small and large clusters.
>
> In terms of performance:
>
> I collected the logs from the above-mentioned 100-node test cluster,
> and looked at the "SyncTimes" reported by the periodic FSEditLog
> statistics printout. The active NN in this cluster is configured to
> write to two local disks, and write shared edits to a
> QuorumJournalManager. The QJM is configured with three JournalNodes,
> all in the same rack as the active NN, and one on the same machine as
> the active NN. The average sync times are: QJM: 6.6ms, Local disk 1:
> 7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are:
> QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the
> quorum behavior achieves the goal of smoothing out latencies, since it
> can proceed even if one of the underlying disks is temporarily slow.
>
> I also ran some manual tests by running a loop like: while true ; do
> sleep 0.1 ; kill -STOP $PID_OF_JN ; sleep 0.5 ; kill -CONT $PID_OF_JN
> ; done. This allows the JN to only run 100ms out of every 600ms. I
> applied a client load to the NN and verified that the NN's operation
> latency was unaffected even though one of the JNs slowly was falling
> behind.
>
> In terms of risk assessment for the merge:
>
> - This feature is entirely optional. The changes to existing code in
> the branch are fairly minimal improvements to the support for
> pluggable JournalManagers. We have been merging these changes into our
> CDH nightly tree and performed all of our usual daily QA and
> integration against these builds, so I feel confident that they do not
> introduce any new risk.
> - Even when the feature is enabled for shared edits, users will likely
> continue to log edits to locally mounted FileJournalManagers in
> addition to the shared QJM. So, if there is any bug in the QJM, the
> system metadata is not at risk.
> - Overall, we have taken a conservative approach and favored
> durability/correctness over availability. For example, we have many
> extra sanity checks and assertions to check our assumptions, and if
> any fail, we will abort rather than continue on with a risk of
> data-loss.
>
> So, in summary, I think the branch is very nearly ready to be merged
> into trunk. I will continue to perform stress tests, but at this point
> I am not aware of any deficiencies which would be considered a
> blocker. If anyone has any questions about the code, design, or tests,
> please feel free to chime in on HDFS-3077 or the relevant subtasks.
>
> Thanks
> Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera

Mime
View raw message