hadoop-hdfs-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Heads up: merge for QJM branch soon
Date Thu, 20 Sep 2012 00:31:49 GMT
Hi Konstantin,

I'd be open to that. I know that we plan to ship it as a recommended
option for CDH, but if the community feels it's better suited to being
built as a separate module, either inside or outside of Hadoop, I'm
open to discussing that. You're absolutely correct that it is meant to
"stand alone" in its own HDFS package.

-Todd

On Wed, Sep 19, 2012 at 5:20 PM, Konstantin Shvachko
<shv.hadoop@gmail.com> wrote:
> Hi Todd,
>
> I was wondering if you considered making QuorumJournal a separate
> project or subproject.
> Given that:
> - it is 6600 lines of code
> - the code is all new
> - it is well isolated in its own package
> - it implements reliable journaling, for which alternative
> approaches exist (say BookKeeper)
>
> Taking all that into account, it seems this work can and should
> form a project by itself, as an alternative to merging it under the
> HDFS umbrella.
>
> Thanks,
> --Konstantin
>
> On Wed, Sep 19, 2012 at 12:53 PM, Todd Lipcon <todd@cloudera.com> wrote:
>> Hi all,
>>
>> Work has been progressing steadily for the last several months on the
>> HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature
>> complete", and I believe it is ready for a merge soon. Here is a brief
>> overview of some salient points:
>>
>> * The QJM can be configured to act as a shared journal directory much
>> the same way that the BKJM can. It supports all of the same operations
>> as the other JMs in trunk, and fulfills all of the stated design goals
>> from the design document.
>> * Documentation has been added to the HA guide to explain how to set
>> up the QJM as a shared edits mechanism (see the example configuration
>> after this list).
>> * Metrics have been added to both the NameNode (client side) and
>> JournalNode (server side). These metrics should be sufficient to
>> detect potential issues in terms of both availability and operation
>> latency.
>> * Security is fully implemented with the normal mechanisms.
>> * The branch has been swept for findbugs, javac warnings, license
>> headers, etc., and should be "good to go" by those standards. It is
>> also fully up-to-date with trunk.
>> * The design doc on HDFS-3077 has been updated to reflect the final
>> state of the implementation.
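>>
>> For illustration, a minimal hdfs-site.xml sketch for QJM-based shared
>> edits might look like this (hostnames, the journal ID "mycluster",
>> and paths are placeholders; 8485 is the default JournalNode RPC port):
>>
>>   <!-- On the NameNodes: write shared edits to the JN quorum -->
>>   <property>
>>     <name>dfs.namenode.shared.edits.dir</name>
>>     <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
>>   </property>
>>
>>   <!-- On each JournalNode: local storage for the journal -->
>>   <property>
>>     <name>dfs.journalnode.edits.dir</name>
>>     <value>/data/1/dfs/jn</value>
>>   </property>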
>>
>> As this is a critical part of NN metadata storage, I imagine people
>> will be interested to hear about the test status as well:
>>
>> * Unit/functional test coverage is pretty high. As a rough measure,
>> there are ~2300 lines of test code (via sloccount, i.e. not counting
>> comments, etc.) vs 3300 lines of non-test code. The test coverage of
>> the new package is as follows: 92% coverage of client code, 86%
>> coverage of server code. The uncovered areas are mostly
>> assertion/sanity checks that never fail, and security code which
>> can't be exercised by unit tests.
>> * In terms of state-space coverage, there is one randomized stress
>> test which injects faults randomly. This test has caught almost every
>> bug in the design or implementation, and alone covers ~80% of the
>> code. Since it is randomized, running it repeatedly covers more areas
>> of the state space, so I've written a small MR job which runs it in
>> parallel on a Hadoop cluster, and have run it for many slot-years.
>> (Incidentally, this test case uncovered an unrelated Jetty bug that
>> has been hitting us for a couple of years now!)
>> * In terms of actual cluster testing:
>> ** We have done cluster testing of a small secure HA cluster using QJM
>> for shared edits. This covers the security code which cannot be
>> covered by the automated tests.
>> ** We have been running a QJM-based HA setup on a 100-node test
>> cluster for several weeks, with no new issues in quite some time. This
>> cluster runs a mixed QA workload, e.g. Hive benchmarks,
>> teragen/terasort, gridmix, etc.
>> ** We have tested failover/failback in both small and large clusters.
>>
>> In terms of performance:
>>
>> I collected the logs from the above-mentioned 100-node test cluster,
>> and looked at the "SyncTimes" reported by the periodic FSEditLog
>> statistics printout. The active NN in this cluster is configured to
>> write to two local disks, and write shared edits to a
>> QuorumJournalManager. The QJM is configured with three JournalNodes,
>> all in the same rack as the active NN, and one on the same machine as
>> the active NN. The average sync times are: QJM: 6.6ms, Local disk 1:
>> 7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are:
>> QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the
>> quorum behavior achieves the goal of smoothing out latencies, since it
>> can proceed even if one of the underlying disks is temporarily slow.
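>>
>> (To make the smoothing effect concrete: with three JNs, a sync
>> commits once a majority, i.e. two of three, have acked, so the
>> effective latency is the second-fastest ack rather than the slowest.
>> A tiny illustrative sketch, not code from the branch:)
>>
>>   import java.util.Arrays;
>>
>>   public class QuorumLatencySketch {
>>     // With 3 JNs, a write commits at the second-fastest ack,
>>     // so one temporarily slow JN is masked.
>>     static double quorumLatencyMs(double... ackTimesMs) {
>>       double[] sorted = ackTimesMs.clone();
>>       Arrays.sort(sorted);
>>       return sorted[1];
>>     }
>>
>>     public static void main(String[] args) {
>>       // One JN stalls for 200ms; the quorum still commits at 7.1ms.
>>       System.out.println(quorumLatencyMs(5.7, 7.1, 200.0));
>>     }
>>   }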
>>
>> I also ran some manual tests by running a loop like the following
>> against one JournalNode:
>>
>>   while true; do
>>     sleep 0.1; kill -STOP $PID_OF_JN   # run ~100ms, then freeze
>>     sleep 0.5; kill -CONT $PID_OF_JN   # stay frozen ~500ms, then resume
>>   done
>>
>> This allows the JN to run only 100ms out of every 600ms. I applied a
>> client load to the NN and verified that the NN's operation latency
>> was unaffected even though one of the JNs was slowly falling behind.
>>
>> In terms of risk assessment for the merge:
>>
>> - This feature is entirely optional. The changes to existing code in
>> the branch are fairly minimal improvements to the support for
>> pluggable JournalManagers. We have been merging these changes into our
>> CDH nightly tree and performed all of our usual daily QA and
>> integration against these builds, so I feel confident that they do not
>> introduce any new risk.
>> - Even when the feature is enabled for shared edits, users will likely
>> continue to log edits to locally mounted FileJournalManagers in
>> addition to the shared QJM (see the example after this list). So, if
>> there is any bug in the QJM, the system metadata is not at risk.
>> - Overall, we have taken a conservative approach and favored
>> durability/correctness over availability. For example, we have many
>> extra sanity checks and assertions to check our assumptions, and if
>> any of them fail, we will abort rather than continue with a risk of
>> data loss.
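>>
>> (For the second point above: in practice this just means keeping the
>> usual local directories configured alongside the shared quorum. A
>> sketch, with placeholder paths:)
>>
>>   <!-- Local FileJournalManager dirs, retained alongside the QJM -->
>>   <property>
>>     <name>dfs.namenode.name.dir</name>
>>     <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
>>   </property>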
>>
>> So, in summary, I think the branch is very nearly ready to be merged
>> into trunk. I will continue to perform stress tests, but at this point
>> I am not aware of any deficiencies which would be considered a
>> blocker. If anyone has any questions about the code, design, or tests,
>> please feel free to chime in on HDFS-3077 or the relevant subtasks.
>>
>> Thanks
>> Todd
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera



-- 
Todd Lipcon
Software Engineer, Cloudera
