drill-dev mailing list archives

From Timothy Chen <tnac...@gmail.com>
Subject Fwd: An important read
Date Tue, 07 Oct 2014 16:44:30 GMT
I think this is relevant for us; we should consider running the analysis tool too.


---------- Forwarded message ----------
From: Stack <stack@duboce.net>
Date: Tue, Oct 7, 2014 at 8:10 AM
Subject: Re: An important read
To: HBase Dev List <dev@hbase.apache.org>

Nkeywal points out HBASE-10452 has fixes for problems found by the
Aspirator tool mentioned in the paper.

I filed HBASE-12187, "Review in source the paper 'Simple Testing Can Prevent
Most Critical Failures'", as a critical against 1.0. Let's run through their
list of 'catastrophic failures' before we cut the 1.0 release.


On Mon, Oct 6, 2014 at 8:55 PM, Andrew Purtell <apurtell@apache.org> wrote:

> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
> Simple Testing Can Prevent Most Critical Failures: An Analysis of
> Production Failures in Distributed Data-intensive Systems
> Yuan et al., University of Toronto
> Large, production quality distributed systems still fail periodically, and
> do so sometimes catastrophically, where most or all users experience an
> outage or data loss. We present the result of a comprehensive study
> investigating 198 randomly selected, user-reported failures that occurred
> on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop
> MapReduce, and Redis, with the goal of understanding how one or multiple
> faults eventually evolve into a user-visible failure. We found that from a
> testing point of view, almost all failures require only 3 or fewer nodes to
> reproduce, which is good news considering that these services typically run
> on a very large number of nodes. However, multiple inputs are needed to
> trigger the failures with the order between them being important. Finally,
> we found the error logs of these systems typically contain sufficient data
> on both the errors and the input events that triggered the failure,
> enabling the diagnosis and the reproduction of the production failures.
> We found the majority of catastrophic failures could easily have been
> prevented by performing simple testing on error handling code – the last
> line of defense – even without an understanding of the software design. We
> extracted three simple rules from the bugs that have led to some of the
> catastrophic failures, and developed a static checker, Aspirator, capable
> of locating these bugs. Over 30% of the catastrophic failures would have
> been prevented had Aspirator been used and the identified bugs fixed.
> Running Aspirator on the code of 9 distributed systems located 143 bugs and
> bad practices that have been fixed or confirmed by the developers.
>
> This is an interesting benefit of open source and an open development
> process. Please read this detailed analysis of availability and data loss
> bugs resulting from improper error handling, in HBase and other systems.
> The authors focus on a particular pattern of defect and cause. The point is
> well taken. It would be worth taking time where possible to revisit
> exception handling, especially where we have low test coverage.
> Also, consider HBASE-11912. The static analyses mentioned in this paper
> could likely be implemented with error-prone. Development and code review
> will always be uneven in a volunteer open source project. However, if we
> agree on some baseline practices, and those are amenable to static
> analysis, then we could build validation of those practices into the
> compiler, in effect.
> --
> Best regards,
>    - Andy
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
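
To make the kind of check discussed above concrete, here is a rough sketch
of a text-based scan over Java source. It is not Aspirator and not an
error-prone check; it is a toy heuristic written for illustration. The three
patterns it looks for are the rules as I understand them from the paper: a
handler that is empty or only logs, a handler that aborts on an overly
general exception, and a handler containing TODO or FIXME. All names here
(find_catch_blocks, check, SAMPLE) are hypothetical, and the brace matching
ignores braces inside strings and comments, so treat any hit as a hint for
manual review rather than a confirmed bug.

```python
import re

# Toy heuristic, not Aspirator: scans Java source text for suspicious handlers.
CATCH_RE = re.compile(r"catch\s*\(([^)]*)\)\s*\{")
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
LOG_CALL_RE = re.compile(r"(?:LOG|log|logger)\.\w+\s*\(")  # assumed logger names
ABORT_RE = re.compile(r"System\.exit\s*\(|\babort\s*\(")   # assumed abort calls
GENERAL_TYPES = {"Exception", "Throwable"}


def find_catch_blocks(source):
    """Yield (declaration, body, line_number) for each catch block found."""
    for m in CATCH_RE.finditer(source):
        depth, i = 1, m.end()
        # Naive brace matching; braces in strings or comments will confuse it.
        while i < len(source) and depth > 0:
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            continue  # unbalanced braces, skip this block
        line = source.count("\n", 0, m.start()) + 1
        yield m.group(1).strip(), source[m.end():i - 1], line


def check(source):
    """Return a list of (line_number, message) findings."""
    findings = []
    for decl, body, line in find_catch_blocks(source):
        if re.search(r"\b(TODO|FIXME)\b", body):
            findings.append((line, "TODO/FIXME in handler"))
        code = COMMENT_RE.sub("", body).strip()
        statements = [s.strip() for s in code.split(";") if s.strip()]
        if not statements:
            findings.append((line, "empty handler"))
        elif all(LOG_CALL_RE.match(s) for s in statements):
            findings.append((line, "log-only handler"))
        caught = set(re.split(r"[\s|]+", decl))
        if caught & GENERAL_TYPES and ABORT_RE.search(code):
            findings.append((line, "abort on general exception"))
    return findings


# Made-up Java snippet used only to exercise the checker.
SAMPLE = "\n".join([
    "class Demo {",
    "  void a() {",
    "    try { work(); } catch (IOException e) { }",
    '    try { work(); } catch (IOException e) { LOG.warn("failed", e); }',
    '    try { work(); } catch (Throwable t) { LOG.fatal("x", t); System.exit(1); }',
    "    try { work(); } catch (IOException e) { // TODO handle",
    "      retry();",
    "    }",
    "    try { work(); } catch (IOException e) { throw new RuntimeException(e); }",
    "  }",
    "}",
])

if __name__ == "__main__":
    for line, message in check(SAMPLE):
        print(f"line {line}: {message}")
```

A real check would work on the compiler's syntax tree rather than on raw
text, which is why something like error-prone looks like the better home for
it. The sketch is only meant to show how small the rules are.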
