From: Andrew Purtell
Date: Mon, 6 Oct 2014 20:55:24 -0700
Subject: An important read
To: "dev@hbase.apache.org"

https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems

Yuan et al., University of Toronto

Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have lead to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

This is an interesting benefit of open source and an open development process. Please read this detailed analysis of availability and data loss bugs resulting from improper error handling, in HBase and other systems. The authors focus on a particular pattern of defect and cause.
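
As a rough illustration of what checking error handling code can look like, here is a minimal sketch. It is not Aspirator and not error-prone, and it is not taken from the paper. It assumes two illustrative patterns (a catch block with no statements, and a catch block still carrying a TODO/FIXME comment), and it scans Java source text with naive regex and brace matching rather than a real parser. All names in it (suspicious_handlers, handler_body, SAMPLE) are hypothetical.

```python
import re

# Matches "catch (...) {" so the handler body that follows can be inspected.
CATCH_HEADER = re.compile(r"catch\s*\([^)]*\)\s*\{")


def handler_body(source: str, open_brace: int) -> str:
    """Return the text between the brace at open_brace and its matching close.

    Naive: braces inside string literals or comments are counted too.
    """
    depth = 0
    for i in range(open_brace, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return source[open_brace + 1 : i]
    return source[open_brace + 1 :]  # unbalanced input: take the rest


def suspicious_handlers(source: str):
    """Yield (line_number, reason) for catch blocks that look unfinished."""
    for match in CATCH_HEADER.finditer(source):
        body = handler_body(source, match.end() - 1)
        line = source.count("\n", 0, match.start()) + 1
        # Drop comments to see whether any statement remains in the handler.
        code_only = re.sub(r"//[^\n]*|/\*.*?\*/", "", body, flags=re.S).strip()
        if not code_only:
            yield line, "empty handler"
        elif re.search(r"\b(TODO|FIXME)\b", body):
            yield line, "handler contains TODO/FIXME"


# Hypothetical Java fragment used only to exercise the sketch.
SAMPLE = """\
try {
  flush();
} catch (IOException e) {
}
try {
  close();
} catch (Exception e) {
  // TODO: handle this
  LOG.warn("close failed", e);
}
try {
  open();
} catch (IOException e) {
  throw new RuntimeException(e);
}
"""

for line_number, reason in suspicious_handlers(SAMPLE):
    print(f"line {line_number}: {reason}")
```

A text scan like this is only good for a first pass. A real check would work on the compiler's syntax tree, which avoids false matches in strings and comments.
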
The point is well taken. It would be worth taking time where possible to revisit exception handling, especially where we have low test coverage. Also, consider HBASE-11912. The static analyses mentioned in this paper could likely be implemented with error-prone.

Development and code review will always be uneven in a volunteer open source project. However, if we agree on some baseline practices, and those are amenable to static analysis, then we could, in effect, build validation of those practices into the compiler.

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)