From: Christopher
Date: Fri, 14 Jul 2017 18:07:58 +0000
Subject: Re: [DISCUSS] Periodic table exports
To: Accumulo Dev List

The problem is corrupt HDFS blocks that affect the metadata tables. I don't
know that this window is all that narrow. I've seen corrupt blocks far more
often than full HDFS outages: some due to HDFS bugs, some due to hardware
failures combined with too few replicas, and so on.

We know how to recover from corrupt blocks in user tables (accepting data
loss) by essentially replacing the corrupt file with an empty one. But we
don't really have a good way to recover when the corrupt blocks occur in the
metadata tables. That's what this would address.
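To be concrete about the user-table case: the recovery amounts to moving the
corrupt file aside and writing an empty RFile at the same path, so the
existing metadata entry still resolves. Below is a rough, untested sketch
against the public RFile API (1.8+); the class name and the path handling
are just placeholders, and if I remember right there's also a CreateEmpty
utility in core that does essentially the same thing.

  import org.apache.accumulo.core.client.rfile.RFile;
  import org.apache.accumulo.core.client.rfile.RFileWriter;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReplaceWithEmptyRFile { // hypothetical helper, not in Accumulo
    public static void main(String[] args) throws Exception {
      // Full HDFS path of the corrupt file, ending in .rf,
      // e.g. .../tables/<id>/<tablet dir>/F0000xyz.rf
      String corrupt = args[0];
      FileSystem fs = FileSystem.get(new Configuration());

      // Move the corrupt file aside rather than deleting it, in case any
      // of it turns out to be recoverable later.
      fs.rename(new Path(corrupt), new Path(corrupt + ".corrupt"));

      // Write a brand new RFile at the original path and close it without
      // appending anything, which yields a valid zero-entry file.
      try (RFileWriter writer =
          RFile.newWriter().to(corrupt).withFileSystem(fs).build()) {
        writer.startDefaultLocalityGroup();
      }
    }
  }

That works for a user table because the metadata table still records which
files belong to which tablet. When the corruption hits the metadata or root
table, nothing is left that records the split points, which is exactly the
gap a periodic export would cover.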
On Fri, Jul 14, 2017 at 1:47 PM Mike Drob wrote:

> What's the risk that we are trying to address?
>
> Storing the data locally won't help in case of a namenode failure. And if
> you have a failure severe enough to actually lose blocks, but not so
> severe that it takes HDFS down entirely, that's a pretty narrow window.
>
> How do you test that your backups are good? That you haven't lost any
> data there? Or is it set-and-forget (and pray)?
>
> This seems like something that is not worthwhile to automate, because
> everybody is going to have such different needs. Write a blog post, then
> push people onto existing backup/disaster-recovery solutions, including
> off-site storage, etc. If they're not already convinced that they need
> this, then their data likely isn't that valuable to begin with. And if
> the same problem happens multiple times to the same user... I don't think
> a periodic table export will help them.
>
> Mike
>
> On Fri, Jul 14, 2017 at 12:29 PM, Christopher wrote:
>
> > I saw a user running a very old version of Accumulo run into a pretty
> > severe failure, where they lost an HDFS block containing part of their
> > root tablet. This, of course, causes a ton of problems. Without the
> > root tablet, you can't recover the metadata table, and without that,
> > you can't recover your user tables.
> >
> > Now, you can recover the RFiles, of course... but without knowing the
> > split points, you can run into all sorts of problems trying to restore
> > an Accumulo instance from just those RFiles.
> >
> > We have an export table feature which creates a snapshot of the split
> > points for a table, allowing a user to relatively easily recover from
> > a serious failure, provided the RFiles are available. However, it
> > requires a user to run it manually on occasion, which of course does
> > not happen by default.
> >
> > I'm interested to know what people think about possibly doing
> > something like this internally on a regular basis. Maybe hourly by
> > default, performed by the Master for all user tables, and saved to a
> > file in /accumulo on HDFS?
> >
> > The closest thing I can think of to this, which has saved me more than
> > once, is the way Chrome and Firefox back up open tabs and bookmarks
> > regularly, to restore from a crash.
> >
> > Users could already be doing this on their own, so it's not strictly
> > necessary to bake it in... but as we all probably know, people are
> > really bad at customizing away from defaults.
> >
> > What are some of the issues and trade-offs of incorporating this as a
> > default feature? What are some of the issues we'd have to address with
> > it? What would its configuration look like? Should it be on by
> > default?
> >
> > Perhaps a simple blog post describing a custom user service running
> > alongside Accumulo which periodically runs "export table" would
> > suffice? (This is what I'm leaning towards, but the idea of making it
> > a default is compelling, given the number of times I've seen users
> > struggle to plan for or respond to catastrophic failures, especially
> > at the storage layer.)
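P.S. To make the sidecar-service option concrete, the rough shape I'm
picturing is below. This is an untested sketch against the 1.x Connector
API; the instance name, ZooKeeper host, credentials, and the
/accumulo-exports directory are all placeholders. If I remember right,
export requires the table to be offline, so the sketch clones each user
table and exports the offline clone rather than touching the live table.

  import java.util.Collections;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  import org.apache.accumulo.core.client.ClientConfiguration;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;

  public class PeriodicTableExporter { // hypothetical sidecar, not in Accumulo

    public static void main(String[] args) throws Exception {
      Connector conn = new ZooKeeperInstance(ClientConfiguration.loadDefault()
          .withInstance("myInstance").withZkHosts("zkhost:2181"))
          .getConnector("backup_user", new PasswordToken("secret"));

      ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();
      scheduler.scheduleAtFixedRate(() -> exportAll(conn), 0, 1, TimeUnit.HOURS);
    }

    static void exportAll(Connector conn) {
      long ts = System.currentTimeMillis();
      for (String table : conn.tableOperations().list()) {
        if (table.startsWith("accumulo.")) {
          continue; // skip system tables; they can't be cloned/exported this way
        }
        String clone = table + "_export_" + ts;
        try {
          // Export requires an offline table, so clone, take the clone
          // offline, export it, then throw the clone away. The live table
          // is never taken offline.
          conn.tableOperations().clone(table, clone, true,
              Collections.emptyMap(), Collections.emptySet());
          conn.tableOperations().offline(clone, true);
          conn.tableOperations().exportTable(clone,
              "/accumulo-exports/" + table + "/" + ts);
          conn.tableOperations().delete(clone);
        } catch (Exception e) {
          // Log and keep going; one failed export shouldn't stop the rest.
          System.err.println("export of " + table + " failed: " + e);
        }
      }
    }
  }

Run from cron or as a small daemon next to the Master, that's most of the
blog post. Baking it into the Master itself would mostly mean putting that
loop behind a few properties (an enable flag, the interval, and the export
directory), which is where the configuration questions above come in.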