From: Jeff Kubina
Date: Fri, 14 Jul 2017 14:45:05 -0400
Subject: Re: [DISCUSS] Periodic table exports
To: dev@accumulo.apache.org

Wouldn't it be better to have a utility method that reads all the splits
from the table's RFiles and outputs them to a file? We could then use the
file to recreate the table with the pre-existing splits.

--
Jeff Kubina
410-988-4436

On Fri, Jul 14, 2017 at 2:26 PM, Sean Busbey wrote:
> This could also be useful for botched upgrades (should we change stuff
> in meta again).
>
> Don't we already default replication of the blocks for the meta tables
> to something very high? Aren't the exported-to-HDFS things just as
> subject to block corruption, or more so if they use default
> replication?
>
> I think if we automate something like this, to Mike's point about set
> & pray, we'd also have to build in automated periodic checks on
> whether the stored information is useful, so that operators can be
> alerted.
>
> Can we sketch what testing looks like?
>
> Christopher, can you get some estimates on what kind of volume we're
> talking about here? Seems like it'd be small.
>
> On Fri, Jul 14, 2017 at 1:07 PM, Christopher wrote:
> > The problem is HDFS corrupt blocks which affect the metadata tables.
> > I don't know that this window is all that narrow. I've seen corrupt
> > blocks far more often than HDFS outages: some due to HDFS bugs, some
> > due to hardware failures and too few replicas, etc. We know how to
> > recover from corrupt blocks in user tables (accepting data loss) by
> > essentially replacing a corrupt file with an empty one. But we don't
> > really have a good way to recover when the corrupt blocks occur in
> > the metadata tables. That's what this would address.
> >
> > On Fri, Jul 14, 2017 at 1:47 PM, Mike Drob wrote:
> >> What's the risk that we are trying to address?
> >>
> >> Storing data locally won't help in case of a namenode failure.
> >> If you have a failure that's severe enough to actually kill blocks
> >> but not so severe that your HDFS goes down, that's a pretty narrow
> >> window.
> >>
> >> How do you test that your backups are good? That you haven't lost
> >> any data there? Or is it set and forget (and pray)?
> >>
> >> This seems like something that is not worthwhile to automate,
> >> because everybody is going to have such different needs. Write a
> >> blog post, then point people at existing backup/disaster-recovery
> >> solutions, including off-site storage, etc. If they're not already
> >> convinced that they need this, then their data likely isn't that
> >> valuable to begin with. And if this same problem happens multiple
> >> times to the same user... I don't think a periodic table export
> >> will help them.
> >>
> >> Mike
> >>
> >> On Fri, Jul 14, 2017 at 12:29 PM, Christopher wrote:
> >> > I saw a user running a very old version of Accumulo run into a
> >> > pretty severe failure, where they lost an HDFS block containing
> >> > part of their root tablet. This, of course, will cause a ton of
> >> > problems. Without the root tablet, you can't recover the metadata
> >> > table, and without that, you can't recover your user tables.
> >> >
> >> > Now, you can recover the RFiles, of course... but without knowing
> >> > the split points, you can run into all sorts of problems trying
> >> > to restore an Accumulo instance from just these RFiles.
> >> >
> >> > We have an export table feature which creates a snapshot of the
> >> > split points for a table, allowing a user to relatively easily
> >> > recover from a serious failure, provided the RFiles are
> >> > available. However, that requires a user to manually run it on
> >> > occasion, which of course does not happen by default.
> >> >
> >> > I'm interested to know what people think about possibly doing
> >> > something like this internally on a regular basis.
> >> > Maybe hourly by default, performed by the Master for all user
> >> > tables, and saved to a file in /accumulo on HDFS?
> >> >
> >> > The closest thing I can think of to this, which has saved me more
> >> > than once, is the way Chrome and Firefox back up open tabs and
> >> > bookmarks regularly, to restore from a crash.
> >> >
> >> > Users could already be doing this on their own, so it's not
> >> > really necessary to bake it in... but, as we all probably know,
> >> > people are really bad at customizing away from defaults.
> >> >
> >> > What are some of the issues and trade-offs of incorporating this
> >> > as a default feature? What are some of the issues we'd have to
> >> > address with it? What would its configuration look like? Should
> >> > it be on by default?
> >> >
> >> > Perhaps a simple blog post describing a custom user service
> >> > running alongside Accumulo which periodically runs "export table"
> >> > would suffice? (This is what I'm leaning towards, but the idea of
> >> > making it the default is compelling, given the number of times
> >> > I've seen users struggle to plan for or respond to catastrophic
> >> > failures, especially at the storage layer.)
>
> --
> busbey
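The utility Jeff describes at the top of the thread — dump a table's split points to a flat file, then reuse that file to recreate the table — could be sketched roughly as below. The file I/O is runnable as-is; the Accumulo API calls shown in comments (`listSplits`/`addSplits` on `TableOperations`) and all names like "mytable" are assumptions about the client API, not a vetted implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Persist a table's split points to a flat file so the table can be
// recreated later with the same splits, even if metadata is lost.
public class SplitBackup {

    // Write one split point per line, UTF-8.
    public static void writeSplits(List<String> splits, Path file) throws IOException {
        Files.write(file, splits, StandardCharsets.UTF_8);
    }

    // Read the split points back, e.g. to feed addSplits on a fresh table.
    public static List<String> readSplits(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    // With a live connector, the round trip would look roughly like
    // (API names are assumptions):
    //   Collection<Text> splits = connector.tableOperations().listSplits("mytable");
    //   writeSplits(splits.stream().map(Text::toString).collect(toList()), backupFile);
    //   ...later, after recreating the table...
    //   SortedSet<Text> restored = readSplits(backupFile).stream()
    //       .map(Text::new).collect(toCollection(TreeSet::new));
    //   connector.tableOperations().addSplits("mytable", restored);
}
```

Since the splits file is plain text, it is easy to inspect, diff between runs, and store with higher replication — which speaks to Sean's point about verifying that the stored information is actually usable.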
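The "custom user service running alongside Accumulo" that Christopher leans towards could be sketched as a small scheduler like the one below. Everything here is a hypothetical sketch: the export root, table name, interval, and the commented Accumulo calls (`offline`/`exportTable`/`online`) are assumptions, not project defaults or a tested design.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically export a table's metadata/splits to a dated directory,
// roughly the hourly cadence proposed in the thread.
public class PeriodicTableExport {

    static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMddHH");

    // Build a dated export directory, e.g. /accumulo-backups/mytable/2017071418
    public static String exportDir(String root, String table, LocalDateTime now) {
        return root + "/" + table + "/" + now.format(STAMP);
    }

    // Start an hourly export loop; caller owns the returned executor.
    public static ScheduledExecutorService start(String root, String table) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            String dir = exportDir(root, table, LocalDateTime.now());
            // With a live connector, roughly (API names are assumptions;
            // export requires the table to be offline first):
            //   connector.tableOperations().offline(table, true);
            //   connector.tableOperations().exportTable(table, dir);
            //   connector.tableOperations().online(table, true);
            System.out.println("would export " + table + " to " + dir);
        }, 0, 1, TimeUnit.HOURS);
        return ses;
    }
}
```

Keeping each snapshot in its own dated directory makes Sean's "are the backups any good?" check straightforward: a companion job can list the newest directory and alert if it is missing or stale.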