From: OpenInx
Date: Tue, 23 Jul 2019 22:30:15 +0800
Subject: Re: TimeoutException on Snapshots
To: Hbase-User

> My question is: is it safe to ignore these TimeoutExceptions? if the
> SnapshotRegionManifests are not being written due to a timeout does that
> mean we are losing data or getting inconsistencies?

I don't think ignoring the TimeoutException is a good idea. You need to find
out why the snapshot is timing out. I ran into a similar case in my own
cluster: it has 30 RS but the table has 1600 regions, which means each RS
needs to flush and write the region manifests for 50+ regions, while the
SnapshotSubprocedurePool has a default thread size of 3. It's easy to time
out in that case because the pool is too small.

So I increased the config values below (note that increasing
'hbase.snapshot.master.timeout.millis' alone is not enough, because the RS
can also time out) and did a rolling update of the cluster. It works pretty
well for me now.

hbase.snapshot.master.timeout.millis=1200000
hbase.snapshot.region.timeout=1200000
hbase.snapshot.region.concurrentTasks=20

Hope this is helpful for you, Arwin.

On Tue, Jul 23, 2019 at 3:55 PM Arwin Tio wrote:

> Hi all,
>
> I've been running into these issues after restoring from snapshots:
>
> https://issues.apache.org/jira/browse/HBASE-16464
> https://issues.apache.org/jira/browse/HBASE-17992
>
> Essentially, HRegion#addRegionToSnapshot has been timing out in
> TakeSnapshotHandler, resulting in some leftover tmp files. The leftover tmp
> files causes archivedHFileCleaner, which manifests in an extremely large
> archive folder that doesn't get cleaned up.
>
> HBASE-16464 solves the bloating archive folder by preventing the
> SnapshotRegionManifest from being written if the operation has timed out
> (see:
> https://github.com/apache/hbase/commit/ab011391ab392f1a62b6ea9bdca87fc950af42a9#diff-4ec74c1b12f2be4f52c33260fd8b73efR86
> )
>
> My question is: is it safe to ignore these TimeoutExceptions? if the
> SnapshotRegionManifests are not being written due to a timeout does that
> mean we are losing data or getting inconsistencies?
>
> If so, what are some potential remedies for this? I'm thinking we can just
> increase the timeout 'hbase.snapshot.master.timeout.millis' but is there a
> better way?
>
> Thanks
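
A rough way to check the sizing argument in the reply (30 RS, 1600 regions, a pool of 3 threads versus 20) is to count how many sequential rounds each region server needs to work through its regions. The sketch below is plain Java with no HBase dependency. The class and method names are made up for illustration, and it assumes regions are spread evenly across servers and that each per-region task takes about the same time, which a real cluster may not satisfy.

```java
// Hypothetical helper, not part of HBase: a back-of-the-envelope model only.
class SnapshotRounds {
    // Ceiling division for positive integers.
    static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    // Regions per region server, assuming an even spread (an assumption).
    static int regionsPerServer(int regions, int servers) {
        return ceilDiv(regions, servers);
    }

    // Sequential rounds a fixed-size thread pool needs to cover those regions.
    static int rounds(int regionsPerServer, int poolThreads) {
        return ceilDiv(regionsPerServer, poolThreads);
    }

    public static void main(String[] args) {
        int perServer = regionsPerServer(1600, 30);
        System.out.println("regions per RS:        " + perServer);
        System.out.println("rounds with 3 threads:  " + rounds(perServer, 3));
        System.out.println("rounds with 20 threads: " + rounds(perServer, 20));
    }
}
```

Under these assumptions each RS handles about 54 regions, which takes 18 rounds with 3 threads and 3 rounds with 20. That is consistent with the reply's point that the pool size, and not only the master-side timeout, decides whether the region-side work finishes in time.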