Date: Thu, 1 Jun 2017 20:03:04 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-18143) [AMv2] Backoff on failed report of region transition quickly goes to astronomical time scale

     [ https://issues.apache.org/jira/browse/HBASE-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-18143:
--------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Pushed to master branch. Thanks for the review [~uagashe]

> [AMv2] Backoff on failed report of region transition quickly goes to astronomical time scale
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18143
>                 URL: https://issues.apache.org/jira/browse/HBASE-18143
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18143.master.001.patch, HBASE-18143.master.002.patch, HBASE-18143.master.002.patch
>
>
> Testing on a cluster w/ aggressive killing, if the Master is killed serially a few times such that it is offline a good while, regionservers that want to report a region transition pause too long between retries.
> Here is the regionserver reporting failures:
> {code}
> 2017-05-31 20:50:53,840 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#0) after 1008ms delay (Master is coming online...).
> 2017-05-31 20:50:54,853 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#1) after 2026ms delay (Master is coming online...).
> 2017-05-31 20:50:56,886 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#2) after 6084ms delay (Master is coming online...).
> 2017-05-31 20:51:02,976 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#3) after 30588ms delay (Master is coming online...).
> 2017-05-31 20:51:33,570 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#4) after 308422ms delay (Master is coming online...).
> 2017-05-31 20:56:41,997 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#5) after 6171203ms delay (Master is coming online...).
> {code}
> See how by the time we get to retry #5, we are waiting over 100 minutes (6171203ms) before we'll retry. That is too long. Make retries happen more frequently. Data is offline until the close is successfully reported.
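
For illustration only: the logged delays are roughly 1, 2, 6, 30, 300 and 6000 times a one-second base, so each retry grows much faster than simple doubling. A minimal sketch of one way to keep retries frequent is a capped exponential backoff, shown below. This is not the committed patch; the class name, the 1s base and the 10s cap are assumptions picked for the example, not HBase settings.

```java
/**
 * Hypothetical capped exponential backoff for retrying a failed report.
 * All names and constants here are illustrative assumptions, not HBase code.
 */
final class ReportRetryBackoff {
    // Assumed base pause, chosen to match the ~1s first delay seen in the log.
    static final long BASE_PAUSE_MS = 1_000L;
    // Assumed upper bound on any single pause between retries.
    static final long MAX_PAUSE_MS = 10_000L;

    private ReportRetryBackoff() {}

    /** Returns base * 2^attempt, never more than MAX_PAUSE_MS. */
    static long cappedDelayMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Clamp the shift so the left shift cannot overflow a long.
        int shift = Math.min(attempt, 20);
        return Math.min(MAX_PAUSE_MS, BASE_PAUSE_MS << shift);
    }

    public static void main(String[] args) {
        // Print the schedule for the same retry numbers that appear in the log.
        for (int attempt = 0; attempt <= 5; attempt++) {
            System.out.printf("retry (#%d) after %dms delay%n",
                attempt, cappedDelayMs(attempt));
        }
    }
}
```

With these assumed constants, no pause exceeds 10 seconds, so a regionserver would report the CLOSED transition within seconds of the Master coming back rather than sitting out a multi-minute delay.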