From: Lars George
Date: Sun, 5 Feb 2017 11:22:22 +0100
Subject: Re: Health Script does not stop region server
To: "dev@hbase.apache.org"

Hi Ted,

This does not work on Mac as provided. I tried on a CentOS 6 machine,
and had to install net-snmp and net-snmp-utils, plus start snmpd, to
make it time out quicker. But even there the snmpwalk returned nothing,
making the script fail.

Anyhow, the snmpwalk failing after the retries is just an example of
what can happen if the health check script takes too long to fail. The
bottom line is that it does _not_ stop the server as expected, because
our check in the code is reset due to the chore's delay. That is a bug,
methinks.

Or, in other words, when I fixed the snmpwalk to come back quickly as
explained above, the error was caught in time and the server stopped as
expected.

Makes sense?

Lars

On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu wrote:

> Running the command from the script locally (on Mac):
>
> $ /usr/bin/snmpwalk -t 5 -Oe -Oq -Os -v 1 -c public localhost if
> Timeout: No Response from localhost
> $ echo $?
> 1
>
> Looks like the script should parse the output from snmpwalk and provide
> some hint if unexpected result is reported.
>
> Cheers
>
> On Sat, Feb 4, 2017 at 6:40 AM, Lars George wrote:
>
>> Hi,
>>
>> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
>> That caused the script to take a long time to error out, which exceed
>> the 10 seconds the check was meant to run. That resets the check and
>> it keeps reporting the error, but never stops the servers:
>>
>> 2017-02-04 05:55:08,962 INFO
>> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> hbase.HealthCheckChore: Health Check Chore runs every 10sec
>> 2017-02-04 05:55:08,975 INFO
>> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> hbase.HealthChecker: HealthChecker initialized with script at
>> /opt/hbase/bin/healthcheck.sh, timeout=60000
>>
>> ...
>>
>> 2017-02-04 05:55:50,435 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:55:50,436 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: CompactionChecker missed its start time
>> 2017-02-04 05:55:50,437 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore:
>> slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore
>> missed its start time
>> 2017-02-04 05:55:50,438 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:56:20,522 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:56:20,523 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:56:50,600 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:56:50,600 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:57:20,681 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:57:20,681 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:57:50,763 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:57:50,764 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:58:20,844 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:58:20,844 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:58:50,923 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:58:50,923 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> 2017-02-04 05:59:21,017 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec :
>> ERROR check link, OK: disks ok,
>>
>> 2017-02-04 05:59:21,018 INFO
>> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>>
>> That seems like a bug, no?
>>
>> Lars
>>
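
The reset described in this thread can be illustrated with a small Python sketch of a failure-window counter. This is a simplified model, not HBase's actual code: the function name `stop_decisions`, the threshold of 3 failures, and the 30-second window are assumptions chosen for illustration, since the thread does not state the configured values. The "slow" timestamps are the offsets between the first four "Health status" log entries quoted above.

```python
def stop_decisions(failure_times_ms, threshold, window_ms):
    """For each failed health check, decide whether the server would stop.

    Simplified model (an assumption, not the real implementation): failures
    are counted only while they fall inside a window that starts at the
    first failure. A failure arriving after the window has elapsed resets
    the count to 1 and starts a new window.
    """
    unhealthy_count = 0
    window_start = 0
    decisions = []
    for now in failure_times_ms:
        if unhealthy_count == 0:
            # First failure seen: record when the window starts.
            unhealthy_count = 1
            window_start = now
            stop = False
        elif now - window_start < window_ms:
            # Still inside the window: count it and compare to the threshold.
            unhealthy_count += 1
            stop = unhealthy_count >= threshold
        else:
            # Outside the window: reset, so earlier failures are forgotten.
            unhealthy_count = 1
            window_start = now
            stop = False
        decisions.append(stop)
    return decisions


# Hypothetical settings: stop after 3 failures within 30 seconds.
THRESHOLD = 3
WINDOW_MS = 30_000

# Slow script: failures reported about 30 seconds apart, as in the log.
slow = [0, 30_087, 60_165, 90_246]
# Fast script: failures reported every 10 seconds.
fast = [0, 10_000, 20_000]

print(stop_decisions(slow, THRESHOLD, WINDOW_MS))
print(stop_decisions(fast, THRESHOLD, WINDOW_MS))
```

Under these assumed settings the slow case never reaches the threshold, because every failure lands just outside the window opened by the previous one, while the fast case triggers a stop on the third failure. That matches the behaviour reported above: the server stopped only once snmpwalk failed quickly.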