hbase-dev mailing list archives

From Lars George <lars.geo...@gmail.com>
Subject Re: Health Script does not stop region server
Date Sat, 11 Feb 2017 10:09:44 GMT
Will do, thanks


On Sun, Feb 5, 2017 at 3:57 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> Yesterday I tried snmpwalk on CentOS as well - same behavior.
>
> Lars:
> Can you file a JIRA to fix the bug ?
>
> Thanks
>
> On Sun, Feb 5, 2017 at 2:22 AM, Lars George <lars.george@gmail.com> wrote:
>
>> Hi Ted,
>>
>> This does not work on Mac as provided. I tried on a CentOS 6 machine,
>> and had to install net-snmp and net-snmp-utils, plus start the snmpd
>> to make it time out quicker. But even there the snmpwalk returned
>> nothing, making the script fail.
>>
>> Anyhow, the snmpwalk failing after the retries is just an example of
>> what can happen if the health check script takes too long to fail. The
>> bottom line is that it does _not_ stop the server as expected as our
>> check in the code is reset because of the chore's delay. That is a bug
>> methinks.
>>
>> Or, in other words, when I fixed the snmpwalk to come back quickly as
>> explained above, the error was caught in time and the server stopped
>> as expected.
>>
>> Makes sense?
>>
>> Lars
>>
>> On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > Running the command from the script locally (on Mac):
>> >
>> > $ /usr/bin/snmpwalk -t 5 -Oe  -Oq  -Os -v 1 -c public localhost if
>> > Timeout: No Response from localhost
>> > $ echo $?
>> > 1
>> >
>> > Looks like the script should parse the output from snmpwalk and provide
>> > some hint if an unexpected result is reported.
>> >
>> > Cheers
>> >
>> > On Sat, Feb 4, 2017 at 6:40 AM, Lars George <lars.george@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
>> >> That caused the script to take a long time to error out, which exceeded
>> >> the 10 seconds the check was meant to run. That resets the check and
>> >> it keeps reporting the error, but never stops the servers:
>> >>
>> >> 2017-02-04 05:55:08,962 INFO
>> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> >> hbase.HealthCheckChore: Health Check Chore runs every 10sec
>> >> 2017-02-04 05:55:08,975 INFO
>> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> >> hbase.HealthChecker: HealthChecker initialized with script at
>> >> /opt/hbase/bin/healthcheck.sh, timeout=60000
>> >>
>> >> ...
>> >>
>> >> 2017-02-04 05:55:50,435 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:55:50,436 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: CompactionChecker missed its start time
>> >> 2017-02-04 05:55:50,437 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore:
>> >> slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore
>> >> missed its start time
>> >> 2017-02-04 05:55:50,438 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:20,522 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:20,523 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:50,600 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:50,600 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:20,681 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:20,681 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:50,763 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:50,764 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:58:20,844 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:58:20,844 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:58:50,923 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:58:50,923 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:59:21,017 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:59:21,018 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >>
>> >> That seems like a bug, no?
>> >>
>> >> Lars
>> >>
>>
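
The reset Lars describes in the thread can be reproduced with a small model. This is only a sketch, not the HBase source: it assumes the chore counts consecutive failures inside a window of threshold x period, and it assumes a threshold of 3. The thread itself only gives the 10-second period and the roughly 30-second gap between health reports. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(check_duration_s, period_s=10, threshold=3, runs=8):
    """Return the run number at which the server would stop, or None.

    Simplified model (assumption): a failure only counts toward the
    threshold if it arrives within `threshold * period_s` seconds of the
    first failure in the current window; otherwise the count restarts.
    """
    window_s = threshold * period_s  # assumed failure window
    now = 0.0
    unhealthy = 0
    window_start = 0.0
    for run in range(1, runs + 1):
        # The next check cannot finish sooner than the script takes to fail.
        now += max(period_s, check_duration_s)
        if unhealthy == 0:
            # First failure seen: open the window.
            unhealthy = 1
            window_start = now
        elif now - window_start < window_s:
            # Failure inside the window: count it.
            unhealthy += 1
            if unhealthy == threshold:
                return run
        else:
            # Failure outside the window: the count is reset.
            unhealthy = 1
            window_start = now
    return None


print(simulate(check_duration_s=30))  # slow script, as in the log
print(simulate(check_duration_s=1))   # script that fails quickly
```

Under these assumptions a script that takes 30 seconds to fail never reaches the threshold, because every failure lands outside the window opened by the previous one. A script that fails within a second stops the server on the third check, which is consistent with what Lars saw once snmpwalk came back quickly.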

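Ted's suggestion that the script should give a hint on unexpected results, together with the observation that a slow failure is what triggers the problem, points at putting a hard deadline around each probe. The following is a minimal sketch under stated assumptions: `run_check` is a hypothetical helper, not part of HBase or of the supplied `healthcheck.sh`, and the "ERROR ..." / "OK: ..." report strings only imitate the style of the log lines quoted above.

```python
import subprocess
import sys


def run_check(cmd, timeout_s):
    """Run one probe command and return a one-line report.

    Hypothetical helper: the report starts with "ERROR" on any failure
    (timeout, missing command, non-zero exit, empty output) and with
    "OK:" otherwise.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed once the deadline passes.
        return f"ERROR probe timed out after {timeout_s}s"
    except FileNotFoundError:
        return f"ERROR probe command not found: {cmd[0]}"

    output = (result.stdout + result.stderr).strip()
    if result.returncode != 0:
        return f"ERROR probe exited with {result.returncode}: {output}"
    if not output:
        # Covers the case where the probe succeeds but prints nothing.
        return "ERROR probe returned no output"
    return "OK: probe succeeded"


if __name__ == "__main__":
    # Stand-in probe so the sketch runs anywhere; a real check would pass
    # the snmpwalk command line from the thread instead.
    slow_probe = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_check(slow_probe, timeout_s=0.5))
```

Keeping the deadline well below the chore period means a hung probe is reported as a failure quickly instead of delaying the next check.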