hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Health Script does not stop region server
Date Sun, 05 Feb 2017 14:57:54 GMT
Yesterday I tried snmpwalk on CentOS as well - same behavior.

Lars:
Can you file a JIRA to fix the bug?

Thanks

On Sun, Feb 5, 2017 at 2:22 AM, Lars George <lars.george@gmail.com> wrote:

> Hi Ted,
>
> This does not work on Mac as provided. I tried on a CentOS 6 machine,
> and had to install net-snmp and net-snmp-utils, plus start the snmpd
> to make it time out quicker. But even there the snmpwalk returned
> nothing, making the script fail.
>
> Anyhow, the snmpwalk failing after the retries is just an example of
> what can happen if the health check script takes too long to fail. The
> bottom line is that it does _not_ stop the server as expected as our
> check in the code is reset because of the chore's delay. That is a bug
> methinks.
>
> Or, in other words, when I fixed the snmpwalk to come back quickly as
> explained above, the error was caught in time and the server stopped
> as expected.
>
> Makes sense?
>
> Lars
>
> On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > Running the command from the script locally (on Mac):
> >
> > $ /usr/bin/snmpwalk -t 5 -Oe  -Oq  -Os -v 1 -c public localhost if
> > Timeout: No Response from localhost
> > $ echo $?
> > 1
> >
> > Looks like the script should parse the output from snmpwalk and provide
> > some hint if an unexpected result is reported.
> >
> > Cheers
> >
> > On Sat, Feb 4, 2017 at 6:40 AM, Lars George <lars.george@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
> >> That caused the script to take a long time to error out, which exceeded
> >> the 10 seconds the check was meant to run. That resets the check and
> >> it keeps reporting the error, but never stops the servers:
> >>
> >> 2017-02-04 05:55:08,962 INFO
> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
> >> hbase.HealthCheckChore: Health Check Chore runs every 10sec
> >> 2017-02-04 05:55:08,975 INFO
> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
> >> hbase.HealthChecker: HealthChecker initialized with script at
> >> /opt/hbase/bin/healthcheck.sh, timeout=60000
> >>
> >> ...
> >>
> >> 2017-02-04 05:55:50,435 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:55:50,436 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.ScheduledChore: Chore: CompactionChecker missed its start time
> >> 2017-02-04 05:55:50,437 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.ScheduledChore: Chore:
> >> slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore
> >> missed its start time
> >> 2017-02-04 05:55:50,438 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:56:20,522 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:56:20,523 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:56:50,600 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:56:50,600 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:57:20,681 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:57:20,681 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:57:50,763 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:57:50,764 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:58:20,844 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:58:20,844 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:58:50,923 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:58:50,923 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >> 2017-02-04 05:59:21,017 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
> >> hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec :
> >> ERROR check link, OK: disks ok,
> >>
> >> 2017-02-04 05:59:21,018 INFO
> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
> >>
> >> That seems like a bug, no?
> >>
> >> Lars
> >>
>
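
The reset described in this thread can be illustrated with a small model. The sketch below is not the HBase source. It assumes a check that stops the server once a threshold of failures falls inside a fixed time window, and that starts counting again from 1 when a failure arrives after the window has expired. The class name, the method names, the threshold of 3, and the 30 second window (3 x the 10 second chore period from the log) are all assumptions made for illustration. Under those assumptions, failures reported every 10 seconds trigger a stop, while failures reported every 30 seconds, as in the log above, never do.

```java
/**
 * Simplified model of a "stop after N failures inside a time window" check.
 * NOT the HBase implementation: all names and settings here are assumptions.
 */
class HealthWindowSketch {
    private final int threshold;         // failures needed to request a stop (assumed)
    private final long failureWindowMs;  // failures must fall inside this window (assumed)
    private int numTimesUnhealthy = 0;
    private long startWindowMs = 0;

    HealthWindowSketch(int threshold, long failureWindowMs) {
        this.threshold = threshold;
        this.failureWindowMs = failureWindowMs;
    }

    /** Records one failed check at time nowMs; returns true if a stop is requested. */
    boolean recordFailure(long nowMs) {
        if (numTimesUnhealthy == 0 || nowMs - startWindowMs >= failureWindowMs) {
            // First failure, or the previous window has expired: start over at 1.
            numTimesUnhealthy = 1;
            startWindowMs = nowMs;
            return false;
        }
        numTimesUnhealthy++;
        return numTimesUnhealthy >= threshold;
    }

    /**
     * Feeds `failures` failed checks spaced gapMs apart.
     * Returns the 1-based index of the failure that requests a stop, or -1 if none does.
     */
    static int firstStop(int threshold, long windowMs, long gapMs, int failures) {
        HealthWindowSketch sketch = new HealthWindowSketch(threshold, windowMs);
        for (int i = 0; i < failures; i++) {
            if (sketch.recordFailure(i * gapMs)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Script fails quickly: one failure per 10 s chore period.
        System.out.println(firstStop(3, 30_000L, 10_000L, 10)); // prints 3
        // Script takes about 30 s to fail, so failures arrive 30 s apart.
        System.out.println(firstStop(3, 30_000L, 30_000L, 10)); // prints -1
    }
}
```

In this model every slow failure lands at or beyond the end of the previous window, so the counter never gets past 1. That matches the behavior reported above, where the error is logged repeatedly but the server keeps running.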
