cloudstack-dev mailing list archives

From "Brenn Oosterbaan" <boosterb...@schubergphilis.com>
Subject Review Request: take into account potential NFS timeouts when determining if xenheartbeat timeout value has been met.
Date Wed, 27 Feb 2013 09:06:44 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9647/
-----------------------------------------------------------

Review request for cloudstack and Hugo Trippaers.


Description
-------

In some storage failure scenarios, the NFS timeout can cause writing the heartbeat to take
longer than expected. The script now compares the epoch of the last successful heartbeat with
the current epoch to determine whether the timeout value has been reached.
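
The comparison described above can be sketched roughly as follows. This is an illustrative
sketch only, not the code from the diff: the function names write_heartbeat and check_timeout
and their arguments are assumptions made for this example. The idea is that elapsed time is
derived from wall-clock epochs, so a write that blocks for the duration of an NFS timeout
still counts toward the heartbeat timeout.

```shell
#!/bin/bash
# Illustrative sketch only: the function and variable names here are
# assumptions for this example and are not taken from xenheartbeat.sh.

# Try to write the current epoch to the heartbeat file.
# Returns non-zero if the write fails; on a hung NFS mount the failure
# may only be reported after the NFS timeout has passed.
write_heartbeat() {
    local hb_file=$1
    { date +%s > "$hb_file"; } 2>/dev/null
}

# Called after a failed write. Compares the epoch of the last successful
# heartbeat with the current epoch and prints "warn <elapsed>" while the
# timeout has not been reached, or "reboot <elapsed>" once it has.
check_timeout() {
    local last_ok=$1 now=$2 timeout=$3
    local elapsed=$(( now - last_ok ))
    if [ "$elapsed" -ge "$timeout" ]; then
        echo "reboot $elapsed"
    else
        echo "warn $elapsed"
    fi
}
```

With a 120-second timeout, `check_timeout 1000 1083 120` prints `warn 83` and
`check_timeout 1000 1129 120` prints `reboot 129`. A caller would record `date +%s` after
each successful write_heartbeat and pass that value as the first argument on failure.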


Diffs
-----

  scripts/vm/hypervisor/xenserver/xenheartbeat.sh 5edacf7 

Diff: https://reviews.apache.org/r/9647/diff/


Testing
-------

Tested on hostxxx with an empty heartbeat file:
Feb 26 21:54:13 hostxxx heartbeat: Problem with heartbeat, no iSCSI or NFS mount defined in
/opt/xensource/bin/heartbeat!

Tested on hostxxx with a 120-second timeout value by causing a storage failover (hits NFS
timeout):
Feb 26 08:04:15 hostxxx heartbeat: Potential problem with /var/run/sr-mount/d392d770-330b-bdbf-9c07-e1c38af81c6e/hb-faecefb3-9ac0-47a2-b0fb-ae383762ba13:
not reachable since 18 seconds
Feb 26 08:04:48 hostxxx heartbeat: Potential problem with /var/run/sr-mount/d392d770-330b-bdbf-9c07-e1c38af81c6e/hb-faecefb3-9ac0-47a2-b0fb-ae383762ba13:
not reachable since 51 seconds
Feb 26 08:05:20 hostxxx heartbeat: Potential problem with /var/run/sr-mount/d392d770-330b-bdbf-9c07-e1c38af81c6e/hb-faecefb3-9ac0-47a2-b0fb-ae383762ba13:
not reachable since 83 seconds
The storage failover completed within the 120-second timeout value, so the host was not rebooted.

Tested on hostxxx with a 120-second timeout by removing the storage altogether (hits NFS timeout):
Feb 26 10:08:52 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 32 seconds
Feb 26 10:09:24 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 64 seconds
Feb 26 10:09:57 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 97 seconds
Feb 26 10:10:29 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 129 seconds
Feb 26 10:10:29 hostxxx heartbeat: Problem with /var/run/sr-mount/test/hb-test: not reachable
since 129 seconds, rebooting system!

Tested on hostxxx with a 120-second timeout by removing write permissions on the storage (does
not hit NFS timeout):
Feb 26 10:22:13 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 5 seconds
Feb 26 10:22:18 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 10 seconds
Feb 26 10:22:23 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 15 seconds
Feb 26 10:22:28 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 20 seconds
Feb 26 10:22:33 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 25 seconds
Feb 26 10:22:38 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 30 seconds
Feb 26 10:22:43 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 35 seconds
Feb 26 10:22:48 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 40 seconds
Feb 26 10:22:53 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 45 seconds
Feb 26 10:22:58 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 50 seconds
Feb 26 10:23:03 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 55 seconds
Feb 26 10:23:08 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 60 seconds
Feb 26 10:23:13 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 65 seconds
Feb 26 10:23:18 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 70 seconds
Feb 26 10:23:23 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 75 seconds
Feb 26 10:23:28 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 80 seconds
Feb 26 10:23:33 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 85 seconds
Feb 26 10:23:38 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 90 seconds
Feb 26 10:23:43 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 95 seconds
Feb 26 10:23:48 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 100 seconds
Feb 26 10:23:53 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 105 seconds
Feb 26 10:23:58 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 110 seconds
Feb 26 10:24:03 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 115 seconds
Feb 26 10:24:08 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test:
not reachable since 120 seconds
Feb 26 10:24:08 hostxxx heartbeat: Problem with /var/run/sr-mount/test/hb-test: not reachable
for 120 seconds, rebooting system!


Thanks,

Brenn Oosterbaan

