hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/GSoCFailureDetector" by AbmarBarros
Date Mon, 16 Aug 2010 21:28:51 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ZooKeeper/GSoCFailureDetector" page has been changed by AbmarBarros.
http://wiki.apache.org/hadoop/ZooKeeper/GSoCFailureDetector?action=diff&rev1=13&rev2=14

--------------------------------------------------

   * Made Chen's alpha parameter configurable, instead of fixing it at a quarter of the timeout
  
  ==== 16/Aug/10 ====
+  * Refactored the way default values are passed to failure detectors
   * Finished experimentation and wrote the experiment report
  
  == Experimentation ==
@@ -114, +115 @@

  
   * '''First batch of tests''':
    * 1 client and 1 server connected by a transcontinental link (Campina Grande-Brazil / Newark-USA)
+   * client sending async operations to server 
    * client running for 10 min (average)
    * link = 1MBps, 250ms
    * timeout = 5000ms
    * replication = 5
    * used the following failure detectors:
     * Fixed heartbeat
-    * Chen (alpha = 0, 500, 1000, 2000)
+    * Chen (alpha = [0, 1000])
-    * Bertier (moderationstep = 0, 250, 500, 1000)
+    * Bertier (moderationstep = [0, 1000])
-    * Phi accrual (threshold = .5, 2, 4, 8)
+    * Phi accrual (threshold = [.5, 8]; minwindowsize=50)
  
   * '''Second batch of tests''':
    * 200 clients and 1 server connected in an emulated WAN in emulab
+   * client sending async operations to server
    * clients running for 10 min (average)
    * link = 2MBps, 250ms, message loss probability of 0.1 
    * timeout = 5000ms
@@ -137, +140 @@

  
  ==== Results ====
   * '''First batch of tests''':
+    || Method || Average detection time || Stddev of the detection time || False suspicions ||
+    || Fixedhb || 4731.8 || 299.6985 || 0/5 ||
+    || Chen (alpha=0) || - || - || 5/5 ||
+    || Chen (alpha=1000) || 1810.8 || 347.3632 || 0/5 ||
+    || Bertier (moderation step = 0) || 784.6 || 483.5642 || 0/5 ||
+    || Bertier (moderation step = 1000) || 1228.2 || 804.5773 || 0/5 ||
+    || Phi accrual (threshold = 0.5) || 714.6667 || 521.9745 || 2/5 ||
+    || Phi accrual (threshold = 8) || 1574.75 || 602.7799 || 1/5 ||
  
   * '''Second batch of tests''':
     * In these tests, the Fixed heartbeat and Bertier strategies produced no false suspicions.
With the given alpha, Chen's method produced 13/200 false suspicions, and the Phi-accrual,
with the windowminsize parameter equal to 0, falsely suspected all the clients.
Below, we show the average detection time of all methods except the Phi-accrual:
- 
-    * {{http://www2.lsd.ufcg.edu.br/~abmar/zk/fd-comparison.png}}
+    {{http://www2.lsd.ufcg.edu.br/~abmar/zk/fd-comparison.png}}
- 
-    * The Phi-accrual method must be evaluated again with a better windowminsize parameter
and in a scenario with larger duration, so the warm-up period is not considered.   
+    * The Phi-accrual method must be evaluated again with a better windowminsize parameter
in a scenario with a longer duration, so that the warm-up period is not considered.
  
  ==== Concluding remarks ====
  
  As expected, we noticed that the fixed heartbeat method works well when we run ZooKeeper
in a controlled environment, where the network behavior is predictable. In these cases we can
tune the fixed timeout after some network analysis. However, in scenarios where the network
behavior changes, such as in a WAN, the adaptive methods can be a good pick. Below
is an overview of each failure detector:
   * '''Fixed heartbeat''': On average, with default parameters, the fixed heartbeat strategy
had the highest detection time, but with no false suspicions. However, if the timeout is not
well chosen, failures may take a long time to be detected, or the false suspicion rate may
increase. As said before, this strategy is useful in a controlled environment,
in which the network can be characterized.
   * '''Chen''': This strategy requires some assumptions about the network, since the administrator
needs to define the alpha parameter, the safety margin for the estimation. However, with
default parameters, the method of Chen et al. performed well in a WAN deployment. It managed to
decrease the average detection time with a low false suspicion rate.
-  * '''Bertier''': Bertier et al. initially proposed a failure detector that requires no assumption
about the network except a single moderation step, which is added to the estimation when a heartbeat
is received while the monitored process is in a suspected state. With these experiments, we have come
to the same conclusion as Hayashibara et al.: this failure detector is very sensitive to message
loss and to fluctuation in the arrival times of heartbeats. In this sense, the moderation step
turned out to be an important parameter for this failure detector. With a moderation step
of 1000, Bertier's failure detector reached a higher average detection time than Chen's
method, but lower than the fixed heartbeat strategy. It is worth mentioning that Bertier's
failure detector was primarily designed to be used over local area networks (LANs), that is,
environments wherein messages are seldom lost.
+  * '''Bertier''': Bertier et al. initially proposed a failure detector that requires no assumption
about the network except a single moderation step, which is added to the estimation when a heartbeat
is received while the monitored process is in a suspected state. With these experiments, we have come
to the same conclusion as Hayashibara et al.: this failure detector is very sensitive to message
loss and to fluctuation in the arrival times of heartbeats. In this sense, the moderation step
turned out to be an important parameter for this failure detector. With a moderation step
of 1000, Bertier's failure detector reached a higher average detection time than Chen's
method, but lower than the fixed heartbeat strategy. It is worth mentioning that Bertier's
failure detector was primarily designed to be used over local area networks (LANs), that is,
environments wherein messages are seldom lost. As we could see, with a single client Bertier's
method stands out with a low detection time and no false suspicions, even with the moderation
step equal to 0.
   * '''Phi-accrual''': The phi-accrual is the method that requires the least information about
the network behavior. However, it relies on a large sampling window to produce a good estimation.
As we could see, in the experiments where no minimum window size was used, there was a large
number of false suspicions. The effect of the threshold is only noticeable when there is some
deviation from the average. The phi-accrual stands out in a WAN with unknown behavior, but
it is mandatory to set a good (high) initial timeout value for the warm-up period of the method,
which lasts until the minimum window size is reached.
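
The warm-up behavior described for the phi-accrual can be sketched roughly as follows. This is only an illustration, not the code used in the experiments: the class name `PhiAccrualSketch`, its parameters, and the `max_window_size` cap are hypothetical, and heartbeat inter-arrival times are assumed to follow a normal distribution, as in the usual formulation of the phi-accrual detector. Until the minimum window size is reached, the sketch falls back to a fixed initial timeout.

```python
import math
import statistics


class PhiAccrualSketch:
    """Illustrative phi-accrual detector with a warm-up fallback (hypothetical API)."""

    def __init__(self, threshold, min_window_size, initial_timeout_ms,
                 max_window_size=1000):
        self.threshold = threshold
        self.min_window_size = min_window_size
        self.initial_timeout_ms = initial_timeout_ms
        self.max_window_size = max_window_size
        self.intervals = []           # recent heartbeat inter-arrival times (ms)
        self.last_arrival_ms = None

    def heartbeat(self, now_ms):
        """Record a heartbeat arrival and update the sampling window."""
        if self.last_arrival_ms is not None:
            self.intervals.append(now_ms - self.last_arrival_ms)
            del self.intervals[:-self.max_window_size]  # keep the newest samples
        self.last_arrival_ms = now_ms

    def warmed_up(self):
        # At least two samples are needed to estimate a standard deviation.
        return len(self.intervals) >= max(self.min_window_size, 2)

    def phi(self, now_ms):
        """Suspicion level: -log10 of P(next heartbeat arrives even later)."""
        elapsed = now_ms - self.last_arrival_ms
        mean = statistics.mean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-3)  # avoid dividing by zero
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))  # clamp to avoid log10(0)

    def suspects(self, now_ms):
        if self.last_arrival_ms is None:
            return False
        if not self.warmed_up():
            # Warm-up period: use the fixed initial timeout instead of phi.
            return now_ms - self.last_arrival_ms > self.initial_timeout_ms
        return self.phi(now_ms) > self.threshold


# Two detectors that differ only in the minimum window size.
no_window = PhiAccrualSketch(threshold=8, min_window_size=0, initial_timeout_ms=5000)
with_window = PhiAccrualSketch(threshold=8, min_window_size=50, initial_timeout_ms=5000)
for t in (0, 1000, 2010):             # three heartbeats, two interval samples
    no_window.heartbeat(t)
    with_window.heartbeat(t)

# The next heartbeat is 300 ms late. With only two samples the estimated
# deviation is tiny, so the detector without a minimum window suspects at once.
print(no_window.suspects(3310))       # True
print(with_window.suspects(3310))     # False (still within the 5000 ms timeout)
```

With so few samples, a small delay looks like a large deviation from the average, which is consistent with the false suspicions observed when no minimum window size was used.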
    
  ----
