hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/GSoCFailureDetector" by AbmarBarros
Date Mon, 16 Aug 2010 15:21:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ZooKeeper/GSoCFailureDetector" page has been changed by AbmarBarros.
http://wiki.apache.org/hadoop/ZooKeeper/GSoCFailureDetector?action=diff&rev1=10&rev2=11

--------------------------------------------------

  
  ==== Experimental design ====
  
- ==== Results and conclusions ====
+  * '''First batch of tests''':
+   * 1 client and 1 server connected by a transcontinental link (Campina Grande, Brazil /
Newark, USA)
+   * link = 1MBps, 250ms
+   * timeout = 5000ms
+   * replication = 5
+   * used the following failure detectors:
+    * Fixed heartbeat
+    * Chen (alpha = 0, 500, 1000, 2000)
+    * Bertier (moderationstep = 0, 250, 500, 1000)
+    * Phi accrual (threshold = .5, 2, 4, 8)
  
+  * '''Second batch of tests''':
+   * 200 clients and 1 server connected over an emulated WAN on Emulab
+   * link = 2MBps, 250ms, message loss probability of 0.1 
+   * timeout = 5000ms
+   * used the following failure detectors with default parameters:
+    * Fixed heartbeat
+    * Chen (alpha = 1250)
+    * Bertier (moderationstep = 1000)
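
To make the phi accrual threshold values listed for the first batch more concrete, here is a minimal sketch of how a suspicion level can be computed, assuming heartbeat inter-arrival times are modelled as a normal distribution (the model used by Hayashibara et al.). The function names, the `min_std_ms` floor and the sample intervals are illustrative assumptions, not taken from the ZooKeeper patch.

```python
import math
import statistics


def phi(intervals_ms, elapsed_ms, min_std_ms=1.0):
    """Suspicion level for a peer silent for elapsed_ms.

    intervals_ms: recent heartbeat inter-arrival times (hypothetical sample window).
    min_std_ms:   assumed floor that avoids division by zero on perfectly regular arrivals.
    """
    mean = statistics.mean(intervals_ms)
    std = max(statistics.pstdev(intervals_ms), min_std_ms)
    # Probability that the next heartbeat arrives later than elapsed_ms
    # under a normal model: 1 - CDF(elapsed_ms).
    p_later = 0.5 * math.erfc((elapsed_ms - mean) / (std * math.sqrt(2)))
    if p_later <= 0.0:
        return math.inf  # tail probability underflowed; treat as certain failure
    return -math.log10(p_later)


def suspects(intervals_ms, elapsed_ms, threshold):
    """The peer is suspected once phi exceeds the configured threshold."""
    return phi(intervals_ms, elapsed_ms) > threshold
```

A larger threshold tolerates longer silences before suspecting the peer, so it trades detection time for fewer false suspicions.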
+ 
+ ==== Results ====
+ 
+ ==== Concluding remarks ====
+ 
+ As expected, we noticed that the fixed heartbeat method works well when we run ZooKeeper
in a controlled environment, where the network behavior is predictable. In these cases we can
tune the fixed timeout after some network analysis. However, in scenarios where the network
behavior changes, such as in a WAN, the adaptive methods can be a good pick. Below is
an overview of each failure detector:
+  * '''Fixed heartbeat''': On average, with default parameters, the fixed heartbeat strategy
had the highest detection time, but with no false suspicions. However, if the timeout is not
well chosen, failures may take a long time to be detected, or the false suspicion rate may
increase. As said before, this strategy is useful in a controlled environment,
in which the network can be characterized.
+  * '''Chen''': This strategy requires some assumptions about the network, since the administrator
needs to define the alpha parameter, the safety margin for the estimation. However, with
default parameters, the method of Chen et al. performed well in a WAN deployment. It managed to
decrease the average detection time with a low false suspicion rate.
+  * '''Bertier''': Bertier et al. originally proposed a failure detector that requires no assumptions
about the network, only a single moderation step that is added to the estimate when a heartbeat
arrives while the monitored process is in a suspected state. With these experiments, we have come
to the same conclusion as Hayashibara et al.: this failure detector is very sensitive to message
loss and to fluctuation in the arrival times of heartbeats. In this sense, the moderation step
turned out to be an important parameter for this failure detector. With a moderation step
of 1000, Bertier's failure detector reached a lower average detection time than Chen's
method, higher than the fixed heartbeat strategy; however, there were no false suspicions.
+  * '''Phi-accrual''':
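
The difference between a fixed timeout and an adaptive estimate with a safety margin, as described in the overview above, can be sketched as follows. This is a simplified illustration in the spirit of Chen et al.'s estimator (expected arrival computed from a window of past arrivals, plus alpha), not the code from the ZooKeeper patch. The class names, the `window` size and the 1000 ms heartbeat interval in the usage note are assumptions.

```python
from collections import deque


class FixedTimeoutDetector:
    """Suspects the peer once a fixed timeout has passed since the last heartbeat."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_arrival_ms = None

    def heartbeat(self, arrival_ms):
        self.last_arrival_ms = arrival_ms

    def deadline_ms(self):
        if self.last_arrival_ms is None:
            raise ValueError("no heartbeat received yet")
        return self.last_arrival_ms + self.timeout_ms

    def suspects(self, now_ms):
        return now_ms > self.deadline_ms()


class ChenStyleDetector:
    """Estimates the next arrival from recent heartbeats and adds alpha as a safety margin."""

    def __init__(self, interval_ms, alpha_ms, window=100):
        self.interval_ms = interval_ms        # heartbeat sending interval (assumed known)
        self.alpha_ms = alpha_ms              # safety margin set by the administrator
        self.samples = deque(maxlen=window)   # (sequence number, arrival time) pairs

    def heartbeat(self, seq, arrival_ms):
        self.samples.append((seq, arrival_ms))

    def deadline_ms(self):
        if not self.samples:
            raise ValueError("no heartbeat received yet")
        # Average offset between each arrival and its nominal send time.
        offset = sum(a - self.interval_ms * s for s, a in self.samples) / len(self.samples)
        next_seq = self.samples[-1][0] + 1
        expected_arrival = offset + next_seq * self.interval_ms
        return expected_arrival + self.alpha_ms

    def suspects(self, now_ms):
        return now_ms > self.deadline_ms()
```

With heartbeats sent every 1000 ms over a link with a steady 250 ms delay, the adaptive deadline for the next heartbeat falls well before the deadline of a 5000 ms fixed timeout, which is consistent with the lower detection times reported above for the adaptive methods.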
  ----
  == Design decisions ==
  
