hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Virtual Hadoop" by JunpingDu
Date Wed, 22 Aug 2012 08:04:01 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Virtual Hadoop" page has been changed by JunpingDu:
http://wiki.apache.org/hadoop/Virtual%20Hadoop?action=diff&rev1=10&rev2=11

  
  == Implications ==
  
- Ignoring low-level networking/clock issues, what does this mean?
+ Ignoring low-level networking/clock issues, what does this mean? (The points below hold for some cloud vendors; other vendors, or a virtualized infrastructure you own yourself, may differ.)
  
-  1. When you request a VM, it's performance may vary from previous requests. This can be
due to CPU differences, or the other workloads.
+  1. When you request a VM, its performance may vary from previous requests (unless a strong SLA restricts this, as with Elastic...). This can be due to CPU differences, or to other workloads on the same host.
-  1. There is no point writing topology scripts.
+  1. There is no point writing topology scripts unless the cloud vendor exposes the physical topology to you in some way (a sketch of such a script follows this list).
   1. All network ports must be closed by way of firewall and routing information, apart from those ports critical for Hadoop, which must then run with security on (a firewall sketch follows this list).
   1. All data you wish to keep must be kept on permanent storage: mounted block stores, remote filesystems or external databases (see the storage sketch after this list). This goes for both input and output.
   1. People or programs need to track machine failures and react to them by releasing those
machines and requesting new ones.
   1. If the cluster is idle, some machines can be decommissioned (see the sketch after this list).
   1. If the cluster is overloaded, some temporary TaskTracker-only servers can be brought up for short periods of time, and killed when no longer needed (also sketched after this list).
   1. If the cluster needs to be expanded for a longer duration, worker nodes acting as both
a DataNode and TaskTracker can be brought up.
-  1. If the entire cluster goes down or restarts, all transient hard disks will be lost,
and all data stored within the HDFS cluster with it.
+  1. If the entire cluster goes down or restarts, all transient hard disks will be lost, and all data stored within the HDFS cluster with it (some cloud vendors treat VM disks as transient and provide a separate reliable storage service, while others do not; this note applies only to the former).
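+
+ For the topology-script point above: if the vendor does expose rack identifiers, a minimal sketch of such a script follows. The mapping file /etc/hadoop/rack-map.txt and its "host rack" line format are made up for illustration; the script would be wired in via topology.script.file.name in core-site.xml.
+
+ {{{
+ #!/bin/sh
+ # Hadoop invokes this with one or more hostnames/IPs as arguments
+ # and expects one rack path printed per argument.
+ while [ $# -gt 0 ]; do
+   host=$1; shift
+   rack=$(awk -v h="$host" '$1 == h {print $2}' /etc/hadoop/rack-map.txt)
+   echo "${rack:-/default-rack}"
+ done
+ }}}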
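+
+ For the firewall point, a hedged iptables sketch: default-deny inbound traffic, then open only the Hadoop 1.x default ports to the cluster's own subnet. The 10.0.0.0/24 subnet is made up, and the real port list depends on your configuration:
+
+ {{{
+ iptables -P INPUT DROP
+ iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 8020  -j ACCEPT   # NameNode RPC
+ iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 50010 -j ACCEPT   # DataNode data transfer
+ iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 50070 -j ACCEPT   # NameNode web UI
+ }}}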
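+
+ For the permanent-storage point, one hedged approach is to attach a block volume and point the HDFS directories at it rather than at the VM's transient disk; the device name and mount point below are made up:
+
+ {{{
+ mkfs.ext3 /dev/xvdf                 # freshly attached block volume
+ mount /dev/xvdf /mnt/persistent
+ # then, in hdfs-site.xml, point dfs.data.dir (and dfs.name.dir on the
+ # NameNode) at directories under /mnt/persistent before starting HDFS
+ }}}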
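+
+ For the decommissioning and temporary-worker points, the sketch below uses stock Hadoop 1.x commands; the exclude-file path and hostname are made up:
+
+ {{{
+ # Decommission an idle DataNode: add it to the file named by
+ # dfs.hosts.exclude, then tell the NameNode to re-read it.
+ echo idle-node-01 >> /etc/hadoop/conf/dfs.exclude
+ hadoop dfsadmin -refreshNodes
+
+ # On a short-lived compute-only VM, run just the TaskTracker daemon,
+ # and stop it when the burst of work is over.
+ hadoop-daemon.sh start tasktracker
+ hadoop-daemon.sh stop tasktracker
+
+ # For a longer-lived worker node, run a DataNode alongside it.
+ hadoop-daemon.sh start datanode
+ }}}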
  
+ The most significant implication is in storage. A core architectural design of both Google's GFS and Hadoop's HDFS is that three-way replication onto local storage is ''a low-cost yet reliable way of storing Petabytes of data.'' This design relies on Hadoop's awareness of the physical topology (rack and host), so that it can place block replicas across racks and hosts and survive host or rack failures. On some cloud vendors' infrastructure this design is no longer valid, because they do not expose physical topology information (even in abstracted form) to the customer. In that case you will be disappointed when one day all your data disappears, and please do not complain if this happens after reading this page: you have been warned. If your cloud vendor does expose this information in some way (and promises the topology is physical, not virtual), or you own your cloud infrastructure, the situation is different: you can still have a Hadoop cluster as reliable as one in a physical environment.
- The most significant implication is in storage. A core architectural design of both Google's
GFS and Hadoop's GFS is that three-way replication onto local storage is ''a low-cost yet
reliable way of storing Petabytes of data.''
- 
- In a cloud infrastructure, this design is no longer valid. If you assume that it does, you
will be disappointed when one day all your data disappears. Please do not complain if this
happens after reading this page: you have been warned.
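+
+ If topology information is available and a script is wired in, one way to check that HDFS really is spreading replicas across racks is ''fsck'' with rack reporting; the path below is just an example, and /default-rack on every replica means no topology is in effect:
+
+ {{{
+ hadoop fsck /user/data -files -blocks -racks
+ }}}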
  
  == Why use Hadoop on Cloud Infrastructures then? ==
  
