hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Virtual Hadoop" by LukeLu
Date Fri, 07 Jun 2013 16:48:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Virtual Hadoop" page has been changed by LukeLu:
https://wiki.apache.org/hadoop/Virtual%20Hadoop?action=diff&rev1=14&rev2=15

  
  For more customized deployments, [[http://whirr.apache.org/|Apache Whirr]] can be used to
bring up VMs, something documented [[https://ccp.cloudera.com/display/CDHDOC/Whirr+Installation|by
Cloudera]]. There have also been demonstrations of alternative systems running on different infrastructures, as shown in [[http://www.slideshare.net/steve_l/farming-hadoop-inthecloud|Farming Hadoop in the Cloud]].
  
- VMware has been active in the area of supporting Hadoop on in virtual infrastructures. 
You can read their take on the [[http://www.vmware.com/files/pdf/Benefits-of-Virtualizing-Hadoop.pdf|benefits
of virtualizing Hadoop]] and also [[http://www.vmware.com/hadoop|other resources]] about deploying
and running Hadoop in virtual infrastructures. It works with Hadoop community on [[https://issues.apache.org/jira/browse/HADOOP-8468|Hadoop
Virtualization Extention]] to enhance Hadoop's topology awareness on virtualized platform.

+ VMware has been active in supporting Hadoop in virtual infrastructures. You can read their take on the [[http://www.vmware.com/files/pdf/Benefits-of-Virtualizing-Hadoop.pdf|benefits of virtualizing Hadoop]] and also [[http://www.vmware.com/hadoop|other resources]] about deploying and running Hadoop in virtual infrastructures. VMware also works with the Hadoop community on the [[https://issues.apache.org/jira/browse/HADOOP-8468|Hadoop Virtualization Extension]] to enhance Hadoop's topology awareness on virtualized platforms; this is part of Apache Hadoop release 1.2.0+.
  
  Does this mean that Hadoop is ideal in virtualized infrastructures? It can be, when properly provisioned. In the cloud? It depends on the cloud provider.
  
@@ -52, +52 @@

  This translates into code features.
   1. HDFS uses local disks for storage, replicating data across machines. 
   1. The MR engine scheduler assumes that the Hadoop work has exclusive use of the server and tries to keep the disks and CPU as busy as possible.
-  1. Leases and timeouts are based on local clocks, not complex distributed system clocks
such as Lamport Clocks. That is in the Hadoop layer, and in the entire network stack, TCP
also uses local clocks.
+  1. Leases and timeouts are based on local clocks, not complex distributed-system clocks such as Lamport clocks. This holds in the Hadoop layer and throughout the network stack - TCP also uses local clocks.
   1. Topology scripts can be written to describe the network topology; these are used to
place data and work.
   1. Data is usually transmitted between machines unencrypted.
   1. Code running on machines in the cluster (including user-supplied MR jobs) can usually be assumed not to be deliberately malicious, unless in secure setups.
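The topology script mentioned above is simply an executable that Hadoop invokes with one or more host names or IP addresses, and which prints one rack path per argument on stdout. A minimal sketch - the subnets and rack names here are illustrative assumptions, not from any real deployment:

```shell
#!/bin/sh
# Hypothetical rack-awareness script. Hadoop passes host names/IPs as
# arguments and expects one rack path per argument on standard output.
# The subnet-to-rack mapping below is an example only.
rack_of() {
  case "$1" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;   # fallback for unknown nodes
  esac
}

for node in "$@"; do
  rack_of "$node"
done
```

Such a script is wired in via the `topology.script.file.name` property in core-site.xml (Hadoop 1.x naming).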
@@ -85, +85 @@

  Ignoring low-level networking/clock issues, what does this mean? (This holds only for some cloud vendors; it may differ for other vendors, or when you own your virtualized infrastructure.)
  
   1. When you request a VM, its performance may vary from previous requests (when isolation features or policies are missing). This can be due to CPU differences, or to other workloads on the same hardware.
-  1. There is no point writing topology scripts, if cloud vendor doesn't expose physical
topology to you in some way. OTOH, [[http://serengeti.cloudfoundry.com/|Project Serengeti]]
configures the topology script automatically for vSphere.
+  1. There is no point writing topology scripts if the cloud vendor doesn't expose the physical topology to you in some way. OTOH, [[http://serengeti.cloudfoundry.com/|Project Serengeti]] configures the topology script automatically for Apache Hadoop 1.2+ on vSphere.
   1. All network ports must be closed by way of firewall and routing information, apart from
those ports critical for Hadoop, which must then run with security on.
   1. All data you wish to keep must be kept on permanent storage: mounted block stores, remote
filesystems or external databases. This goes for both input and output.
   1. People or programs need to track machine failures and react to them by releasing those
machines and requesting new ones.
@@ -100, +100 @@

  
  Having just explained why HDFS might not protect your data when hosted in a cloud infrastructure,
is there any reason to consider it? Yes.
  
-  * For private cloud, where the admins can properly provision virtual infrastructure for
Hadoop
+  * For private cloud, where the admins can properly provision virtual infrastructure for
Hadoop:
-    * HDFS is as reliable and efficient as in physical.
+    * HDFS is as reliable and efficient as on physical hardware, with dedicated and/or shared local storage depending on the isolation requirements.
-    * Virtualization can provide much higher hardware utilization by consolidating multiple
Hadoop clusters and other workload on the same physical cluster
+    * Virtualization can provide higher hardware utilization by consolidating multiple Hadoop clusters and other workloads on the same physical cluster.
-    * Higher performance for some workload (including terasort) than physical for typical
2 CPU socket Hadoop nodes due to better NUMA and disk scheduling
-    * Per tenant VLAN (VXLAN) for better security than typical shared physical Hadoop cluster

+    * [[https://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf|Higher performance for some workloads]] (including terasort) than physical deployments on multi-CPU-socket machines (typically recommended for Hadoop), due to better NUMA control at the hypervisor layer and, with multiple VMs per host, reduced OS cache and IO contention compared with a physical deployment where there is only one OS per host.
+    * Per-tenant VLANs (VXLAN) can provide better security than a typical shared physical Hadoop cluster, especially for YARN (in Hadoop 2+), where new non-MR workloads pose security challenges.
   * Given the choice between a virtual Hadoop and no Hadoop, virtual Hadoop is compelling.
   * Using Apache Hadoop as your MapReduce infrastructure gives you Cloud vendor independence,
and the option of moving to a permanent physical deployment later.
   * It is the only way to execute the tools that work with Hadoop and the layers above it
in a Cloud environment.
@@ -123, +123 @@

  As well as large-scale cloud infrastructures, there is another deployment pattern (typically for development and testing): local VMs on desktop systems or other development machines. This is a good tactic if your physical machines run Windows and you need to bring up a Linux system running Hadoop, and/or you want to simulate the complexity of a small Hadoop cluster.
  
   * Have enough RAM for the VM to not swap.
-  * Don't try and run more than one VM per physical host with less than 2 CPU socket, it
will only make things slower. 
+  * Don't try to run more than one VM per physical host with fewer than 2 CPU cores or limited memory; it will only make things slower.
-  * use host shared folders to access persistent input and output data.
+  * Use host shared folders to access persistent input and output data.
-  * consider making the default filesystem a file: URL so that all storage is really on the
physical host. It's often faster and preserves data better.
+  * Consider making the default filesystem a file: URL so that all storage is really on the
physical host. It's often faster (for Linux guests) and preserves data better.
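One way to make that file: URL the default, as a sketch against the Hadoop 1.x property name (`fs.default.name`; Hadoop 2.x renames it to `fs.defaultFS`):

```xml
<!-- core-site.xml fragment: make the local filesystem the default, so
     job input/output paths resolve to the VM's own disk (and, via shared
     folders, to the physical host). -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
```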
  
  == Summary ==
  
