Mailing-List: contact common-commits-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-dev@hadoop.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: Apache Wiki <wikidiffs@apache.org>
To: Apache Wiki <wikidiffs@apache.org>
Date: Fri, 07 Jun 2013 00:04:58 -0000
Message-ID: <20130607000458.83056.16569@eos.apache.org>
Subject: 
 =?utf-8?q?=5BHadoop_Wiki=5D_Update_of_=22Virtual_Hadoop=22_by_LukeLu?=
Auto-Submitted: auto-generated

The "Virtual Hadoop" page has been changed by LukeLu:
https://wiki.apache.org/hadoop/Virtual%20Hadoop?action=3Ddiff&rev1=3D11&rev=
2=3D12

Comment:
clarify/update cloud/virtual implementations

  =

  A recurrent question on the user mailing lists is "can Hadoop be deployed=
 in virtual infrastructures", or "can you run Hadoop 'in the Cloud'", where=
 cloud means "separate storage and compute services, public or private".
  =

- These are actually two separate, but interrelated, questions, since every=
 cloud infrastructure depends upon virtualization to manage and present an =
aggregation of infrastructure components that can be quickly configured to =
meet a user's need. Cloud and virtualization need to be examined separately=
, but in all cases the answer is "Yes you can virtualize, and yes, you can =
deploy to the cloud, but you need to know the consequences and plan accordi=
ngly".
+ These are actually two separate, but interrelated, questions, since many =
cloud infrastructures depend upon virtualization to manage and present an =
aggregation of infrastructure components that can be quickly configured to =
meet a user's need. Cloud and virtualization need to be examined =
separately, but in all cases the answer is "Yes you can virtualize, and =
yes, you can deploy to the cloud, but you need to know the consequences =
and plan accordingly".
  =

  First, some definitions and background:
  =

@@ -12, +12 @@

  =

  A Private Cloud is a collection of virtualized physical hardware that =
has added services such as catalogs of software or defined platforms that =
a customer can control. A private cloud differs from a public cloud in =
that it is generally owned and/or managed by the same company or group as =
the customer. As an example, if I am responsible for a 100-node physical =
cluster, and I need to share it between Sales and Marketing, who wish to =
perform advanced analytics with Hadoop, and Engineering, which wants to =
model a new production plant, with each getting 50% of the capacity, I =
could virtualize the physical architecture and allow the pool of capacity =
to be shared between the competing groups, perhaps on a shared capacity or =
a swap-in/swap-out basis.
  =

- A Public Cloud is like a Private Cloud but owned and/or managed by an out=
side entity, for example Amazon Elastic Web Services. Public Clouds can pro=
vide cost benefits, either because you only pay for your use, or other's pa=
y for their use, but at the loss of control or of intermingling of data or =
other undesireable issues. Not being able to prove constant custody of some=
 types of data might be a legal liability for certain types of data or indu=
stries (PCI, HIPAA).
+ A Public Cloud is like a Private Cloud but owned and/or managed by an =
outside entity, for example Amazon Web Services. Public Clouds can provide =
cost benefits, either because you only pay for your use, or others pay for =
their use, but at the cost of lost control, intermingling of data, or =
other undesirable issues. Not being able to prove constant custody of =
data might be a legal liability for certain types of data or industries =
(PCI, HIPAA).
  =

  =

  =3D=3D Strengths of VM-hosted Hadoop =3D=3D
  =

-  * A single image can be cloned -lower operations costs.
+  * A single image can be cloned - lower operations costs.
   * Hadoop clusters can be set up on demand.
   * Physical infrastructure can be reused.
   * You only pay for the CPU time you need.
@@ -25, +25 @@

  =

  This sounds appealing, and is exactly what [[http://aws.amazon.com/elasti=
cmapreduce/|Amazon Elastic MapReduce]] offers: a version of Hadoop on a =
pay-as-you-go basis.
  =

- For more customised deployments, [[http://whirr.apache.org/|Apache Whirr]=
] can be used to bring up VMs, something documented [[https://ccp.cloudera.=
com/display/CDHDOC/Whirr+Installation|by Cloudera]]. There have also been d=
emonstrations of alternate systems running on different infrastructures, su=
ch as shown in [[http://www.slideshare.net/steve_l/farming-hadoop-intheclou=
d|Farming Hadoop in the Cloud]].
+ For more customized deployments, [[http://whirr.apache.org/|Apache Whirr]=
] can be used to bring up VMs, something documented [[https://ccp.cloudera.=
com/display/CDHDOC/Whirr+Installation|by Cloudera]]. There have also been d=
emonstrations of alternate systems running on different infrastructures, su=
ch as shown in [[http://www.slideshare.net/steve_l/farming-hadoop-intheclou=
d|Farming Hadoop in the Cloud]].
  =

- VMware has been active in the area of supporting Hadoop on in virtual inf=
rastructures.  You can read their take on the [[http://www.vmware.com/files=
/pdf/Benefits-of-Virtualizing-Hadoop.pdf|benefits of virtualizing Hadoop]] =
and also [[http://www.vmware.com/hadoop|other resources]] about deploying a=
nd running Hadoop in virtual infrastructures. It works with hadoop communit=
y on [[https://issues.apache.org/jira/browse/HADOOP-8468|Hadoop Virtualizat=
ion Extention]] to enhance hadoop's topology awareness on virtualized platf=
orm. =

+ VMware has been active in the area of supporting Hadoop in virtual =
infrastructures. You can read their take on the [[http://www.vmware.com/fi=
les/pdf/Benefits-of-Virtualizing-Hadoop.pdf|benefits of virtualizing =
Hadoop]] and also [[http://www.vmware.com/hadoop|other resources]] about =
deploying and running Hadoop in virtual infrastructures. It works with the =
Hadoop community on [[https://issues.apache.org/jira/browse/HADOOP-8468|Ha=
doop Virtualization Extensions]] to enhance Hadoop's topology awareness on =
virtualized platforms. =

  =

- Does this mean that Hadoop is ideal in Cloud infrastructures? No ''but it=
 can be done.''
+ Does this mean that Hadoop is ideal in virtualized infrastructures? It =
can be when properly provisioned. Cloud? It depends on the cloud providers.
  =

  =3D=3D=3D Hadoop's Assumptions about its infrastructure =3D=3D=3D
  =

@@ -38, +38 @@

   1. A large cluster of physical servers, which may reboot, but generally =
recover, with all their local on-server HDD storage.
   1. Non-RAIDed Hard disks in the servers. This is the lowest cost per =
Terabyte of any storage. It has good (local) bandwidth when retrieving =
sequential data, but once the disks start seeking for the next blocks, =
performance suffers badly.
   1. Dedicated CPUs; the CPU types are known and clusters are (usually) bu=
ilt from homogeneous hardware. =

-  1. Servers with monotonically increasing clocks, roughly synchronised vi=
a an NTP server. That is: time goes forward, on all servers simultaneously.
+  1. Servers with monotonically increasing clocks, roughly synchronized vi=
a an NTP server. That is: time goes forward, on all servers simultaneously.
   1. Dedicated network with exclusive use of a high-performance switch, =
fast 1-10 Gb/s server Ethernet and faster 10+ Gb/s "backplane" =
interconnect between racks. =

-  1. A static network topology: servers do not move around.
+  1. A relatively static data network topology: data nodes do not move =
around.
   1. Exclusive use of the network by trusted users. =

-  1. High performance infrastructure services aid Hadoop (DNS, reverse DNS=
, NFS storage for NameNode snapshots)
+  1. High performance infrastructure services (DNS, reverse DNS, NFS stora=
ge for NameNode snapshots)
   1. The primary failure modes of machines are HDD failures, recurring =
memory failures, or overheating damage caused by fan failures.
-  1. Machine failures are normally independent, with the exception of the =
failure of Top of Rack switches, which can take a whole rack offline. Route=
r/Switch misconfigurations can have a similar effect.
+  1. Machine failures are normally independent, with the exception of the =
failure of Top of Rack switches, which can take a whole rack offline. Route=
r/Switch misconfiguration can have a similar effect.
-  1. If the entire datacenter restarts, almost all the machines will come =
back up -along with their data.
+  1. If the entire datacenter restarts, almost all the machines will come =
back up along with their data.
  =

  =3D=3D=3D Hadoop's implementation details =3D=3D=3D
  =

  This translates into code features.
   1. HDFS uses local disks for storage, replicating data across machines. =

   1. The MR engine scheduler assumes that the Hadoop work has exclusive =
use of the server and tries to keep the disks and CPU as busy as possible.
-  1. Leases and timeouts are based on local clocks, not complex distribute=
d system clocks such as Lamport Clocks. That is in the Hadoop layer, and in=
 the entire network stack -TCP also uses local clocks.
+  1. Leases and timeouts are based on local clocks, not complex =
distributed system clocks such as Lamport Clocks. This holds in the Hadoop =
layer and throughout the network stack: TCP also uses local clocks.
   1. Topology scripts can be written to describe the network topology; the=
se are used to place data and work.
-  1. Data can be transmitted between machines unencrypted.
+  1. Data is usually transmitted between machines unencrypted.
-  1. Code running on machines in the cluster (including user-supplied MR j=
obs), can be assumed to not be deliberately malicious.
+  1. Code running on machines in the cluster (including user-supplied MR =
jobs) can usually be assumed not to be deliberately malicious, except in =
secure setups.
   1. Missing hard disks are usually missing because they have failed, so t=
he data stored on them should be replicated and the disk left alone.
   1. Servers that are consistently slow to complete jobs should be blackli=
sted: no new work should be sent to them. =

-  1. The JobTracker should try and keep the cluster as busy as possible, t=
o maximise ROI on the servers and datacenter.
+  1. The JobTracker should try and keep the cluster as busy as possible, t=
o maximize ROI on the servers and datacenter.
   1. When a JobTracker has no work to perform, the servers are left idle. =

   1. If the entire datacenter restarts, the filesystem can recover, provid=
ed you have set up the NameNode and Secondary NameNode properly.
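
The topology scripts mentioned in the list above can be sketched as
follows. This is a minimal illustration, not code from Hadoop itself. It
assumes the usual contract: Hadoop runs the script named by
`topology.script.file.name` (Hadoop 1.x) or `net.topology.script.file.name`
(Hadoop 2.x) with one or more IP addresses or hostnames as arguments, and
reads one rack path per argument from standard output. The `RACKS` table
and its addresses are hypothetical.

```python
#!/usr/bin/env python
"""Hypothetical rack-awareness script: maps node addresses to rack paths."""
import sys

# Hypothetical mapping; a real deployment would derive it from its inventory.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"  # fallback for nodes missing from the table


def resolve(nodes):
    """Return one rack path per node, in the same order as the input."""
    return [RACKS.get(node, DEFAULT_RACK) for node in nodes]


if __name__ == "__main__":
    # Hadoop may pass several addresses in a single invocation.
    print("\n".join(resolve(sys.argv[1:])))
```

In a virtual infrastructure the table is only useful if it reflects the
physical hosts and racks underneath the VMs, which is the point the
following sections make.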
  =

  =3D=3D=3D How a virtual infrastructure differs from a physical datacenter=
 =3D=3D=3D
  =

- Hadoop's assumptions about a datacenter do not hold in a virtualized envi=
ronment.
+ Hadoop's assumptions about a datacenter do not always hold in a virtualiz=
ed environment.
-  1. Storage is usually one or more of transient virtual drives, transient=
 local physical drives, persistent local virtual drives, or remote SAN-moun=
ted block stores or file systems.
+  1. Storage could be one or more of transient virtual drives, transient l=
ocal physical drives, persistent local virtual drives, or remote SAN-mounte=
d block stores or file systems.
-  1. Storage in virtual hard drives may cause a lot of seeking, even if it=
 appears to be sequential access to the VM.
+  1. Storage in virtual hard drives might cause a lot of seeking if they s=
hare the same physical hard drive, even if it appears to be sequential acce=
ss to the VM.
   1. Networking may be slower and throttled by the infrastructure provider.
-  1. Virtual Machines are requested on demand from the infrastructure -the=
 machines will be allocated anywhere in the infrastructure, possibly on ser=
vers running other VMs at the same time.
+  1. Virtual Machines are requested on demand from the infrastructure: the=
 machines could be allocated anywhere in the infrastructure, possibly on se=
rvers running other VMs at the same time.
-  1. The other VMs may be heavy CPU and network users, which can cause the=
 Hadoop jobs to suffer. Alternatively, the heavy CPU and network load of Ha=
doop can cause problems for the other users of the server.
+  1. The other VMs may be heavy resource (CPU, IO and network) users, whic=
h could cause the Hadoop jobs to suffer. OTOH, the heavy load of Hadoop cou=
ld cause problems for the other users of the server, if the underlying hype=
rvisor lacks proper isolation features and/or policies.
-  1. VMs can be suspended and restarted without OS notification, this can =
cause clocks to move forward in jumps of many seconds. =

+  1. VMs could be suspended and restarted without OS notification; this =
can cause clocks to move forward in jumps of many seconds. =

-  1. Other users on the network may be able to listen to traffic, to disru=
pt it, and to access ports that are not authenticating all access.
+  1. If the Hadoop clusters share the VLAN with other users (which is not =
recommended), other users on the network may be able to listen to traffic, =
to disrupt it, and to access ports that are not authenticating all access.
   1. Some infrastructures may move VMs around; this can actually move cloc=
ks backwards when the new physical host's clock is behind that of the origi=
nal host.
-  1. Replication to (transient) hard drives is no longer a reliable way to=
 persist data.
+  1. Replication to transient hard drives is no longer a reliable way to p=
ersist data.
-  1. The network topology is not visible to the Hadoop cluster, though lat=
ency and bandwidth tests may be used to infer "closeness", to build a de-fa=
cto topology.
+  1. On some cloud providers, network topology may not be visible to the =
Hadoop cluster, though latency and bandwidth tests may be used to infer =
"closeness", to build a de-facto topology.
   1. The correct way to deal with a VM that is showing recurring failures =
is to release the VM and ask for a new one, instead of blacklisting it. =

   1. The JobTracker may want to request extra VMs when there is extra dema=
nd.
   1. The JobTracker may want to release VMs when there is idle time.
-  1. A failure of the hosting infrastructure can lose all machines simulta=
neously.
+  1. Like all hosted services, a failure of the hosting infrastructure =
could lose all machines simultaneously, though not necessarily permanently.
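
The clock behaviour described in the list above (forward jumps after a
suspend, backward steps after a migration) can be observed from inside a
guest. The following is a minimal sketch, assuming Python 3.3 or later; the
names `classify_clock_step` and `sample` and the 2-second tolerance are
illustrative and not part of Hadoop. It compares how far the wall clock and
a monotonic clock advanced over the same interval. Whether the monotonic
clock itself pauses during a VM suspend depends on the hypervisor and guest
OS, so this only detects steps of the wall clock relative to it.

```python
import time

TOLERANCE_S = 2.0  # hypothetical threshold; tune for your environment


def classify_clock_step(wall_delta, mono_delta, tolerance=TOLERANCE_S):
    """Compare wall-clock and monotonic elapsed time over one interval."""
    drift = wall_delta - mono_delta
    if drift > tolerance:
        return "forward-jump"
    if drift < -tolerance:
        return "backward-step"
    return "ok"


def sample(interval=1.0):
    """Sleep for `interval` seconds and classify what the clocks did."""
    wall0, mono0 = time.time(), time.monotonic()
    time.sleep(interval)
    wall_delta = time.time() - wall0
    mono_delta = time.monotonic() - mono0
    return classify_clock_step(wall_delta, mono_delta)
```

Calling `sample()` in a loop on a worker VM gives a rough idea of how often
lease and timeout logic based on local clocks could be misled.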
  =

  =3D=3D Implications =3D=3D
  =

  Ignoring low-level networking/clock issues, what does this mean? (This =
is only valid for some cloud vendors; it may be different for other cloud =
vendors or if you own your virtualized infrastructure.)
  =

-  1. When you request a VM, it's performance may vary from previous reques=
ts (if no strong SLA restriction, like Elastic...). This can be due to CPU =
differences, or the other workloads.
+  1. When you request a VM, its performance may vary from previous =
requests (when isolation features or policies are lacking). This can be =
due to CPU differences, or the other workloads.
-  1. There is no point writing topology scripts (if cloud vendor doesn't e=
xpose physical topology to you in some way).
+  1. There is no point writing topology scripts if the cloud vendor =
doesn't expose physical topology to you in some way. On the other hand, =
[[http://serengeti.cloudfoundry.com/|Project Serengeti]] configures the =
topology script automatically for vSphere.
   1. All network ports must be closed by way of firewall and routing =
information, apart from those ports critical for Hadoop, which must then =
run with security on.
   1. All data you wish to keep must be kept on permanent storage: mounted =
block stores, remote filesystems or external databases. This goes for both =
input and output.
   1. People or programs need to track machine failures and react to them b=
y releasing those machines and requesting new ones.
-  1. If the cluster is idle. some machines can be decomissioned.
+  1. If the cluster is idle, some machines can be decommissioned.
   1. If the cluster is overloaded, some temporary TaskTracker-only =
servers can be brought up for short periods of time, and killed when no =
longer needed.
   1. If the cluster needs to be expanded for a longer duration, worker nod=
es acting as both a DataNode and TaskTracker can be brought up.
   1. If the entire cluster goes down or restarts, all transient hard =
disks will be lost, and all data stored within the HDFS cluster with them. =
(Some cloud vendors treat VM disks as transient and provide a separate =
reliable storage service; others do not. This note applies only to the =
former.)
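
The idle and overloaded cases in the list above amount to a simple scaling
policy. A rough sketch follows, assuming Python 3; the function
`plan_scaling`, its parameters and the slot arithmetic are hypothetical and
are not taken from Hadoop or any cloud provider's API. It only decides how
many TaskTracker-only VMs to request or release. DataNodes are left out
because removing them involves HDFS decommissioning.

```python
import math


def plan_scaling(pending_tasks, slots_per_node, compute_nodes, min_nodes=0):
    """Return the change in TaskTracker-only VM count (+grow, -shrink)."""
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    # Nodes needed to run every pending task at once, never below the floor.
    wanted = max(min_nodes, math.ceil(pending_tasks / slots_per_node))
    return wanted - compute_nodes
```

A positive result means that many VMs would be requested from the
infrastructure, a negative one that they could be released to save money.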
  =

- The most significant implication is in storage. A core architectural desi=
gn of both Google's GFS and Hadoop's GFS is that three-way replication onto=
 local storage is ''a low-cost yet reliable way of storing Petabytes of dat=
a.'' This design is based on physical topology (rack and host) awareness of=
 hadoop so it can smartly place data block across rack and host to get surv=
ival from host/rack failure. In some cloud vendors' infrastructure, this de=
sign may no longer valid as they don't expose physical topology (even abstr=
acted) info to customer. In this case, you will be disappointed when one da=
y all your data disappears and please do not complain if this happens after=
 reading this page: you have been warned. If your cloud vendor do expose th=
is info in someway (and promise they are physical but not virtual) or you o=
wn your cloud infrastructure, the situation is different that you still can=
 have a reliable hadoop cluster like in physical environment.
+ The most significant implication is in storage. A core architectural =
design of both Google's GFS and Hadoop's HDFS is that three-way =
replication onto local storage is ''a low-cost yet reliable way of storing =
Petabytes of data.'' This design relies on Hadoop's awareness of the =
physical topology (rack and host), so that it can place data blocks across =
racks and hosts and survive a host or rack failure. In some cloud vendors' =
infrastructure, this design may no longer be valid, as they don't expose =
physical topology information (even abstracted) to the customer. In this =
case, you will be disappointed when one day all your data disappears. =
Please do not complain if this happens after reading this page: you have =
been warned. If your cloud vendor does expose this information in some way =
(or promises that it is physical rather than virtual), or you own your =
cloud infrastructure, the situation is different: you can still have a =
Hadoop cluster as reliable as one in a physical environment.
  =

  =3D=3D Why use Hadoop on Cloud Infrastructures then? =3D=3D
  =

- Having just explained why HDFS does not protect your data when hosted in =
a cloud infrastructure, is there any reason to consider it? Yes.
+ Having just explained why HDFS might not protect your data when hosted in=
 a cloud infrastructure, is there any reason to consider it? Yes.
  =

+  * For private cloud, where the admins can properly provision virtual =
infrastructure for Hadoop:
+    * HDFS is as reliable and efficient as on physical hardware.
+    * Virtualization can provide much higher hardware utilization by =
consolidating multiple Hadoop clusters and other workloads on the same =
physical cluster.
+    * Higher performance for some workloads (including terasort) than =
physical for typical 2-CPU-socket Hadoop nodes, due to better NUMA and =
disk scheduling.
+    * Per-tenant VLAN via SDN for better security than a typical shared =
physical Hadoop cluster. =

   * Given the choice between a virtual Hadoop and no Hadoop, virtual Hadoo=
p is compelling.
   * Using Apache Hadoop as your MapReduce infrastructure gives you Cloud v=
endor independence, and the option of moving to a permanent physical deploy=
ment later.
   * It is the only way to execute the tools that work with Hadoop and the =
layers above it in a Cloud environment.
-  * If you store your persistent data in a cloud-hosted storage infrastruc=
ture, analysing the data in the provider's computation infrastructure is th=
e most cost-effective way to do so.
+  * If you store your persistent data in a cloud-hosted storage infrastruc=
ture, analyzing the data in the provider's computation infrastructure is th=
e most cost-effective way to do so.
  =

- You just need to recognise the limitations and accept them:
+ You just need to recognize the limitations and accept them:
-  * Treat the HDFS filesystem and local disks as transient storage; keep t=
he persistent data elsewhere.
+  * For vendors like AWS, treat the HDFS filesystem and local disks as tra=
nsient storage; keep the persistent data elsewhere.
-  * Expect reduced performance and try to compensate by allocating more VM=
s.
+  * For public cloud, expect reduced performance and try to compensate by =
allocating more VMs.
   * Save money by shutting down the cluster when not needed.
   * Don't be surprised if different instances of the clusters have =
different performance, or if a cluster's performance varies from time to =
time.
-  * The cost of persistent data will probably be higher than if you built =
up a physical cluster with the same amount of storage. This will not be an =
issue until your dataset is measure in many Terabytes, or even Petabytes. =

-  * Over time, dataset size grows, often at a predictable rate. That stora=
ge cost may dominate over time. Compress your data even when stored on the =
service provider's infrastructure.
+  * For public cloud, the cost of persistent data will probably be higher =
than if you built up a physical cluster with the same amount of storage. =
This will not be an issue until your dataset is measured in many =
Terabytes, or even Petabytes. =

+  * For public cloud, dataset size grows over time, often at a =
predictable rate, so storage cost may eventually dominate. Compress your =
data even when stored on the service provider's infrastructure.
  =

  =3D=3D Hosting on local VMs =3D=3D
  =

- As well as large-scale cloud infrastructures, there is another deployment=
 pattern: local VMs on desktop systems or other development machines. This =
is a good tactic if your physical machines run windows and you need to brin=
g up a Linux system running Hadoop, and/or you want to simulate the complex=
ity of a small Hadoop cluster.
+ As well as large-scale cloud infrastructures, there is another deployment=
 pattern (typically for development and testing): local VMs on desktop =
systems or other development machines. This is a good tactic if your =
physical machines run Windows and you need to bring up a Linux system =
running Hadoop, and/or you want to simulate the complexity of a small =
Hadoop cluster.
  =

   * Have enough RAM for the VM to not swap.
-  * Don't try and run more than one VM per physical host, it will only mak=
e things slower. =

+  * Don't try to run more than one VM per physical host with fewer than 2 =
CPU sockets; it will only make things slower. =

-  * use file: URLs to access persistent input and output data.
+  * Use host shared folders to access persistent input and output data.
   * Consider making the default filesystem a file: URL so that all storage=
 is really on the physical host. It's often faster and preserves data =
better.
  =

  =3D=3D Summary =3D=3D
  =

- You can bring up Hadoop in cloud infrastructures, and sometimes it makes =
sense, for development and production. For production use, be aware that th=
e differences between physical and virtual infrastructures can threaten you=
r data integrity and security - and you must plan for that. =

+ You can bring up Hadoop in virtualized infrastructures. Sometimes it even=
 makes sense for public cloud, for development and production. For =
production use, be aware that the differences between physical and virtual =
infrastructures could pose additional risks to your data integrity and =
security without proper planning and provisioning. =

 =20