Return-Path: X-Original-To: apmail-hadoop-common-commits-archive@www.apache.org Delivered-To: apmail-hadoop-common-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id AA19F10278 for ; Fri, 7 Jun 2013 00:05:26 +0000 (UTC) Received: (qmail 96408 invoked by uid 500); 7 Jun 2013 00:05:26 -0000 Delivered-To: apmail-hadoop-common-commits-archive@hadoop.apache.org Received: (qmail 96334 invoked by uid 500); 7 Jun 2013 00:05:25 -0000 Mailing-List: contact common-commits-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-dev@hadoop.apache.org Delivered-To: mailing list common-commits@hadoop.apache.org Received: (qmail 96327 invoked by uid 500); 7 Jun 2013 00:05:25 -0000 Delivered-To: apmail-hadoop-core-commits@hadoop.apache.org Received: (qmail 96324 invoked by uid 99); 7 Jun 2013 00:05:25 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 07 Jun 2013 00:05:25 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.131] (HELO eos.apache.org) (140.211.11.131) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 07 Jun 2013 00:05:19 +0000 Received: from eos.apache.org (localhost [127.0.0.1]) by eos.apache.org (Postfix) with ESMTP id 1844997A; Fri, 7 Jun 2013 00:04:59 +0000 (UTC) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Apache Wiki To: Apache Wiki Date: Fri, 07 Jun 2013 00:04:58 -0000 Message-ID: <20130607000458.83056.16569@eos.apache.org> Subject: =?utf-8?q?=5BHadoop_Wiki=5D_Update_of_=22Virtual_Hadoop=22_by_LukeLu?= Auto-Submitted: auto-generated X-Virus-Checked: Checked by ClamAV on apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for ch= ange notification. The "Virtual Hadoop" page has been changed by LukeLu: https://wiki.apache.org/hadoop/Virtual%20Hadoop?action=3Ddiff&rev1=3D11&rev= 2=3D12 Comment: clarify/update cloud/virtual implementations = A recurrent question on the user mailing lists is "can Hadoop be deployed= in virtual infrastructures", or "can you run Hadoop 'in the Cloud'", where= cloud means "separate storage and compute services, public or private". = - These are actually two separate, but interrelated, questions, since every= cloud infrastructure depends upon virtualization to manage and present an = aggregation of infrastructure components that can be quickly configured to = meet a user's need. Cloud and virtualization need to be examined separately= , but in all cases the answer is "Yes you can virtualize, and yes, you can = deploy to the cloud, but you need to know the consequences and plan accordi= ngly". + These are actually two separate, but interrelated, questions, since many = cloud infrastructure depend upon virtualization to manage and present an ag= gregation of infrastructure components that can be quickly configured to me= et a user's need. Cloud and virtualization need to be examined separately, = but in all cases the answer is "Yes you can virtualize, and yes, you can de= ploy to the cloud, but you need to know the consequences and plan according= ly". = First, some definitions and background: = @@ -12, +12 @@ = A Private Cloud is a collection of virtualized physical hardware that has= added services such as catalogs of software or defined platforms that a cu= stomer can control. A private cloud differs from a public cloud in that it = is generally owned and or managed by the same company or group as the custo= mer. As an example, if I am responsible for a 100-node physical cluster, an= d I need to share it between Sales and Marketing that wish to perform advan= ced analytics with Hadoop and Engineering that wants to perform modelling o= f a new production plant, with each getting 50% of the capacity, I could vi= rtualize the physical architecture and allow the pool of capacity to be sha= red between the competing groups, perhaps on a shared capacity or a swap-in= /swap-out basis. = - A Public Cloud is like a Private Cloud but owned and/or managed by an out= side entity, for example Amazon Elastic Web Services. Public Clouds can pro= vide cost benefits, either because you only pay for your use, or other's pa= y for their use, but at the loss of control or of intermingling of data or = other undesireable issues. Not being able to prove constant custody of some= types of data might be a legal liability for certain types of data or indu= stries (PCI, HIPAA). + A Public Cloud is like a Private Cloud but owned and/or managed by an out= side entity, for example Amazon Web Services. Public Clouds can provide cos= t benefits, either because you only pay for your use, or other's pay for th= eir use, but at the loss of control or of intermingling of data or other un= desirable issues. Not being able to prove constant custody of some types of= data might be a legal liability for certain types of data or industries (P= CI, HIPAA). = = =3D=3D Strengths of VM-hosted Hadoop =3D=3D = - * A single image can be cloned -lower operations costs. + * A single image can be cloned - lower operations costs. * Hadoop clusters can be set up on demand. * Physical infrastructure can be reused. * You only pay for the CPU time you need. @@ -25, +25 @@ = This sounds appealing, and is exactly what [[http://aws.amazon.com/elasti= cmapreduce/|Amazon Elastic MapReduce]] offers: a version of Hadoop on a Pay= -as-you-go-basis. = - For more customised deployments, [[http://whirr.apache.org/|Apache Whirr]= ] can be used to bring up VMs, something documented [[https://ccp.cloudera.= com/display/CDHDOC/Whirr+Installation|by Cloudera]]. There have also been d= emonstrations of alternate systems running on different infrastructures, su= ch as shown in [[http://www.slideshare.net/steve_l/farming-hadoop-intheclou= d|Farming Hadoop in the Cloud]]. + For more customized deployments, [[http://whirr.apache.org/|Apache Whirr]= ] can be used to bring up VMs, something documented [[https://ccp.cloudera.= com/display/CDHDOC/Whirr+Installation|by Cloudera]]. There have also been d= emonstrations of alternate systems running on different infrastructures, su= ch as shown in [[http://www.slideshare.net/steve_l/farming-hadoop-intheclou= d|Farming Hadoop in the Cloud]]. = - VMware has been active in the area of supporting Hadoop on in virtual inf= rastructures. You can read their take on the [[http://www.vmware.com/files= /pdf/Benefits-of-Virtualizing-Hadoop.pdf|benefits of virtualizing Hadoop]] = and also [[http://www.vmware.com/hadoop|other resources]] about deploying a= nd running Hadoop in virtual infrastructures. It works with hadoop communit= y on [[https://issues.apache.org/jira/browse/HADOOP-8468|Hadoop Virtualizat= ion Extention]] to enhance hadoop's topology awareness on virtualized platf= orm. = + VMware has been active in the area of supporting Hadoop on in virtual inf= rastructures. You can read their take on the [[http://www.vmware.com/files= /pdf/Benefits-of-Virtualizing-Hadoop.pdf|benefits of virtualizing Hadoop]] = and also [[http://www.vmware.com/hadoop|other resources]] about deploying a= nd running Hadoop in virtual infrastructures. It works with Hadoop communit= y on [[https://issues.apache.org/jira/browse/HADOOP-8468|Hadoop Virtualizat= ion Extention]] to enhance Hadoop's topology awareness on virtualized platf= orm. = = - Does this mean that Hadoop is ideal in Cloud infrastructures? No ''but it= can be done.'' + Does this mean that Hadoop is ideal in virtualized infrastuctures? It can= be when properly provisioned. Cloud? It depends on the cloud providers. = =3D=3D=3D Hadoop's Assumptions about its infrastructure =3D=3D=3D = @@ -38, +38 @@ 1. A large cluster of physical servers, which may reboot, but generally = recover, with all their local on-server HDD storage. 1. Non-RAIDed Hard disks in the servers. This is the lowest cost per Ter= abyte of any storage. It has good (local) bandwidth when retrieving sequent= ial data: once the disks start seeking for the next blocks, performance suf= fers badly. 1. Dedicated CPUs; the CPU types are known and clusters are (usually) bu= ilt from homogeneous hardware. = - 1. Servers with monotonically increasing clocks, roughly synchronised vi= a an NTP server. That is: time goes forward, on all servers simultaneously. + 1. Servers with monotonically increasing clocks, roughly synchronized vi= a an NTP server. That is: time goes forward, on all servers simultaneously. 1. Dedicated network with exclusive use of a high-performance switch, fa= st 1-10 Gb/s server Ethernet and faster 10 + Gb/s "backplane" interconnect = between racks. = - 1. A static network topology: servers do not move around. + 1. A relative static data network topology: data nodes do not move aroun= d. 1. Exclusive use of the network by trusted users. = - 1. High performance infrastructure services aid Hadoop (DNS, reverse DNS= , NFS storage for NameNode snapshots) + 1. High performance infrastructure services (DNS, reverse DNS, NFS stora= ge for NameNode snapshots) 1. The primary failure modes of machines are HDD failures, re-occurring = memory failures, or overheating damage caused by fan failures. - 1. Machine failures are normally independent, with the exception of the = failure of Top of Rack switches, which can take a whole rack offline. Route= r/Switch misconfigurations can have a similar effect. + 1. Machine failures are normally independent, with the exception of the = failure of Top of Rack switches, which can take a whole rack offline. Route= r/Switch misconfiguration can have a similar effect. - 1. If the entire datacenter restarts, almost all the machines will come = back up -along with their data. + 1. If the entire datacenter restarts, almost all the machines will come = back up along with their data. = =3D=3D=3D Hadoop's implementation details =3D=3D=3D = This translates into code features. 1. HDFS uses local disks for storage, replicating data across machines. = 1. The MR engine scheduler that assumes that the Hadoop work has exclusi= ve use of the server and tries to keep the disks and CPU as busy as possibl= e. - 1. Leases and timeouts are based on local clocks, not complex distribute= d system clocks such as Lamport Clocks. That is in the Hadoop layer, and in= the entire network stack -TCP also uses local clocks. + 1. Leases and timeouts are based on local clocks, not complex distribute= d system clocks such as Lamport Clocks. That is in the Hadoop layer, and in= the entire network stack, TCP also uses local clocks. 1. Topology scripts can be written to describe the network topology; the= se are used to place data and work. - 1. Data can be transmitted between machines unencrypted. + 1. Data is usually transmitted between machines unencrypted - 1. Code running on machines in the cluster (including user-supplied MR j= obs), can be assumed to not be deliberately malicious. + 1. Code running on machines in the cluster (including user-supplied MR j= obs), can usually be assumed to not be deliberately malicious, unless in se= cure setups. 1. Missing hard disks are usually missing because they have failed, so t= he data stored on them should be replicated and the disk left alone. 1. Servers that are consistently slow to complete jobs should be blackli= sted: no new work should be sent to them. = - 1. The JobTracker should try and keep the cluster as busy as possible, t= o maximise ROI on the servers and datacenter. + 1. The JobTracker should try and keep the cluster as busy as possible, t= o maximize ROI on the servers and datacenter. 1. When a JobTracker has no work to perform, the servers are left idle. = 1. If the entire datacenter restarts, the filesystem can recover, provid= ed you have set up the NameNode and Secondary NameNode properly. = =3D=3D=3D How a virtual infrastructure differs from a physical datacenter= =3D=3D=3D = - Hadoop's assumptions about a datacenter do not hold in a virtualized envi= ronment. + Hadoop's assumptions about a datacenter do not always hold in a virtualiz= ed environment. - 1. Storage is usually one or more of transient virtual drives, transient= local physical drives, persistent local virtual drives, or remote SAN-moun= ted block stores or file systems. + 1. Storage could be one or more of transient virtual drives, transient l= ocal physical drives, persistent local virtual drives, or remote SAN-mounte= d block stores or file systems. - 1. Storage in virtual hard drives may cause a lot of seeking, even if it= appears to be sequential access to the VM. + 1. Storage in virtual hard drives might cause a lot of seeking if they s= hare the same physical hard drive, even if it appears to be sequential acce= ss to the VM. 1. Networking may be slower and throttled by the infrastructure provider. - 1. Virtual Machines are requested on demand from the infrastructure -the= machines will be allocated anywhere in the infrastructure, possibly on ser= vers running other VMs at the same time. + 1. Virtual Machines are requested on demand from the infrastructure: the= machines could be allocated anywhere in the infrastructure, possibly on se= rvers running other VMs at the same time. - 1. The other VMs may be heavy CPU and network users, which can cause the= Hadoop jobs to suffer. Alternatively, the heavy CPU and network load of Ha= doop can cause problems for the other users of the server. + 1. The other VMs may be heavy resource (CPU, IO and network) users, whic= h could cause the Hadoop jobs to suffer. OTOH, the heavy load of Hadoop cou= ld cause problems for the other users of the server, if the underlying hype= rvisor lacks proper isolation features and/or policies. - 1. VMs can be suspended and restarted without OS notification, this can = cause clocks to move forward in jumps of many seconds. = + 1. VMs could be suspended and restarted without OS notification, this ca= n cause clocks to move forward in jumps of many seconds. = - 1. Other users on the network may be able to listen to traffic, to disru= pt it, and to access ports that are not authenticating all access. + 1. If the Hadoop clusters share the VLAN with other users (which is not = recommended), other users on the network may be able to listen to traffic, = to disrupt it, and to access ports that are not authenticating all access. 1. Some infrastructures may move VMs around; this can actually move cloc= ks backwards when the new physical host's clock is behind that of the origi= nal host. - 1. Replication to (transient) hard drives is no longer a reliable way to= persist data. + 1. Replication to transient hard drives is no longer a reliable way to p= ersist data. - 1. The network topology is not visible to the Hadoop cluster, though lat= ency and bandwidth tests may be used to infer "closeness", to build a de-fa= cto topology. + 1. On some cloud providers, network topology may not visible to the Hado= op cluster, though latency and bandwidth tests may be used to infer "closen= ess", to build a de-facto topology. 1. The correct way to deal with a VM that is showing re-occuring failure= s is to release the VM and ask for a new one, instead of blacklisting it. = 1. The JobTracker may want to request extra VMs when there is extra dema= nd. 1. The JobTracker may want to release VMs when there is idle time. - 1. A failure of the hosting infrastructure can lose all machines simulta= neously. + 1. Like all hosted services, a failure of the hosting infrastructure cou= ld lose all machines simultaneously though not necessarily permanently. = =3D=3D Implications =3D=3D = Ignoring low-level networking/clock issues, what does this mean? (Only va= lid for some cloud vendors, it may be different for other cloud vendors or = you own your virtualized infrastructure.) = - 1. When you request a VM, it's performance may vary from previous reques= ts (if no strong SLA restriction, like Elastic...). This can be due to CPU = differences, or the other workloads. + 1. When you request a VM, it's performance may vary from previous reques= ts (when lack of isolation feature/policy). This can be due to CPU differen= ces, or the other workloads. - 1. There is no point writing topology scripts (if cloud vendor doesn't e= xpose physical topology to you in some way). + 1. There is no point writing topology scripts, if cloud vendor doesn't e= xpose physical topology to you in some way. OTOH, [http://serengeti.cloudfo= undry.com/ Project Serengeti] configures the topology script automatically = for vSphere. 1. All network ports must be closed by way of firewall and routing infor= mation, apart from those ports critical for Hadoop -which must then run wit= h security on. 1. All data you wish to keep must be kept on permanent storage: mounted = block stores, remote filesystems or external databases. This goes for both = input and output. 1. People or programs need to track machine failures and react to them b= y releasing those machines and requesting new ones. - 1. If the cluster is idle. some machines can be decomissioned. + 1. If the cluster is idle. some machines can be decommissioned. 1. If the cluster is overloaded, some temporary TaskTracker only servers= can be brought up for short periods of time, and killed when no longer nee= ded. 1. If the cluster needs to be expanded for a longer duration, worker nod= es acting as both a DataNode and TaskTracker can be brought up. 1. If the entire cluster goes down or restarts, all transient hard disks= will be lost (some cloud vendors treat VM disk as transient and provide ot= her reliable storage service, but others are not. This note is only for pre= vious vendor), and all data stored within the HDFS cluster with it. = - The most significant implication is in storage. A core architectural desi= gn of both Google's GFS and Hadoop's GFS is that three-way replication onto= local storage is ''a low-cost yet reliable way of storing Petabytes of dat= a.'' This design is based on physical topology (rack and host) awareness of= hadoop so it can smartly place data block across rack and host to get surv= ival from host/rack failure. In some cloud vendors' infrastructure, this de= sign may no longer valid as they don't expose physical topology (even abstr= acted) info to customer. In this case, you will be disappointed when one da= y all your data disappears and please do not complain if this happens after= reading this page: you have been warned. If your cloud vendor do expose th= is info in someway (and promise they are physical but not virtual) or you o= wn your cloud infrastructure, the situation is different that you still can= have a reliable hadoop cluster like in physical environment. + The most significant implication is in storage. A core architectural desi= gn of both Google's GFS and Hadoop's GFS is that three-way replication onto= local storage is ''a low-cost yet reliable way of storing Petabytes of dat= a.'' This design is based on physical topology (rack and host) awareness of= hadoop so it can smartly place data block across rack and host to get surv= ival from host/rack failure. In some cloud vendors' infrastructure, this de= sign may no longer valid as they don't expose physical topology (even abstr= acted) info to customer. In this case, you will be disappointed when one da= y all your data disappears and please do not complain if this happens after= reading this page: you have been warned. If your cloud vendor do expose th= is info in someway (or promise they are physical but not virtual) or you ow= n your cloud infrastructure, the situation is different that you still can = have a reliable Hadoop cluster like in physical environment. = =3D=3D Why use Hadoop on Cloud Infrastructures then? =3D=3D = - Having just explained why HDFS does not protect your data when hosted in = a cloud infrastructure, is there any reason to consider it? Yes. + Having just explained why HDFS might not protect your data when hosted in= a cloud infrastructure, is there any reason to consider it? Yes. = + * For private cloud, where the admins can properly provision virtual inf= rastructure for Hadoop + * HDFS is as reliable and efficient as in physical. + * Virtualization can provide much higher hardware utilization by conso= lidating multiple Hadoop clusters and other workload on the same physical c= luster + * Higher performance for some workload (including terasort) than physi= cal for typical 2 CPU socket Hadoop nodes due to better NUMA and disk sched= uling + * Per tenant VLAN via SDN for better security than typical shared phys= ical Hadoop cluster = * Given the choice between a virtual Hadoop and no Hadoop, virtual Hadoo= p is compelling. * Using Apache Hadoop as your MapReduce infrastructure gives you Cloud v= endor independence, and the option of moving to a permanent physical deploy= ment later. * It is the only way to execute the tools that work with Hadoop and the = layers above it in a Cloud environment. - * If you store your persistent data in a cloud-hosted storage infrastruc= ture, analysing the data in the provider's computation infrastructure is th= e most cost-effective way to do so. + * If you store your persistent data in a cloud-hosted storage infrastruc= ture, analyzing the data in the provider's computation infrastructure is th= e most cost-effective way to do so. = - You just need to recognise the limitations and accept them: + You just need to recognize the limitations and accept them: - * Treat the HDFS filesystem and local disks as transient storage; keep t= he persistent data elsewhere. + * For vendors like AWS, treat the HDFS filesystem and local disks as tra= nsient storage; keep the persistent data elsewhere. - * Expect reduced performance and try to compensate by allocating more VM= s. + * For public cloud, expect reduced performance and try to compensate by = allocating more VMs. * Save money by shutting down the cluster when not needed. * Don't be surprised if different instances of the clusters have differe= nt performance, or the a cluster's performance varies from time to time. - * The cost of persistent data will probably be higher than if you built = up a physical cluster with the same amount of storage. This will not be an = issue until your dataset is measure in many Terabytes, or even Petabytes. = - * Over time, dataset size grows, often at a predictable rate. That stora= ge cost may dominate over time. Compress your data even when stored on the = service provider's infrastructure. + * For public cloud, the cost of persistent data will probably be higher = than if you built up a physical cluster with the same amount of storage. Th= is will not be an issue until your dataset is measure in many Terabytes, or= even Petabytes. = + * For public cloud, over time, dataset size grows, often at a predictabl= e rate. That storage cost may dominate over time. Compress your data even w= hen stored on the service provider's infrastructure. = =3D=3D Hosting on local VMs =3D=3D = - As well as large-scale cloud infrastructures, there is another deployment= pattern: local VMs on desktop systems or other development machines. This = is a good tactic if your physical machines run windows and you need to brin= g up a Linux system running Hadoop, and/or you want to simulate the complex= ity of a small Hadoop cluster. + As well as large-scale cloud infrastructures, there is another deployment= pattern (typically for development and testing): local VMs on desktop syst= ems or other development machines. This is a good tactic if your physical m= achines run windows and you need to bring up a Linux system running Hadoop,= and/or you want to simulate the complexity of a small Hadoop cluster. = * Have enough RAM for the VM to not swap. - * Don't try and run more than one VM per physical host, it will only mak= e things slower. = + * Don't try and run more than one VM per physical host with less than 2 = CPU socket, it will only make things slower. = - * use file: URLs to access persistent input and output data. + * use host shared folders to access persistent input and output data. * consider making the default filesystem a file: URL so that all storage= is really on the physical host. It's often faster and preserves data bette= r. = =3D=3D Summary =3D=3D = - You can bring up Hadoop in cloud infrastructures, and sometimes it makes = sense, for development and production. For production use, be aware that th= e differences between physical and virtual infrastructures can threaten you= r data integrity and security - and you must plan for that. = + You can bring up Hadoop in virtualized infrastructures. Sometimes it even= makes sense for public cloud, for development and production. For producti= on use, be aware that the differences between physical and virtual infrastr= uctures could pose additional gotchas to your data integrity and security w= ithout proper planning and provisioning. = =20