hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Virtual Hadoop" by SteveLoughran
Date Wed, 05 Oct 2011 20:36:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Virtual Hadoop" page has been changed by SteveLoughran:

new page on Hadoop in VM/cloud infrastructure

New page:
= Virtual Hadoop =

A recurrent question on the user mailing lists is "can Hadoop be deployed in virtual infrastructures",
or "can you run Hadoop 'in the Cloud'", where cloud means "separate storage and compute services,
public or private".

The answer is "yes, but you need to be aware of the limitations".

== Strengths of VM-hosted Hadoop ==

 * A single image can be cloned, lowering operations costs.
 * Hadoop clusters can be set up on demand.
 * Physical infrastructure can be reused.
 * You only pay for the CPU time you need.
 * The cluster size can be expanded or contracted on demand.

This sounds appealing, and is exactly what [[http://aws.amazon.com/elasticmapreduce/|Amazon
Elastic MapReduce]] offers: a version of Hadoop on a pay-as-you-go basis.

For more customised deployments, [[http://whirr.apache.org/|Apache Whirr]] can be used to
bring up VMs, something documented [[https://ccp.cloudera.com/display/CDHDOC/Whirr+Installation|by
Cloudera]]. There have also been demonstrations of alternate systems running on different
infrastructures, such as shown in [[http://www.slideshare.net/steve_l/farming-hadoop-inthecloud|Farming
Hadoop in the Cloud]].

Does this mean that Hadoop is ideal in Cloud infrastructures? No, ''but it can be done.''

=== Hadoop's Assumptions about its infrastructure ===

Hadoop was written expecting to run in large physical datacenters. Such datacenters and their
physical hardware, storage and networking have some specific features.

 1. A large cluster of physical servers, which may reboot, but generally recover, with all
their local on-server HDD storage.
 1. Non-RAIDed hard disks in the servers; this is the lowest cost per Terabyte of any storage.
It has good (local) bandwidth when retrieving sequential data, but once the disks start seeking
for the next blocks, performance suffers badly.
 1. Dedicated CPUs; the CPU types are known and clusters are (usually) built from homogeneous
hardware.
 1. Servers with monotonically increasing clocks, roughly synchronised via an NTP server.
That is: time goes forward, on all servers simultaneously.
 1. Dedicated network with exclusive use of a high-performance switch, fast 1-10 Gb/s server
Ethernet and faster 10+ Gb/s "backplane" interconnect between racks.
 1. A static network topology: servers do not move around.
 1. Exclusive use of the network by trusted users. 
 1. High-performance infrastructure services aid Hadoop (DNS, reverse DNS, NFS storage for
NameNode snapshots).
 1. The primary failure modes of machines are HDD failures, recurring memory failures,
or overheating damage caused by fan failures.
 1. Machine failures are normally independent, with the exception of the failure of Top of
Rack switches, which can take a whole rack offline. Router/Switch misconfigurations can have
a similar effect.
 1. If the entire datacenter restarts, the filesystem can recover, provided you have set up
the NameNode and Secondary NameNode properly.

=== Hadoop's implementation details ===

These assumptions translate into the following implementation decisions.
 1. HDFS uses local disks for storage, replicating data across machines. 
 1. The MR engine scheduler assumes that the Hadoop work has exclusive use of the server
and tries to keep the disks and CPU as busy as possible.
 1. Leases and timeouts are based on local clocks, not complex distributed-system clocks such
as Lamport Clocks. That holds in the Hadoop layer and in the entire network stack; TCP also
uses local clocks.
 1. Topology scripts can be written to describe the network topology; these are used to place
data and work.
 1. Data can be transmitted between machines unencrypted.
 1. Code running on machines in the cluster (including user-supplied MR jobs), can be assumed
to not be deliberately malicious.
 1. Missing hard disks are usually missing because they have failed, so the data stored on
them should be replicated and the disk left alone.
 1. Servers that are consistently slow to complete jobs should be blacklisted: no new work
should be sent to them. 
 1. The JobTracker should try and keep the cluster as busy as possible, to maximise ROI on
the servers and datacenter.
 1. When a JobTracker has no work to perform, the servers are left idle. 
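
For instance, a topology script (the executable named by the `topology.script.file.name` property) simply maps addresses to rack paths. A minimal sketch, assuming a purely hypothetical addressing convention in which the third octet of the IP address encodes the rack:

```python
import sys

def rack_of(host):
    """Map an address like 10.1.RACK.NODE to a rack path; assumes the
    (hypothetical) convention that the third octet encodes the rack."""
    octets = host.split(".")
    if len(octets) == 4 and all(o.isdigit() for o in octets):
        return "/dc1/rack-%s" % octets[2]
    return "/default-rack"  # names we cannot parse go to the default rack

if __name__ == "__main__":
    # Hadoop passes one or more addresses and reads rack paths from stdout.
    print(" ".join(rack_of(h) for h in sys.argv[1:]))
```

Hadoop uses the returned paths to place replicas on different racks and to schedule work near its data.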

=== How a virtual infrastructure differs from a physical datacenter ===

Hadoop's assumptions about a datacenter do not hold in a virtualized environment.
 1. Storage is usually one or more of transient virtual drives, transient local physical drives,
persistent local virtual drives, or remote SAN-mounted block stores or file systems.
 1. Storage in virtual hard drives may cause a lot of seeking, even if it appears to be sequential
access to the VM.
 1. Networking may be slower and throttled by the infrastructure provider.
 1. Virtual Machines are requested on demand from the infrastructure; the machines can be
allocated anywhere in the infrastructure, possibly on servers running other VMs at the same
time.
 1. The other VMs may be heavy CPU and network users, which can cause the Hadoop jobs to suffer.
Alternatively, the heavy CPU and network load of Hadoop can cause problems for the other users
of the server.
 1. VMs can be suspended and restarted without OS notification; this can cause clocks to move
forward in jumps of many seconds.
 1. Other users on the network may be able to listen to traffic, to disrupt it, and to access
ports that are not authenticating all access.
 1. Some infrastructures may move VMs around; this can actually move clocks backwards when
the new physical host's clock is behind that of the original host.
 1. Replication to (transient) hard drives is no longer a reliable way to persist data.
 1. The network topology is not visible to the Hadoop cluster, though latency and bandwidth
tests may be used to infer "closeness", to build a de-facto topology.
 1. The correct way to deal with a VM that is showing recurring failures is to release the
VM and ask for a new one, instead of blacklisting it.
 1. The JobTracker may want to request extra VMs when there is extra demand.
 1. The JobTracker may want to release VMs when there is idle time.
 1. A failure of the hosting infrastructure can lose all machines simultaneously.
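
A sketch of how "closeness" could be inferred without a visible topology: time TCP connections to the other cluster members and bucket them by round-trip latency. The probed port and the threshold below are illustrative assumptions, not Hadoop behaviour:

```python
import socket, time

def rtt_ms(host, port=22, timeout=2.0):
    """Time a TCP connect: a crude proxy for network distance."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        return None  # unreachable hosts are treated as far away
    return (time.time() - start) * 1000.0

def group_by_closeness(latencies, near_ms=1.0):
    """Split peers into 'near' and 'far' buckets by a latency threshold."""
    near = {h: l for h, l in latencies.items()
            if l is not None and l < near_ms}
    far = {h: l for h, l in latencies.items() if h not in near}
    return near, far
```

The buckets give a de-facto two-level topology, though one that can change whenever the infrastructure moves a VM.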

== Implications ==

Ignoring low-level networking/clock issues, what does this mean?

 1. When you request a VM, its performance may vary from previous requests. This can be due
to CPU differences, or to the other workloads on the same physical host.
 1. There is no point writing topology scripts.
 1. All network ports must be closed by way of firewall and routing information, apart from
those ports critical for Hadoop, which must then run with security on.
 1. All data you wish to keep must be kept on permanent storage: mounted block stores, remote
filesystems or external databases. This goes for both input and output.
 1. People or programs need to track machine failures and react to them by releasing those
machines and requesting new ones.
 1. If the cluster is idle, some machines can be decommissioned.
 1. If the cluster is overloaded, some temporary TaskTracker-only servers can be brought up
for short periods of time, and killed when no longer needed.
 1. If the cluster needs to be expanded for a longer duration, worker nodes acting as both
a DataNode and TaskTracker can be brought up.
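
As an example of keeping output on permanent storage, a job's final step could copy its results off the transient HDFS with `hadoop distcp`. The paths and bucket in the usage example are hypothetical; the sketch assumes the `hadoop` command and an s3n:// filesystem are available on the machine running it:

```python
import subprocess

def distcp_command(src, dest):
    """Build the hadoop distcp invocation that copies src to dest."""
    return ["hadoop", "distcp", src, dest]

def persist_output(src, dest):
    # check_call raises CalledProcessError on failure, so a failed copy
    # is noticed rather than silently losing data.
    subprocess.check_call(distcp_command(src, dest))
```

Usage would be along the lines of `persist_output("hdfs:///user/alice/output", "s3n://example-bucket/output")` (hypothetical paths), run by whatever person or program is tracking job completion.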

The most significant implication is in storage. A core architectural assumption of both Google's
GFS and Hadoop's HDFS is that three-way replication onto local storage is the lowest-cost way
of storing Petabytes of data.

In a cloud infrastructure, this design is no longer valid. If you assume that it still holds, you
will be disappointed when one day all your data disappears. Please do not complain if this
happens after reading this page: you have been warned.

== Why use Hadoop on Cloud Infrastructures then? ==

Having just explained why HDFS does not protect your data when hosted in a cloud infrastructure,
is there any reason to consider it? Yes.

 * Given the choice between a virtual Hadoop and no Hadoop, virtual Hadoop is compelling.
 * Using Apache Hadoop as your MapReduce infrastructure gives you Cloud vendor independence,
and the option of moving to a permanent physical deployment later.
 * It is the only way to execute the tools that work with Hadoop and the layers above it in
a Cloud environment.
 * If you store your persistent data in a cloud-hosted storage infrastructure, analysing the
data in the provider's computation infrastructure is the most cost-effective way to do so.

You just need to recognise the limitations and accept them:
 * Treat the HDFS filesystem and local disks as transient storage; keep the persistent data
elsewhere.
 * Expect reduced performance and try to compensate by allocating more VMs.
 * Save money by shutting down the cluster when not needed.
 * Don't be surprised if different instances of the clusters have different performance, or
if a cluster's performance varies from time to time.
 * The cost of persistent data will probably be higher than if you built up a physical cluster
with the same amount of storage. This will not be an issue until your dataset is measured in
many Terabytes, or even Petabytes.
 * Over time, dataset size grows, often at a predictable rate, and that storage cost may come
to dominate. Compress your data even when stored on the service provider's infrastructure.
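
The growth argument can be made concrete with a rough cost projection. Every number below (price per GB-month, growth rate, compression ratio) is an illustrative assumption, not a provider quote:

```python
# Sketch: projecting monthly storage cost as a dataset grows.

def monthly_cost(gb, price_per_gb_month=0.10, compression_ratio=1.0):
    """Cost of storing `gb` of raw data, after compression."""
    return gb / compression_ratio * price_per_gb_month

def projected_costs(start_gb, growth_per_month_gb, months, **kw):
    """Cost for each of the next `months`, assuming linear growth."""
    return [monthly_cost(start_gb + m * growth_per_month_gb, **kw)
            for m in range(months)]
```

At these (assumed) rates, 1000 GB costs 100 a month uncompressed but only 25 with 4:1 compression, which is why compressing stored data matters even on someone else's infrastructure.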

== Hosting on local VMs ==

As well as large-scale cloud infrastructures, there is another deployment pattern: local VMs
on desktop systems or other development machines. This is a good tactic if your physical machines
run Windows and you need to bring up a Linux system running Hadoop, and/or you want to simulate
the complexity of a small Hadoop cluster.

 * Have enough RAM for the VM to not swap.
 * Don't try to run more than one VM per physical host; it will only make things slower.

 * Use file: URLs to access persistent input and output data.
 * Consider making the default filesystem a file: URL so that all storage is really on the
physical host. It's often faster and preserves data better.
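
Making the default filesystem a file: URL is a core-site.xml change along these lines; `fs.default.name` is the property name used by Hadoop releases of this era (later versions rename it `fs.defaultFS`):

```xml
<!-- Fragment of conf/core-site.xml: use the local filesystem by default. -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>
```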

== Summary ==

You can bring up Hadoop in cloud infrastructures, and sometimes it makes sense, for development
and production. For production use, be aware that the differences between physical and virtual
infrastructures can threaten your data integrity and security, and you must plan for that.
