hadoop-mapreduce-dev mailing list archives

From Saikat Kanjilal <sxk1...@hotmail.com>
Subject RE: Research projects for hadoop
Date Fri, 09 Sep 2011 17:34:20 GMT

How about using VirtualBox and 64-bit CentOS to serve as a Linux container for isolating
map/reduce processes?  I have set this up in the past; it's really easy.
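On the lighter-weight end, a per-task resource cap can be applied without any VM at all; the sketch below uses a plain `ulimit` in a subshell, similar in spirit to what the tasktracker does for child JVMs via the mapred.child.ulimit setting (the 1 GB value here is purely illustrative):

```shell
# Rough sketch: cap a child task's address space with ulimit inside a
# subshell, so the limit applies only to that task and its children.
(
  ulimit -v 1048576          # virtual memory limit in KB (~1 GB), illustrative
  echo "task limit: $(ulimit -v) KB"
  # exec the real task binary here, e.g. the child JVM
)
```

This gives resource limiting but not the namespace isolation a real Linux container (cgroups/LXC) would add.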

> From: evans@yahoo-inc.com
> To: mapreduce-dev@hadoop.apache.org
> Date: Fri, 9 Sep 2011 10:30:37 -0700
> Subject: Re: Research projects for hadoop
> The biggest issue with Xen and other virtualization technologies is that there is often
> an I/O penalty involved in using them.  For many jobs this is not an acceptable trade-off.
> I do know, however, that there has been some discussion about using Linux Containers for
> isolation of Map/Reduce processes.  I don't know if any JIRA has been filed for it or
> not, but they are much lighter weight than Xen and other virtualization tech, because all
> they are really concerned with is resource isolation, not virtualizing an entire operating
> system.
>
> --Bobby Evans
> On 9/9/11 10:58 AM, "Saikat Kanjilal" <sxk1969@hotmail.com> wrote:
> Hi folks, I was looking through the following wiki page: http://wiki.apache.org/hadoop/HadoopResearchProjects
> and was wondering if there's been any work done (or any interest in doing work) on the following:
>
> Integration of virtualization (such as Xen) with Hadoop tools. How does one integrate
> sandboxing of arbitrary user code in C++ and other languages in a VM such as Xen with the
> Hadoop framework? How does this interact with SGE, Torque, Condor? As each individual
> machine has more and more cores/CPUs, it makes sense to partition each machine into
> multiple virtual machines. That gives us a number of benefits:
> - By assigning a virtual machine to a datanode, we effectively isolate the datanode from
>   the load on the machine caused by other processes, making the datanode more
>   responsive/reliable.
> - With multiple virtual machines on each machine, we can lower the granularity of HOD
>   scheduling units, making it possible to schedule multiple tasktrackers on the same
>   machine, improving the overall utilization of the whole cluster.
> - With virtualization, we can easily snapshot a virtual cluster before releasing it,
>   making it possible to re-activate the same cluster in the future and start working from
>   the snapshot.
>
> Provisioning of long-running services via HOD. Work on a computation model for services
> on the grid. The model would include:
> - Various tools for defining clients and servers of the service, and at least a C++ and
>   Java instantiation of the abstractions
> - Logical definitions of how to partition work onto a set of servers, i.e. a generalized
>   shard implementation
> - A few useful abstractions like locks (exclusive and RW, fairness), leader election,
>   transactions
> - Various communication models for groups of servers belonging to a service, such as
>   broadcast, unicast, etc.
> - Tools for assuring QoS, reliability, managing pools of servers for a service with
>   spares, etc.
> - Integration with HDFS for persistence, as well as access to local filesystems
> - Integration with ZooKeeper so that applications can use the namespace
>
> I would like to help out with either a design for the above or prototyping code. Please
> let me know what the process may be to move forward with this.
> Regards