Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of colinfreas@gmail.com
 designates 209.85.200.173 as permitted sender)
Message-ID: <b27f65f70806061003h3fcd743auece79b17adff350c@mail.gmail.com>
Date: Fri, 6 Jun 2008 13:03:20 -0400
From: "Colin Freas" <colinfreas@gmail.com>
To: core-user@hadoop.apache.org
Subject: Re: Hadoop Distributed Virtualisation
In-Reply-To: <cbbf4b570806060941j422d42eak89834add7eaab04b@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_4807_802868.1212771800068"
References: <2d2102ba0806060730o3b2a7eb3m68886e5cf5973480@mail.gmail.com>
	 <b27f65f70806060803k8c7f54eu9673e6deb0825117@mail.gmail.com>
	 <2d2102ba0806060919w42e485b8t16ed836fbf040ab7@mail.gmail.com>
	 <cbbf4b570806060941j422d42eak89834add7eaab04b@mail.gmail.com>

------=_Part_4807_802868.1212771800068
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

The MR jobs I'm performing are not CPU intensive, so I've always assumed
that they're more IO bound.  Maybe that's an exceptional situation, but I'm
not really sure.

What I have in mind is a good motherboard with a local IO channel per disk,
each feeding an individual core, with memory partitioned up between them.
I've also heard good things about Intel's next tock vis-a-vis internal
system throughput.

And yes, this would be a task for a paravirtualization system like Xen.
Again, it's just a thought, but with low-end quad-core procs running about
$300, and the potential to cut the number of machines you need to physically
set up by 75%, I'm not sure I'd say it'd only be good for a proof of
concept.

Also, I just set up a dozen-odd boxes that are two generations behind modern
boxes, and promptly blew a fuse.  The TDP on the Xeon 3.06GHz chips I'm
using is 89W.  The TDP on an Intel Q6600 is 65W, and it represents 4 cores.
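
For what it's worth, here is a rough back-of-the-envelope sketch of that
arithmetic in Python.  It uses only the figures quoted in this message and
does not check them against any datasheet.  It also assumes one core per
Xeon chip and one chip per box, which is a simplification, and the function
names are made up for illustration.

```python
# Back-of-the-envelope numbers only. The wattages and core counts are the
# figures quoted in this message, taken as given rather than verified.
XEON_TDP_W = 89.0    # per chip; ASSUMED here to be one core per chip
XEON_CORES = 1
Q6600_TDP_W = 65.0   # per chip, as quoted above
Q6600_CORES = 4


def machines_needed(single_core_boxes: int, cores_per_box: int) -> int:
    """Boxes needed to supply the same core count (ceiling division)."""
    return -(-single_core_boxes // cores_per_box)


def reduction_fraction(single_core_boxes: int, cores_per_box: int) -> float:
    """Fraction of physical machines saved by consolidating onto multi-core boxes."""
    return 1 - machines_needed(single_core_boxes, cores_per_box) / single_core_boxes


def watts_per_core(tdp_watts: float, cores: int) -> float:
    """CPU TDP divided evenly across cores; ignores disks, RAM, and PSU losses."""
    return tdp_watts / cores


if __name__ == "__main__":
    boxes = 12  # "a dozen-odd"
    print("quad-core boxes needed:", machines_needed(boxes, Q6600_CORES))
    print("fewer machines:", reduction_fraction(boxes, Q6600_CORES))
    print("W per core, Xeon:", watts_per_core(XEON_TDP_W, XEON_CORES))
    print("W per core, Q6600:", watts_per_core(Q6600_TDP_W, Q6600_CORES))
```

This only covers CPU TDP.  Whether per-disk IO throughput holds up once the
cores share a box is exactly the part that needs the experiment.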

It's a simple experiment, but I don't have the resources on hand to run it.
I'm curious if anyone has seen the performance impact from the different
setups we're talking about.  I also think you could come close to faking it
with Hadoop config changes.

-Colin


On Fri, Jun 6, 2008 at 12:41 PM, Edward Capriolo <edlinuxguru@gmail.com>
wrote:

> I once asked a wise man in charge of a rather large multi-datacenter
> service, "Have you ever considered virtualization?" He replied, "All
> the CPUs here are pegged at 100%."
>
> There may be applications for this type of processing. I have thought
> about systems like this from time to time. This thinking goes in
> circles. Hadoop is designed for storing and processing on different
> hardware.  Virtualization lets you split a system into sub-systems.
>
> Virtualization is great for proof of concept.
> For example, I have deployed this: I installed VMware with two Linux
> systems on my Windows host and followed a Hadoop multi-system tutorial
> running on two VMware nodes. I was able to get the word count
> application working. I also confirmed that blocks were indeed being
> stored on both virtual systems and that processing was being shared
> via MapReduce.
>
> The processing, however, was slow; of course, this is the fault of
> VMware. VMware has a very high emulation overhead. Xen has less
> overhead. Linux-VServer and OpenVZ use OS-level virtualization (they
> have very little, almost no, overhead). Regardless of how much
> overhead, overhead is overhead. Personally, I find that VMware falls
> short of its promises.
>

------=_Part_4807_802868.1212771800068--