hadoop-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Re: Estimating the time of my hadoop jobs
Date Tue, 17 Dec 2013 13:12:39 GMT
Nikhil,

One of the problems you run into with Hadoop in virtual machine
environments is performance degradation when all of the VMs run on the
same physical host. Even though you give each VM 4 GB of RAM and a
virtual CPU and disk, if the virtual machines share physical components
like the processor and the physical storage medium, they compete for
those resources at the physical level. Even on a multi-core host with
multiple disks, where the VMs share as few resources as possible, there
is still a performance hit when the VM's I/O has to pass through the
hypervisor layer - co-scheduling resources with the host and things
like that.

Does that make sense?

It's generally accepted that, due to these issues, Hadoop in virtual
environments does not offer the same performance as a physical Hadoop
cluster. Hadoop runs pretty well even on low-quality hardware, though,
so maybe you can acquire some used desktops, install your favorite
Linux flavor on them, and make a cluster - some people have even run
Hadoop on Raspberry Pi clusters.
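To put the numbers from this thread in perspective, here is a quick
back-of-envelope calculation (a sketch, using only the figures Nikhil
reported: 30 GB processed in 3 hours on a 4-node cluster):

```python
# Effective throughput of the reported job, from the figures in the thread.
data_gb = 30   # sample dataset size reported
hours = 3      # reported wall-clock time
nodes = 4      # reported cluster size

total_mb_per_s = data_gb * 1024 / (hours * 3600)
per_node_mb_per_s = total_mb_per_s / nodes

print(f"aggregate: {total_mb_per_s:.1f} MB/s")   # → aggregate: 2.8 MB/s
print(f"per node:  {per_node_mb_per_s:.1f} MB/s")  # → per node:  0.7 MB/s
```

A single commodity disk can sustain on the order of 100 MB/s of
sequential reads, so under 1 MB/s per node suggests the job is nowhere
near disk-bound hardware limits - consistent with the VM contention
described above.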


*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <Nikhil.Kandoi@emc.com> wrote:

> I know it is foolish of me to ask this, because there are a lot of
> factors that affect it, but why is it taking so much time? Can anyone
> suggest possible reasons, or has anyone faced such an issue before?
>
>
>
> Thanks,
>
> Nikhil Kandoi
>
> P.S – I am using Hadoop-1.0.3 for this application, so I wonder if
> this version has got something to do with it.
>
>
>
> *From:* Azuryy Yu [mailto:azuryyyu@gmail.com]
> *Sent:* Tuesday, December 17, 2013 4:14 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Estimating the time of my hadoop jobs
>
>
>
> Hi Kandoi,
>
> It depends on:
>
> how many cores on each VNode
>
> how complicated your analysis application is
>
>
>
> But I don't think it's normal to spend 3 hrs processing 30 GB of
> data, even on your *not good* hardware.
>
> On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil <Nikhil.Kandoi@emc.com>
> wrote:
>
> Hello everyone,
>
>
>
> I am new to Hadoop and would like to see if I’m on the right track.
>
> Currently I’m developing an application which would ingest logs on
> the order of 60-70 GB of data per day and would then do some analysis
> on them.
>
> Now the infrastructure that I have is a 4-node cluster (all nodes on
> virtual machines); each node has 4 GB of RAM.
>
>
>
> But when I try to run the dataset (which is a sample dataset at this
> point) of about 30 GB, it takes about 3 hrs to process all of it.
>
>
>
> I would like to know whether it is normal for this kind of
> infrastructure to take this amount of time.
>
>
>
>
>
> Thank you
>
>
>
> Nikhil Kandoi
>
>
>
