hadoop-hdfs-user mailing list archives

From "Kandoi, Nikhil" <Nikhil.Kan...@emc.com>
Subject RE: Estimating the time of my hadoop jobs
Date Wed, 18 Dec 2013 08:33:15 GMT
Thank you, everyone, for your suggestions.

I think I now see where I was going wrong: not only was I setting up and destroying
the JVM for a single Hadoop job,
I was also creating numerous Hadoop jobs to process different files that could be handled
in one single job.

I will try the changes that I think should solve the problem.


From: Shekhar Sharma [mailto:shekhar2581@gmail.com]
Sent: Tuesday, December 17, 2013 9:12 PM
To: user@hadoop.apache.org
Subject: Re: Estimating the time of my hadoop jobs

Apart from what Devin has suggested, there are other factors worth noting
when you are running your Hadoop cluster on virtual machines.

(1) How many map and reduce slots are there in cluster?

 Since you have not mentioned it, and you are using a 4-node Hadoop cluster, I assume the
Hadoop 1.x default of 2 map slots and 2 reduce slots per node, so a total of 8 map slots
and 8 reduce slots are present.
What does that mean?
It means that at any one time only 8 map tasks and 8 reduce tasks will run in parallel on
your cluster, and the other tasks have to wait.

(2) Since you have not mentioned whether the 30GB of data is made up of a lot of smaller
files (less than the block size) or one bigger file, let us do a simple calculation assuming only
one file of 30GB and a block size of 64MB.

30GB = 30 * 1024 * 1024 * 1024 = 32212254720 bytes

64MB = 64 * 1024 * 1024 = 67108864 bytes

Total number of blocks the data will be broken into = 32212254720 / 67108864 = 480 blocks

Now this means you will be running 480 map tasks (assuming input split size = block
size). But since you have only 8 map slots, only 8 map tasks will run at a time and the others
will be pending.

Assuming all 8 map tasks in a wave finish at the same time, you will have 480/8 = 60 map waves.
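
The arithmetic above can be checked with a short script. This is only a sketch under the
assumptions stated in the message (64MB blocks, input split size equal to block size, 8 map
slots). The function names map_tasks and map_waves are made up for illustration and are not
part of Hadoop. It also shows why many small files hurt: every file needs at least one map
task, however small it is.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the block size assumed in the message


def map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Estimate map tasks as one per input split (split size == block size).

    Every file needs at least one split, even if it is smaller than a block.
    """
    # -(-a // b) is integer ceiling division
    return sum(max(1, -(-size // block_size)) for size in file_sizes)


def map_waves(tasks, map_slots):
    """Rounds of execution needed when only `map_slots` tasks run at once."""
    return -(-tasks // map_slots)


# One 30GB file, 8 map slots (the case worked through above)
one_big_file = [30 * 1024**3]
tasks = map_tasks(one_big_file)
print(tasks, map_waves(tasks, map_slots=8))  # 480 60

# The same 30GB stored as 30720 files of 1MB each
small_files = [1024**2] * 30720
tasks = map_tasks(small_files)
print(tasks, map_waves(tasks, map_slots=8))  # 30720 3840
```

With the same amount of data, the small-file layout needs 64 times as many map tasks, and
each of them pays the task start-up cost.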

 (3) As you know, each task runs in a separate JVM; that is, for every task
a JVM is created and then torn down after the task finishes. Creating and destroying
JVMs is also a bottleneck.

So try reusing the same JVM. There is an option that lets you reuse the JVM.

(4) Since you are working with such big data, try using a combiner.

(5) Also try compressing the data, the intermediate output of the mappers, and the reducer output
   ---First try with sequence files
   ---Then try with the Snappy compression codec
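
As a rough illustration of points (3) and (5), the script below prints the corresponding
settings as -D generic options. The property names are the Hadoop 1.x ones; the values are
only starting points, not tested recommendations for this cluster. It assumes the job's
driver goes through ToolRunner/GenericOptionsParser (otherwise -D options are not picked up)
and that the Snappy native library is installed on the nodes. The names JOB_TUNING and
as_generic_options are made up for this sketch.

```python
# Hadoop 1.x property names; the values are illustrative starting points.
JOB_TUNING = {
    # -1 lets a JVM be reused for any number of tasks of the same job
    "mapred.job.reuse.jvm.num.tasks": "-1",
    # compress the intermediate map output before it is shuffled to reducers
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}


def as_generic_options(props):
    """Turn a property dict into a flat list of '-D', 'key=value' arguments."""
    args = []
    for key, value in props.items():
        args.extend(["-D", f"{key}={value}"])
    return args


print(" ".join(as_generic_options(JOB_TUNING)))
```

The printed options go after the driver class name and before the job's own arguments on the
hadoop jar command line.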

If the above pointers bring the run time down to about 1 hour or so,
then with the same 4-node cluster and Hadoop running on separate physical machines you will
surely see the job complete in 15-30 minutes. [Please refer to Devin's comments]

My suggestion is to get optimal performance on your virtual machines first and then move to a real
Hadoop cluster. You will surely see the performance improvement.

Som Shekhar Sharma

On Tue, Dec 17, 2013 at 6:42 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:

One of the problems you run into with Hadoop in Virtual Machine environments is performance
issues when they are all running on the same physical host. With a VM, even though you are
giving them 4 GB of RAM, and a virtual CPU and disk, if the virtual machines are sharing physical
components like processor and physical storage medium, they compete for resources at the physical
level. Even if you have the VM on a single host, or on a multi-core host with multiple disks
and they are sharing as few resources as possible, there will still be a performance hit when
the VM information has to pass through the hypervisor layer - co-scheduling resources with
the host and things like that.

Does that make sense?

It's generally accepted that due to these issues, Hadoop in virtual environments does not
offer the same performance benefits as a physical Hadoop cluster. It can be used pretty well
with even low-quality hardware though, so maybe you can acquire some used desktops and
install your favorite Linux flavor on them and make a cluster - some people have even run
Hadoop on Raspberry Pi clusters.

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Tue, Dec 17, 2013 at 6:26 AM, Kandoi, Nikhil <Nikhil.Kandoi@emc.com> wrote:
I know it is foolish of me to ask this, because there are a lot of factors that affect it,
but why is it taking so much time? Can anyone suggest possible reasons for it, or has anyone
faced such an issue before?

Nikhil Kandoi
P.S. - I am using Hadoop-1.0.3 for this application, so I wonder if this version has got something
to do with it.

From: Azuryy Yu [mailto:azuryyyu@gmail.com]
Sent: Tuesday, December 17, 2013 4:14 PM
To: user@hadoop.apache.org
Subject: Re: Estimating the time of my hadoop jobs

Hi Kandoi,
It depends on:
how many cores there are on each VNode
how complicated your analysis application is

But I don't think it's normal to spend 3 hrs processing 30GB of data, even on your *not good* hardware.

On Tue, Dec 17, 2013 at 6:39 PM, Kandoi, Nikhil <Nikhil.Kandoi@emc.com> wrote:
Hello everyone,

I am new to Hadoop and would like to see if I'm on the right track.
Currently I'm developing an application which would ingest logs on the order of 60-70 GB of data/day
and would then do
some analysis on them.
Now the infrastructure that I have is a 4-node cluster (all nodes on virtual machines); all
nodes have 4GB RAM.

But when I try to run the dataset (which is a sample dataset at this point) of about 30 GB,
it takes about 3 hrs to process all of it.

I would like to know whether it is normal for this kind of infrastructure to take this amount of time.

Thank you

Nikhil Kandoi
