hadoop-common-user mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: Overhead of Java?
Date Thu, 06 Sep 2007 09:14:35 GMT
Hadoop has a lot of inefficiencies in it still.
Most of them are not related to the language choice.

If you look at what the per node tasks are doing (as opposed to the  
name node and job tracker) you will see that very little real work is  
being done by Hadoop Java code.

Pumping bytes / IO is done in library calls that are native code.
The data node should be able to do its work with a fraction of a node's
CPU resources.

Your inner loop map-reduce code can be coded in C++ if you prefer.
So I don't think the choice of Java represents a real performance
problem.

Depending on your workload, the use of Hadoop may represent a huge
performance gain or loss.  You will need to benchmark it against your
needs.  But I would not be too concerned about Java, and we are running
thousands of servers.


On Sep 6, 2007, at 1:37 AM, Torsten Curdt wrote:

> On 06.09.2007, at 09:56, Pietu Pohjalainen wrote:
> > Jeroen Verhagen wrote:
> >> On 9/5/07, Steve Schlosser <swschlosser@gmail.com> wrote:
> >>
> >>> question, but I was wondering if anyone has a reasonable qualitative
> >>> answer that I can pass on when people ask.
> >>>
> >> Is this question really relevant since Hadoop is designed to run on a
> >> cluster of commodity hardware Google-style? If there were any
> >> difference I'm sure it would be solved by adding 1 machine to the
> >> cluster.
> >>
> >
> >
> > Isn't it about whether to add 30% or 50% more machines? Which is
> > starting to get significant when you think whether to have 1000 or
> > 1500 machines.
>
> A plain Java vs. <some language> discussion is way too simple. I've
> been working on a Java project that way (!!) out-performed a similar
> C++ project. The design and a smart implementation will make more
> difference than just the plain language. Long running vs. short
> running ... all of that has already been said. At least that's my
> experience. That being said, for Hadoop the one-child-JVM-per-job is
> what has quite a bit of overhead. If you are not scared that your
> jobs will tear down your tasktrackers - we have an in-JVM execution
> patch (not submitted yet, though).
> cheers
> --
> Torsten
