hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: Hadoop streaming performance problem
Date Tue, 01 Apr 2008 02:43:58 GMT
Beg your pardon, Python is a fast language. Simple operations are
usually considerably more expensive than in lower-level languages, but
at least when it is used by somebody with enough experience, that
doesn't matter too much. Actually, in many practical cases, because of
project deadlines, C++ (and to a lesser extent Java) implementations
end up with the more naive algorithms/designs and get beaten by Python.
In some cases the effect is even stronger when the Python guys cheat
and implement the inner-loop stuff with Pyrex/Cython. (That observation
is not limited to Python; it usually applies to all higher-level
languages. OTOH, I've beaten C++ in a fairly 1:1 development race, so I
tend to write Python.)
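
As a point of reference for what is being compared: a streaming job is
an ordinary program that reads records from stdin and writes
tab-separated key/value lines to stdout, and the reducer receives its
input sorted by key. What follows is only a minimal sketch and is not
taken from this thread; the word-count task and the function names
map_lines and reduce_lines are hypothetical, chosen purely for
illustration.

```python
from itertools import groupby


def map_lines(lines):
    """Map step: yield one "word<TAB>1" record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reduce step: sum the counts for each key.

    Assumes the input is already sorted by key, which is how streaming
    hands records to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In a real job each function would be wrapped in a small script that
feeds it sys.stdin and writes every yielded record, plus a newline, to
sys.stdout. Processing the input as a stream with groupby, rather than
collecting everything into a dict first, is one way to avoid the kind
of simple mistake that slows a streaming job down.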


On Monday, 31.03.2008 at 18:10 -0700, Colin Evans wrote:
> At Metaweb, we did a lot of comparisons between streaming (using Python) 
> and native Java, and in general streaming performance was not much 
> slower than native Java -- most of the slowdown was from Python 
> being a slow language. 
> The main problems with streaming apps that we found are that they are 
> hard to write and there are many ways that you can make simple mistakes 
> in streaming that slow down performance.
> We've been experimenting with embedding JavaScript (Rhino) and Jython 
> for writing jobs, and have found that performance is good and the apps 
> are much easier to write.  The tight Java integration means that 
> performance bottlenecks get rewritten in Java with little sacrifice to 
> development speed.  One of these days we'll open source these frameworks.
> Parand Darugar wrote:
> > Travis Brady wrote:
> >> This brings up two interesting issues:
> >>
> >> 1. Hadoop streaming is a potentially very powerful tool, especially for
> >> those of us who don't work in Java for whatever reason
> >> 2. If Hadoop streaming is "at best a jury rigged solution" then that
> >> should be made known somewhere on the wiki.  If it's really not
> >> supposed to be
> >> used, why is it provided at all?
> >>   
> >
> > A set of reasonable performance tests and results would be very 
> > helpful for people deciding whether to go with streaming or not. 
> > Hopefully we can get some numbers from this thread and publish them? 
> > Has anyone else compared streaming with native Java?
> >
> > Best,
> >
> > Parand
