hadoop-common-user mailing list archives

From Karl Anderson <...@monkey.org>
Subject Re: Difference between Hadoop Streaming and "Normal" mode
Date Wed, 20 Aug 2008 01:04:08 GMT

On 12-Aug-08, at 3:33 PM, John DeTreville wrote:

> I think you will find that the Streaming model buys you convenience,
> but costs you performance and generality. I'll let others quantify
> how much of each.

It looks to me like Streaming starts a new executable for each
input split, whereas with Java the same process on a task box
handles several splits.  If this is true, then tasks that need a lot
of setup before handling their input would be inefficient in Streaming:
a task box that received two splits would throw away the setup and
recompute it for the second one.
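
To make the cost concrete, here is a toy model in Python. It is only a sketch of the reasoning above, not of Hadoop's actual scheduling: the names CountingSetup, map_split, run_process_per_split and run_reused_process are all made up for illustration, and the "expensive setup" is just a small dictionary.

```python
class CountingSetup:
    """Hypothetical stand-in for costly initialization (e.g. loading a table)."""

    def __init__(self):
        self.calls = 0  # how many times the setup has been run

    def __call__(self):
        self.calls += 1
        return {"a": 1, "b": 2}


def map_split(lines, table):
    """Emit tab-separated key/value lines, the way a streaming mapper writes to stdout."""
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t{table.get(word, 0)}")
    return out


def run_process_per_split(splits, setup):
    """Model of one fresh process per split: setup is repeated for every split."""
    results = []
    for split in splits:
        table = setup()  # recomputed each time, previous result discarded
        results.extend(map_split(split, table))
    return results


def run_reused_process(splits, setup):
    """Model of one long-lived process handling several splits: setup runs once."""
    table = setup()
    results = []
    for split in splits:
        results.extend(map_split(split, table))
    return results
```

Both functions produce the same output for the same splits; the only difference is how many times the setup runs (once per split versus once in total). In a real streaming mapper the lines would come from standard input rather than from a list.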

More appropriate partitioners would help with this, but I think that by
default it isn't considered a bad thing for a mapper to receive more
than one split.  It's more important to get the first splits to the
mappers, get them running, and keep sending them splits.

I haven't had to look at any of this too closely, so if anyone can  
correct this, I'd appreciate the information.
